PyTorch: specifying the device, PyTorch reserved

Reposted

蓝月亮 2023-06-20 21:28:02


Thanks to the original author for sharing this experience!

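1. After installation, import torch fails with ImportError: dlopen: cannot load any more object with static TLS

Fix: Many answers say to put import torch before import cv2, but that did not help in my case. What finally worked was importing torch inside Jupyter Notebook. I connect to the lab server through MobaXterm; import torch fails both in the console and in Spyder, and only works in Jupyter.

Update: changing the backend also resolves it.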

2. Concatenating two Variables: this should simply be c = torch.cat([a, b], dim=0), but it raises

TypeError: cat received an invalid combination of arguments - got (tuple, int), but expected one of:

 * (sequence[torch.cuda.FloatTensor] tensors)
 * (sequence[torch.cuda.FloatTensor] tensors, int dim)
      didn't match because some of the arguments have invalid types: (tuple, int)

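Fix: From the message I first assumed that cat does not accept a tuple as input. The real problem is that a and b have different types, for example a is a torch.cuda.DoubleTensor while b is a torch.cuda.FloatTensor. Converting a and b to the same type resolves it.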
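A minimal sketch of the type fix described above, run on the CPU so that it works without a GPU. The tensor shapes and the choice to cast b to a's dtype (rather than the other way round) are assumptions made for illustration. Recent PyTorch releases may promote mixed dtypes in torch.cat automatically, so the explicit cast mainly matters on old versions and keeps the result dtype predictable.

```python
import torch

# Two tensors with different dtypes, mirroring the Double/Float mismatch.
a = torch.zeros(2, 3, dtype=torch.float64)
b = torch.ones(2, 3, dtype=torch.float32)

# Cast b to a's dtype before concatenating along dim 0.
b = b.to(a.dtype)
c = torch.cat([a, b], dim=0)
```

Printing a.dtype and b.dtype before the call is usually the quickest way to confirm that a type mismatch is the cause.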

3. Training fails with RuntimeError: tensors are on different GPUs

This happens when only one of the training data (data) and the model (model) has been moved with .cuda() and the other has not. Move both.

Fix:

data = data.cuda()
model = model.cuda()
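The same idea can be sketched in a device-agnostic form, assuming PyTorch 0.4 or later, where torch.device and .to() are available. The nn.Linear model and the random batch are placeholders and not part of the original post.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)   # placeholder model
data = torch.randn(8, 4).to(device)  # placeholder batch

# Both operands now live on the same device, so the forward pass works.
out = model(data)
```

Using a single device variable for both the model and every batch makes it harder to move one and forget the other.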

4. Training fails with TypeError: argument 0 is not a Variable

The input data is not a Variable and has to be wrapped in one.

Fix:

from torch.autograd import Variable
data = Variable(data).cuda()

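5. Training with a custom loss fails with AttributeError: 'MyLoss' object has no attribute '_forward_pre_hooks'

Judging from the message, the loss fails before forward is even reached. The original post links to a reference on how to define a custom loss in PyTorch.

Fix: call super(MyLoss, self).__init__() in the loss's __init__ method.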
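A minimal sketch of a custom loss with the constructor call in place. The class name MyLoss comes from the error message; the mean-squared-error body is only an assumed stand-in for whatever the real loss computes.

```python
import torch
import torch.nn as nn

class MyLoss(nn.Module):
    def __init__(self):
        # Without this call nn.Module never creates its internal
        # dictionaries (hooks, parameters, buffers), hence the error.
        super(MyLoss, self).__init__()

    def forward(self, pred, target):
        # Placeholder computation: mean squared error.
        return ((pred - target) ** 2).mean()

criterion = MyLoss()
loss = criterion(torch.tensor([1.0, 2.0]), torch.tensor([0.0, 0.0]))
```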

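6. Training runs fine, but validation fails with CUDA Error: Out of Memory

The message points to memory, so my first reaction was to reduce the batch size. That reportedly helps, but even with a batch size of 1 it still failed. Looking further, I found that the Variables were created without disabling gradient tracking (the input and target do not need gradients). A search turned up two approaches: requires_grad=False, or volatile=True, with the second generally recommended. However, I am using PyTorch 0.4, where volatile is no longer supported.

Fix: use with torch.no_grad() in place of volatile. That is, if the original code is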

target_var = torch.autograd.Variable(target.cuda(async=True))

then for versions before 0.4 use

target_var = torch.autograd.Variable(target.cuda(async=True),volatile=True)

and for version 0.4 and later use

with torch.no_grad():
    # volatile is dropped here; async= was renamed to non_blocking= in 0.4
    target_var = torch.autograd.Variable(target.cuda(non_blocking=True))

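That mostly solved the problem. If it persists, the code may be accumulating something across iterations, such as a running acc or loss. Summing the loss tensor itself keeps every iteration's graph alive, so extract the value from the loss first (loss.data[0] before 0.4, loss.item() from 0.4 on) and del variables once they are no longer needed.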
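Both points can be sketched together in a small validation loop, assuming PyTorch 0.4 or later. The linear model, the MSE criterion and the random batches are placeholders, and the loop runs on the CPU so that it works without a GPU.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)      # placeholder model
criterion = nn.MSELoss()     # placeholder loss
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

model.eval()
total_loss = 0.0
with torch.no_grad():        # no graph is built inside this block
    for inputs, targets in batches:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        # Accumulate a plain Python float, not the loss tensor.
        total_loss += loss.item()

mean_loss = total_loss / len(batches)
```

In a training loop, where no_grad cannot be used, the loss.item() step is what keeps the running total from holding on to each iteration's graph.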

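7. Error: 'BatchNorm2d' object has no attribute 'track_running_stats'

This is a version mismatch. The track_running_stats attribute was introduced in PyTorch 0.4, so the error typically appears when the code or saved model comes from a different PyTorch version than the one installed.

Fix: switch to the matching PyTorch version, for example by downgrading to PyTorch 0.3.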

8. Error: "Expected object of type torch.DoubleTensor but found type torch.FloatTensor for argument #2 'weight'"

Fix: the input is double precision while the model weights are single precision. Adding model.double() resolves it.
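A minimal sketch of the mismatch and the fix, using a placeholder nn.Linear model and a float64 input. The alternative in the final comment (casting the input down instead of the model up) is not from the original post; it is a common choice when double precision is not actually needed.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                      # parameters default to float32
x = torch.randn(5, 3, dtype=torch.float64)   # double-precision input

# Convert the model's parameters to float64 so they match the input.
model = model.double()
y = model(x)

# Alternative (assumption, not from the post): keep the model in float32
# and cast the input instead, i.e. call the float32 model on x.float().
```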

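9. Error: Expected object of type torch.DoubleTensor but found type torch.cuda.DoubleTensor for argument #2 'weight'

The code originally read inputs.cuda(), outputs.cuda(). Calling .cuda() on a tensor returns a copy on the GPU and leaves the original on the CPU, so the result has to be assigned back.

Fix: write inputs = inputs.cuda(), outputs = outputs.cuda() instead.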
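The rebinding can be sketched as follows, written with .to(device) (PyTorch 0.4 or later) so that it also runs on a machine without a GPU. The tensor names and shapes are placeholders.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

inputs = torch.randn(4, 3)
targets = torch.randn(4, 1)

# Wrong: the moved tensor is returned and immediately discarded,
# so the name `inputs` still refers to the original tensor.
inputs.to(device)

# Right: rebind the names to the tensors returned by .to() / .cuda().
inputs, targets = inputs.to(device), targets.to(device)
```

Note the asymmetry with modules: model.cuda() moves the module's parameters in place, which is why forgetting the assignment goes unnoticed for models but not for tensors.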

10. In debug mode the program hangs in the first epoch, but a normal run works without problems.

Fix: set the DataLoader's num_workers to 1.
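A sketch of where the setting goes, with a placeholder TensorDataset. The post uses a value of 1; this sketch uses 0 instead, which loads batches in the main process without spawning any worker, on the assumption that avoiding worker processes altogether is at least as debugger-friendly. Treat that substitution as a guess to try, not as the author's fix.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 16 samples with 4 features and a binary label.
dataset = TensorDataset(torch.randn(16, 4), torch.randint(0, 2, (16,)))

# num_workers=0 loads data in the main process (assumed debug setting);
# raise it again for normal training runs.
debug_workers = 0
loader = DataLoader(dataset, batch_size=4, shuffle=False,
                    num_workers=debug_workers)

n_batches = sum(1 for _ in loader)
```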

11、RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

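The cause is that the training code calls loss.backward() at least twice on the same graph.

Fix: recompute output before the second loss.backward(), that is, add output = model(input) before it and compute the loss again from the new output.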
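A minimal sketch of the fix, with a placeholder model, loss and batch. Each backward call is preceded by its own forward pass, so each one runs on a freshly built graph.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)      # placeholder model
criterion = nn.MSELoss()     # placeholder loss
x, y = torch.randn(4, 3), torch.randn(4, 1)

# First forward/backward pass.
output = model(x)
loss = criterion(output, y)
loss.backward()

# Second pass: recompute the output so a new graph exists to backpropagate
# through; the first graph's buffers were freed by the first backward().
output = model(x)
loss = criterion(output, y)
loss.backward()
```

As the error message itself suggests, passing retain_graph=True to the first backward() is the alternative when the same graph really does need to be traversed twice, at the cost of holding its buffers in memory.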
