1.Watch out for the torch version

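The code requires torch>=1.1, and each torch release is tied to specific CUDA versions, so check both before installing. I first installed 0.4.0, which did not work. Then I installed 1.4.0, but it needed a newer CUDA than I had. So I went back to 1.0.0, which matched my cuda-9.0, only to find out that the program needs at least 1.1.0. That cost me two or three hours.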

To fix a CUDA/PyTorch version mismatch, first check which CUDA version you have, then find the torch version that matches it and install that one.
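A quick way to see what is actually installed is to ask torch itself. The following is a minimal sketch, not part of the original setup: `parse_version` and `meets_minimum` are hypothetical helper names, and the 1.1.0 minimum is the requirement quoted above.

```python
import re


def parse_version(text):
    """Turn '1.1.0' or '1.4.0+cu92' into a 3-tuple of ints such as (1, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("not a version string: %r" % (text,))
    parts = [int(p) for p in match.group(0).split(".")][:3]
    parts += [0] * (3 - len(parts))  # pad '1.1' to (1, 1, 0)
    return tuple(parts)


def meets_minimum(installed, minimum="1.1.0"):
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    import torch
except ImportError:
    print("torch is not installed in this interpreter")
else:
    # torch.version.cuda is the CUDA version torch was built against
    # (None for CPU-only builds); compare it with your local CUDA install.
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    print("new enough:", meets_minimum(torch.__version__))
```

Run it with the same interpreter you use for training (here /usr/bin/python3.5), since each interpreter can have its own torch.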


Along the way you may well hit the following error, a download failure caused by a slow network connection:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 397, in _error_catcher
    yield
  File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 479, in read
    data = self._fp.read(amt)
  File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read
    data = self.__fp.read(amt)
  File "/usr/lib/python3.5/http/client.py", line 458, in read
    n = self.readinto(b)
  File "/usr/lib/python3.5/http/client.py", line 498, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.5/ssl.py", line 791, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.5/ssl.py", line 575, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/base_command.py", line 188, in main
    status = self.run(options, args)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/commands/install.py", line 345, in run
    resolver.resolve(requirement_set)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/legacy_resolve.py", line 196, in resolve
    self._resolve_one(requirement_set, req)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/legacy_resolve.py", line 359, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/legacy_resolve.py", line 307, in _get_abstract_dist_for
    self.require_hashes
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/operations/prepare.py", line 199, in prepare_linked_requirement
    progress_bar=self.progress_bar
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 1064, in unpack_url
    progress_bar=progress_bar
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 924, in unpack_http_url
    progress_bar)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 1152, in _download_http_url
    _download_url(resp, link, content_file, hashes, progress_bar)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 861, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/utils/hashes.py", line 75, in check_against_chunks
    for chunk in chunks:
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 829, in written_chunks
    for chunk in chunks:
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/utils/ui.py", line 156, in iter
    for x in it:
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/download.py", line 818, in resp_read
    decode_content=False):
  File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 531, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 496, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/pip/_vendor/urllib3/response.py", line 402, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='pypi.tuna.tsinghua.edu.cn', port=443): Read timed out.

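pip also has a timeout. Raising it from 100 to 1000 fixed this for me.

The fix looks like this:

sudo /usr/bin/python3.5 -m pip install --default-timeout=1000 --no-cache-dir torch==1.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

If the Tsinghua mirror does not work, switch to the Aliyun mirror. By this point I had spent three hours just trying to download torch: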

sudo /usr/bin/python3.5 -m pip install  --default-timeout=1000 --no-cache-dir torch==1.1.0 -i https://mirrors.aliyun.com/pypi/simple
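The "try one mirror, then the next" routine can be scripted. This is only a sketch under assumptions: `build_pip_command` and `install_with_fallback` are hypothetical helper names, the flags are the ones used in the commands above, and the script installs into whichever interpreter runs it (add sudo or a virtual environment as your setup requires).

```python
import subprocess
import sys

# The two mirrors mentioned in this post, in the order they were tried.
MIRRORS = [
    "https://pypi.tuna.tsinghua.edu.cn/simple",
    "https://mirrors.aliyun.com/pypi/simple",
]


def build_pip_command(requirement, index_url, timeout=1000):
    """Build the same pip call as above for the current interpreter."""
    return [
        sys.executable, "-m", "pip", "install",
        "--default-timeout=%d" % timeout,
        "--no-cache-dir",
        requirement,
        "-i", index_url,
    ]


def install_with_fallback(requirement, mirrors=MIRRORS, runner=subprocess.call):
    """Try each mirror in order; return the first one that works, else None."""
    for index_url in mirrors:
        if runner(build_pip_command(requirement, index_url)) == 0:
            return index_url
    return None
```

Calling `install_with_fallback("torch==1.1.0")` would then run pip against Tsinghua first and fall back to Aliyun if pip exits with an error.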


2.error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Judging from the error, some supporting libraries are probably not installed:

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Most of the fixes I found say this one command is enough:

sudo apt-get install python3.5-dev

That alone did not fix it for me. Note that the version number in the package name must match your own Python version.


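After running the command above, I also installed every Python package listed in the requirements folder, and then the build went through. So it may not only be Ubuntu system libraries that are missing; some Python packages can be missing too.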
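The python3.5-dev package is what provides Python.h, the header gcc needs to compile extension modules. The post does not show the full compiler output, so treating a missing Python.h as the cause is an assumption; the sketch below (with hypothetical helper names) simply checks whether the header exists for the interpreter that runs it.

```python
import os
import sysconfig


def python_header_path():
    """Where this interpreter expects its C header Python.h to live."""
    return os.path.join(sysconfig.get_paths()["include"], "Python.h")


def has_python_headers():
    """True if the development headers are installed for this interpreter."""
    return os.path.isfile(python_header_path())


print(python_header_path(), "found" if has_python_headers() else "MISSING")
```

If it prints MISSING, installing the -dev package for that exact Python version is the first thing to try.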



3.error: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/easy-install.pth'

This one is simple, chmod takes care of it:

sudo chmod 777 /usr/local/lib/python3.5/dist-packages
sudo chmod 777 /usr/local/lib/python3.5/dist-packages/*

4.error: [Errno 13] Permission denied: '/usr/local/bin/convert-onnx-to-caffe2'

Fix:

sudo chmod 777 /usr/local/bin
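Before changing permissions it can help to see which install targets are actually blocked. A small sketch, assuming the two directories from the error messages above; `unwritable` is a hypothetical helper name.

```python
import os

# The two locations named in the Permission denied errors above.
TARGETS = [
    "/usr/local/lib/python3.5/dist-packages",
    "/usr/local/bin",
]


def unwritable(paths):
    """Return the paths the current user cannot write to (or that do not exist)."""
    return [p for p in paths if not os.access(p, os.W_OK)]


for path in unwritable(TARGETS):
    print("no write access:", path)
```

Anything it prints is a directory that setup.py will fail on for the current user.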


5.Fixing fatal error: torch/extension.h: No such file or directory

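This error typically shows up when the torch version does not match what the code and your CUDA setup expect.

In my case torch was probably still 0.4.0. Upgrading it to 1.1.0 fixed the problem.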
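To confirm whether the installed torch ships the header at all, you can search its include directories. This is a sketch under assumptions: `find_header` is a hypothetical helper, and it relies on `torch.utils.cpp_extension.include_paths()` to report where torch keeps its headers.

```python
import os


def find_header(include_dirs, relative_path="torch/extension.h"):
    """Return the first existing copy of the header, or None."""
    for directory in include_dirs:
        candidate = os.path.join(directory, *relative_path.split("/"))
        if os.path.isfile(candidate):
            return candidate
    return None


try:
    from torch.utils.cpp_extension import include_paths
except ImportError:
    print("torch (or its cpp_extension module) is not importable")
else:
    print("torch/extension.h:", find_header(include_paths()) or "not found")
```

If the header is not found, the compiler error is expected and upgrading torch is the fix to try.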


6.from .. import deform_conv_cuda

This one was really frustrating. Running test.py kept throwing this error and I could not get rid of it. Other people's blog posts say to run:

python setup.py develop

and that should be it. But when I ran setup.py I got another, equally painful error, which is the next item. The fix is described there too.
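This import fails when the compiled deform_conv_cuda extension is not there, which is why building with setup.py is the suggested fix. A rough sketch for checking whether the build produced anything, assuming you run it from the mmdetection checkout and that the extension lives somewhere under the mmdet folder; `find_built_extensions` is a hypothetical helper.

```python
import os


def find_built_extensions(root, name="deform_conv_cuda"):
    """List compiled extension files under root whose name starts with `name`."""
    hits = []
    for folder, _dirs, files in os.walk(root):
        for filename in files:
            if filename.startswith(name) and filename.endswith((".so", ".pyd")):
                hits.append(os.path.join(folder, filename))
    return sorted(hits)


print(find_built_extensions("mmdet") or "no compiled deform_conv_cuda found under ./mmdet")
```

An empty result means the build step never completed, so the import error is a symptom rather than the real problem.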


7.':/usr/local/cuda-10.0/nvcc': No such file or directory

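Taken literally, the error says nvcc cannot be found in the CUDA folder. But the nvcc file really does exist; it is the path shown in the error that is wrong.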

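According to other people, the fix is simple: just edit the bashrc file.

Open a terminal, append the export line to ~/.bashrc, and reload it:

echo 'export CUDA_HOME=/usr/local/cuda-9.0' >> ~/.bashrc
source ~/.bashrc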

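According to them that is all it takes, but I fought with it for a whole day without success and was ready to give up. My server has both CUDA 10.0 and 10.1 installed, so the path sometimes pointed at 10.0 and sometimes at 10.1, which made things even more confusing.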

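Look again at the path inside the quotes: it has an extra colon (:) at the front. Following other people's advice, I removed the colon and re-entered the path. It looked right, but it still did not work for me. On top of that, I was using the lab server the whole time and could not connect to it over a shell from home. I kept typing ./pycharm.sh in the terminal with no response, then had to look up PyCharm's process ID and kill the process. The whole thing was exasperating.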

What finally worked was starting a fresh project: I created a Python 3.5 virtual environment, installed every package in the requirements folder again, and ran setup.py develop. After that, test.py ran normally too.

To this day I do not understand why the fix that works for everyone else would not work on my machine.
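The stray colon in the error path suggests checking what CUDA_HOME really contains in the environment the build runs in. The following is a sketch, not a confirmed diagnosis: `cuda_home_problems` is a hypothetical helper, and it assumes the standard toolkit layout with nvcc at CUDA_HOME/bin/nvcc.

```python
import os


def cuda_home_problems(cuda_home):
    """Return a list of human-readable problems with a CUDA_HOME value."""
    if not cuda_home:
        return ["CUDA_HOME is not set"]
    problems = []
    if ":" in cuda_home:
        problems.append("CUDA_HOME contains ':' (it must be one directory, not a PATH-style list)")
    if cuda_home != cuda_home.strip():
        problems.append("CUDA_HOME has leading or trailing whitespace")
    # Assumption: standard CUDA toolkit layout, nvcc lives in <CUDA_HOME>/bin.
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.isfile(nvcc):
        problems.append("nvcc not found at " + nvcc)
    return problems


for problem in cuda_home_problems(os.environ.get("CUDA_HOME")):
    print(problem)
```

Run it from the same shell (or the same PyCharm run configuration) that launches setup.py, because an IDE started earlier may still hold an old value of the variable.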


At this point setup.py runs without problems.

Then I ran test.py. Here is the result of a quick test with the Mask Scoring R-CNN model (x101_64x4d_fpn_1x):

[Image: detection result from test.py]

It made skinny me look this fat.


Problems I ran into when training on my own dataset:

8.AssertionError: annotation file format <class 'list'> not supported

For training I switched to my own data and changed the paths of the JSON annotation files:

/usr/bin/python3.5 tools/train_chicken.py configs/mask_rcnn_x101_64x4d_fpn_1x.py --gpus 1 --validate --work_dir /work_dir

The JSON paths are set in configs/mask_rcnn_x101_64x4d_fpn_1x.py, here:

data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train_via_region_data.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val_via_region_data.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/test_via_region_data.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))

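The error says something is wrong with my annotations. My plan was to download the COCO JSON files and compare them with mine.


The message itself says the top level of my JSON file was a list, which the COCO loader does not accept. In my case the underlying cause was that my images were annotated with polygon regions, while COCO boxes are rectangles stored as just four numbers. Switching to rectangles fixed it.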
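Comparing against the COCO files mostly comes down to checking the top-level layout. A minimal sketch, assuming the usual COCO layout of one JSON object with images, annotations and categories; `coco_layout_problems` and `check_file` are hypothetical helper names.

```python
import json

# Top-level keys a COCO-style detection file is expected to have.
REQUIRED_KEYS = ("images", "annotations", "categories")


def coco_layout_problems(data):
    """Describe what is wrong with the top-level layout of a loaded JSON file."""
    if not isinstance(data, dict):
        return ["top level is %s, expected a JSON object (dict)" % type(data).__name__]
    return ["missing top-level key: " + key for key in REQUIRED_KEYS if key not in data]


def check_file(path):
    """Load an annotation file and report layout problems (empty list means OK)."""
    with open(path) as handle:
        return coco_layout_problems(json.load(handle))
```

For example, `check_file("annotations/train_via_region_data.json")` on a file whose top level is a list would report exactly the situation behind this AssertionError.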


9.KeyError: 'categories'

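Still a JSON problem. After switching to rectangles, the error now complained about categories. I was still working on this at the time of writing: you either have to modify the COCO categories, or relabel your own class as one of the classes in the 81-class COCO setup. My original class was chicken, which COCO does not have, so I changed it to bird.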

You also need to add a categories entry to the JSON file.
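Adding the missing entry can be done with a few lines once the file is loaded as a dict. This is a sketch with a hypothetical helper `ensure_categories`; the bird entry uses COCO's id 16, and the supercategory value follows the COCO convention.

```python
def ensure_categories(dataset, categories):
    """Add a COCO-style 'categories' list to dataset if it has none."""
    if "categories" not in dataset:
        dataset["categories"] = [dict(c) for c in categories]
    return dataset


# COCO's own entry for the class used in this post.
BIRD = {"id": 16, "name": "bird", "supercategory": "animal"}

dataset = ensure_categories({"images": [], "annotations": []}, [BIRD])
print(dataset["categories"])
```

After that, write the dict back with `json.dump` so the training script sees the new key.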


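Edit voc_classes in class_names.py under mmdetection/mmdet/core/evaluation so that it lists the class names of the dataset you want to train on. Note that even if there is only one class, it still needs a trailing comma, otherwise you get an error:


[Image: edited class_names.py]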
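The trailing comma matters because of how Python parses parentheses, assuming the class names are written as a tuple. A tiny illustration with hypothetical function names:

```python
def wrong_classes():
    return ('bird')     # parentheses only: this is the string 'bird'


def right_classes():
    return ('bird',)    # trailing comma: a tuple holding one class name


print(type(wrong_classes()).__name__, list(wrong_classes()))  # str ['b', 'i', 'r', 'd']
print(type(right_classes()).__name__, list(right_classes()))  # tuple ['bird']
```

Without the comma, any code that loops over the class names sees four one-letter "classes" instead of one.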


The web page I uploaded can build a COCO-format dataset, so just use that. Once the COCO dataset has been generated, everything runs normally.


10.ValueError: need at least one array to concatenate


After building the COCO dataset, the following command fails:

/usr/bin/python3.5 tools/train_chicken.py configs/mask_rcnn_x101_64x4d_fpn_1x.py

The error is:

Traceback (most recent call last):
  File "tools/train_chicken.py", line 142, in <module>
    main()
  File "tools/train_chicken.py", line 138, in main
    meta=meta)
  File "/home/zlee/下载/mmdetection-master/mmdet/apis/train.py", line 111, in train_detector
    meta=meta)
  File "/home/zlee/下载/mmdetection-master/mmdet/apis/train.py", line 225, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/zlee/.local/lib/python3.5/site-packages/mmcv/runner/runner.py", line 359, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/zlee/.local/lib/python3.5/site-packages/mmcv/runner/runner.py", line 259, in train
    for i, data_batch in enumerate(data_loader):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 493, in __init__
    self._put_indices()
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 591, in _put_indices
    indices = next(self.sample_iter, None)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/sampler.py", line 172, in __iter__
    for idx in self.sampler:
  File "/home/zlee/下载/mmdetection-master/mmdet/datasets/loader/sampler.py", line 63, in __iter__
    indices = np.concatenate(indices)
ValueError: need at least one array to concatenate

It is a ValueError. Not solved yet.
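The traceback ends in np.concatenate(indices) inside the sampler, which fails this way when it is handed an empty list, so one thing worth ruling out is that the dataset ended up with no usable images. This is an unconfirmed guess, not a diagnosis from the post. The sketch below uses a hypothetical helper, `images_with_annotations`, to list the images that at least one annotation refers to, using exact id matching (so the integer 0 and the string "0" do not match).

```python
def images_with_annotations(dataset):
    """Return the images whose 'id' is referenced by at least one annotation."""
    referenced = {ann.get("image_id") for ann in dataset.get("annotations", [])}
    return [img for img in dataset.get("images", []) if img.get("id") in referenced]


# Hypothetical data: the image id is an int, the annotation stores it as a string.
mismatched = {"images": [{"id": 0}], "annotations": [{"image_id": "0"}]}
print(len(images_with_annotations(mismatched)))  # 0
```

If the count is zero for your training file, the ids or categories in the JSON are the place to look before touching the training code.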


11.KeyError: 'category_id'

Usually this means a key is being accessed that was never defined.

The category id was missing from my JSON file. Add it and the error goes away.

"annotations": [{
    "id": 0,
    "image_id": "0",
    "segmentation": [47, 49, 192, 49, 192, 179, 47, 179],
    "area": 18850,
    "bbox": [47, 49, 145, 130],
    "iscrowd": 0
},
{
    "id": 1,
    "image_id": "0",
    "segmentation": [45, 304, 187, 304, 187, 563, 45, 563],
    "area": 36778,
    "bbox": [45, 304, 142, 259],
    "iscrowd": 0
},

Adding category_id is enough. The id itself has to be looked up in COCO; mine is bird, whose id is 16.

"annotations": [{
    "id": 0,
    "image_id": "0",
    "segmentation": [47, 49, 192, 49, 192, 179, 47, 179],
    "area": 18850,
    "bbox": [47, 49, 145, 130],
    "iscrowd": 0,
    "category_id": 16
},
{
    "id": 1,
    "image_id": "0",
    "segmentation": [45, 304, 187, 304, 187, 563, 45, 563],
    "area": 36778,
    "bbox": [45, 304, 142, 259],
    "iscrowd": 0,
    "category_id": 16
},
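With many annotations, adding the field by hand gets tedious. A small sketch that fills it in after loading the JSON, assuming every annotation belongs to the same class; `add_missing_category_id` is a hypothetical helper and 16 is the bird id mentioned above.

```python
import copy


def add_missing_category_id(annotations, category_id=16):
    """Return a copy of the annotations where each entry has a category_id."""
    fixed = copy.deepcopy(annotations)
    for ann in fixed:
        # Only fill the gap; an existing category_id is left untouched.
        ann.setdefault("category_id", category_id)
    return fixed


sample = [{"id": 0, "image_id": "0", "bbox": [47, 49, 145, 130], "iscrowd": 0}]
print(add_missing_category_id(sample))
```

The function works on a copy, so the original list stays unchanged until you decide to write the result back to the file.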