RTX 3090 environment setup on Ubuntu 18.04: CUDA 11.1 + cuDNN 8.0.5 + TF 2.4.0

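I used to run animal trajectory tracking with DeepLabCut on a 2080 Ti, and recently I couldn't resist upgrading to a 3090. Below is how I set up the environment for the RTX 3090. (There are a lot of pitfalls!!)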

System: Ubuntu 18.04
GPU: RTX 3090
CUDA: 11.1
cuDNN: 8.0.5 (the build for CUDA 11.1)
TensorFlow: 2.4.0 (2.4.1 also works)
Python: 3.7

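I am used to installing software with conda, but I could not get CUDA 11.1 through conda, so I downloaded it from the NVIDIA website instead. A conda virtual environment is still worth using, because it makes it easy to try out different package versions.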

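I. Installing the NVIDIA driver

1. I recommend downloading and installing the driver yourself. The driver I use for my 3090 is 460.56:

Official 460.56 driver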

chmod +x NVIDIA-Linux-x86_64-460.56.run
sudo ./NVIDIA-Linux-x86_64-460.56.run

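* If you install the driver here, then during the CUDA installation later, change the 455 driver entry from [X] to [ ], which means the driver bundled with the CUDA installer will not be installed.
For other GPUs, search the NVIDIA website for a supported driver version.
Check that the installation succeeded: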

nvidia-smi
Fri Mar 19 13:30:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:01:00.0 Off |                  N/A |
| 39%   53C    P0    72W / 370W |      0MiB / 24245MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
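
If you would rather check the driver from a script than read the table by eye, here is a minimal sketch. It assumes nvidia-smi is on PATH and uses its --query-gpu and --format=csv,noheader options. The function names parse_gpu_csv and query_gpus are my own, not part of any NVIDIA tool.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, driver_version, memory.total' CSV rows into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a name/driver/memory row
        name, driver, memory = fields
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus


def query_gpus():
    """Return the list of GPUs, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_csv(result.stdout)


# None here means the driver tools are not usable from this shell.
print(query_gpus())
```

A result of None or an empty list is a sign to go back and reinstall the driver before moving on to CUDA.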

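II. Installing CUDA 11.1

1. As above, first make the file executable with chmod, then install with sudo.

CUDA 11.1, Ubuntu 18.04, runfile: this is the download link for the runfile I used; pick another version if you need one.
The CUDA runfile name mentions driver 455.23, but we have already installed a newer driver, so choose not to install the bundled one.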

sudo ./cuda_11.1.0_455.23.05_linux.run

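CUDA is installed under /usr/local/ in cuda-11.1; you can go there and take a look.

2. Add the environment variables

Open .bashrc (no sudo needed, since the file belongs to your own user):

vim ~/.bashrc

Append the following at the end of the file: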

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}

Or use the versioned path instead:

export PATH="/usr/local/cuda-11.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"

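Then be sure to source the file so the changes take effect in the current shell (source is a shell builtin, so run it without sudo):

source ~/.bashrc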
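
To confirm that the exports actually reached the current environment, a small check like the one below can help. It is only a sketch: the helper names are hypothetical, and the default cuda_home assumes the versioned install path /usr/local/cuda-11.1 used above. Pass /usr/local/cuda instead if you exported the CUDA_HOME variant.

```python
import os


def path_entries(value):
    """Split a colon-separated path variable, dropping empty entries."""
    return [entry for entry in value.split(":") if entry]


def cuda_on_paths(environ, cuda_home="/usr/local/cuda-11.1"):
    """Return (bin_on_PATH, lib64_on_LD_LIBRARY_PATH) for the given CUDA root."""
    bin_ok = f"{cuda_home}/bin" in path_entries(environ.get("PATH", ""))
    lib_ok = f"{cuda_home}/lib64" in path_entries(environ.get("LD_LIBRARY_PATH", ""))
    return bin_ok, lib_ok


# Check the environment of the process running this script.
print(cuda_on_paths(os.environ))
```

Run it from the same shell (and the same conda environment) that you will train in, since each shell has its own copy of these variables.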

Check that the installation succeeded:

nvcc -V
(tf2.4.1) hjh@hjhPC:/usr/local$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
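
Note that nvidia-smi above reports CUDA Version 11.2 while nvcc reports release 11.1. That is expected: the nvidia-smi figure is the highest CUDA version the driver supports, and nvcc reports the toolkit that is actually installed. The sketch below compares the two from their text output. The function names are my own, and the sample strings are copied from the outputs shown in this post.

```python
import re


def smi_cuda_version(smi_output):
    """Highest CUDA version the driver supports, from the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def nvcc_release(nvcc_output):
    """Installed toolkit release, from the output of nvcc -V."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_supported(smi_output, nvcc_output):
    """True if the toolkit is not newer than what the driver supports."""
    driver_max = smi_cuda_version(smi_output)
    toolkit = nvcc_release(nvcc_output)
    if driver_max is None or toolkit is None:
        return None  # could not parse one of the two outputs
    return toolkit <= driver_max


smi_line = "| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |"
nvcc_line = "Cuda compilation tools, release 11.1, V11.1.74"
print(toolkit_supported(smi_line, nvcc_line))  # True
```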

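III. Installing cuDNN

1. Download cuDNN

NVIDIA's cuDNN page: if you don't have an account, register one by following the prompts, then download the cuDNN version that matches your CUDA.

2. Extract in the current directory

Do not extract inside the cuda-11.1 directory! The directory that cuDNN extracts to is also called cuda.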

tar -zxvf cudnn-11.1-linux-x64-v8.0.5.39.tgz

3. Copy the extracted files into the matching CUDA folders

sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

Note: /usr/local/ contains both cuda and cuda-11.1, and either one works as the destination, since cuda is most likely a symlink to cuda-11.1.

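4. Verify the installation

cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

(I found this step online. In cuDNN 8 the version macros are defined in cudnn_version.h; older guides grep cudnn.h, which no longer contains them. The command did not work for me when I tried it, but cuDNN was in fact installed correctly.)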
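
If the grep prints nothing, the same check can be done with a short script, which also shows clearly when the header is simply missing. This is a sketch under the assumption that the header sits at /usr/local/cuda/include/cudnn_version.h and defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integers. The function names are hypothetical.

```python
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"#define\s+{macro}\s+(\d+)", header_text)
        if match is None:
            return None  # macro not found in this file
        parts.append(int(match.group(1)))
    return tuple(parts)


def read_cudnn_version(path="/usr/local/cuda/include/cudnn_version.h"):
    """Return the version tuple, or None if the header cannot be read."""
    try:
        with open(path) as header:
            return cudnn_version(header.read())
    except OSError:
        return None


# None means the header was not copied to this location or has no version macros.
print(read_cudnn_version())
```

For the cuDNN build used in this post, I would expect this to report version 8.0.5.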

IV. Installing TensorFlow 2.4.0

The plain tensorflow 2.4.0 package runs on the GPU just fine, so don't worry: I did not install a separate GPU build. I used the Tsinghua mirror:

pip install tensorflow==2.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
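
After the install, it is worth asking TensorFlow directly whether it can see the GPU. The sketch below uses tf.config.list_physical_devices, which is part of the TF 2.x API. The wrapper function visible_gpus is my own and returns None when TensorFlow is not importable in the current environment.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow is installed but sees no GPU; training would run on the CPU")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

An empty list usually points back to the driver, CUDA, or cuDNN steps above rather than to TensorFlow itself.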

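V. Installing deeplabcutcore (skip this if you only want an RTX 3090 training environment)

DLC did not support TF 2.x at the time of writing, so I followed the workaround from the DLC GitHub and ran training with deeplabcutcore. Install: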

pip install deeplabcutcore
pip install tf_slim

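Remember!!! You must install and run this with Python 3.7!!! Every other version I tried threw all sorts of fatal errors. I created a Python 3.7 virtual environment with conda and installed the packages with that environment's pip. If you run into other problems, ask in the DLC community on GitHub; people there are very patient.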
—————————————————————————
Result:

Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2021-03-19 13:33:18.111767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
>>> import deeplabcutcore
>>>

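Training is a bit slow at first, but later it is faster than on the 2080 Ti, at roughly 60 iterations per second. GPU utilization hovers around 95%, so I'd call the installation a success!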

—————————————————————————
Some pitfalls:
1. Not creating XLA devices, tf_xla_enable_xla_devices not set

Initializing ResNet
/home/hjh/miniconda3/envs/tf2.4.1/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer_v1.py:1719: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
WARNING:tensorflow:From /home/hjh/miniconda3/envs/tf2.4.1/lib/python3.7/site-packages/deeplabcutcore/pose_estimation_tensorflow/nnet/losses.py:38: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2021-03-19 09:57:14.579849: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-19 09:57:14.580951: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64:
2021-03-19 09:57:14.580964: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-19 09:57:14.580975: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hjhPC): /proc/driver/nvidia/version does not exist
2021-03-19 09:57:14.582528: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Loading ImageNet-pretrained resnet_50
2021-03-19 09:57:14.881725: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-19 09:57:15.070222: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-03-19 09:57:15.119667: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2899885000 Hz
INFO:tensorflow:Restoring parameters from /home/hjh/miniconda3/envs/test2/lib/python3.7/site-packages/deeplabcut/pose_estimation_tensorflow/models/pretrained/resnet_v1_50.ckpt
Training parameter:
Starting training....
iteration: 50 loss: 0.1026 lr: 0.005
iteration: 100 loss: 0.0309 lr: 0.005
iteration: 150 loss: 0.0269 lr: 0.005
iteration: 200 loss: 0.0265 lr: 0.005

2. Please also try adding directory that contains libnvidia-ml.so to your system PATH

Please also try adding directory that contains libnvidia-ml.so to your system PATH

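Both of the above mean the GPU driver is not installed properly, so training falls back to the CPU. In the first log, the telling lines are the failure to load libcuda.so.1 and the failed call to cuInit. Training is very slow in this state: the process sits at around 1000% CPU while GPU utilization stays at 0%. (Run top to check CPU usage.)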
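
The two symptoms in that log can be tested without starting a training run. This is a minimal sketch: it checks for /proc/driver/nvidia/version (the path TensorFlow's diagnostic mentions) and tries to load libcuda.so.1 with ctypes. The function name driver_status and the dictionary keys are my own.

```python
import ctypes
import os


def driver_status(proc_path="/proc/driver/nvidia/version"):
    """Report whether the NVIDIA kernel module and libcuda.so.1 are available."""
    status = {"kernel_module": os.path.exists(proc_path)}
    try:
        ctypes.CDLL("libcuda.so.1")  # the driver library TensorFlow failed to load
        status["libcuda"] = True
    except OSError:
        status["libcuda"] = False
    return status


# Both values should be True on a machine with a working driver.
print(driver_status())
```

If either value is False, reinstalling the driver is a better first step than changing anything in the Python environment.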

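3. If TensorFlow reports that it cannot open the library libcudart.so.11.0, check your environment variables (LD_LIBRARY_PATH in particular).
4. This message appears repeatedly: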

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

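It does not cause an error. TensorFlow could not read a valid NUMA node for the GPU from sysfs, so it falls back to node 0. At first the iterations did not seem much faster than on the 2080 Ti, so I mistook this message for an error, but the GPU was in fact already running at full load. I recommend watching GPU usage in real time: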

watch -n 2 nvidia-smi

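I ran into many other pitfalls too, but I did not write down the errors and fixes as I went. I'll add them later if I find them again or hit them again.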