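RTX 3090 environment setup on Ubuntu 18.04: CUDA 11.1 + cuDNN 8.0.5 + tf 2.4.0
I used to run animal trajectory tracking with deeplabcut on a 2080 Ti. I recently gave in and bought a 3090, and below is how I set up the environment on it. (There really are a lot of pitfalls!!)
OS: Ubuntu 18.04
GPU: RTX 3090
CUDA: 11.1
cuDNN: the build for CUDA 11.1 (8.0.5)
tensorflow: 2.4.0 (2.4.1 also works)
python: 3.7
I'm used to installing software with conda, but CUDA 11.1 is not available through conda, so it has to be downloaded from the official website. We still use a conda virtual environment, though, because it makes it easy to try out different versions.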
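I. NVIDIA driver installation
1. I recommend downloading the driver and installing it yourself. For my 3090 I used driver 460.56:
sudo chmod +x NVIDIA-Linux-x86_64-460.56.run
sudo ./NVIDIA-Linux-x86_64-460.56.run
*If you install the driver here, then during the CUDA installation later, change the 455 driver entry from [X] to [ ], meaning the driver bundled with the CUDA installer is not installed.
For other GPUs, look up the supported driver version on the NVIDIA website.
Check that the installation succeeded: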
nvidia-smi
Fri Mar 19 13:30:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 39% 53C P0 72W / 370W | 0MiB / 24245MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
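The same check can be scripted. Below is a minimal sketch that pulls the driver version and the CUDA version out of the nvidia-smi banner shown above. The function name parse_smi_header is my own, not part of any NVIDIA tool. Note that the "CUDA Version" in this banner is the highest CUDA version the driver supports, not the toolkit you installed, which is why it reads 11.2 here even though this guide installs CUDA 11.1.

```python
import re
import shutil
import subprocess


def parse_smi_header(text):
    """Return (driver_version, cuda_version) from the nvidia-smi banner, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        # No nvidia-smi on PATH usually means the driver is not installed.
        print("nvidia-smi not found")
    else:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True
        ).stdout
        print(parse_smi_header(out))
```

If this prints None although nvidia-smi exists, look at the raw output: the driver is most likely not loaded.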
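II. Installing CUDA 11.1
1. As above, first make the file executable with chmod, then install it with sudo.
CUDA 11.1, Ubuntu 18.04, runfile: this is the download link for the runfile I used. If you need a different version, pick it yourself.
The CUDA runfile name says driver 455.23, but we have already installed a newer driver, so choose not to install the bundled one.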
sudo ./cuda_11.1.0_455.23.05_linux.run
CUDA is installed to /usr/local/cuda-11.1; you can go there and have a look.
2. Add environment variables
Open .bashrc:
vim ~/.bashrc
Append to the end of the file:
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}
Or use the paths with the explicit version number:
export PATH="/usr/local/cuda-11.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"
Then be sure to source the file so the changes take effect in the current shell:
source ~/.bashrc
Check that the installation succeeded:
nvcc -V
(tf2.4.1) hjh@hjhPC:/usr/local$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
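If you want a script to confirm that the nvcc on your PATH is the 11.1 toolkit, here is a small sketch that reads the release number from output like the above. The helper name parse_nvcc_release is made up for this example, and it assumes the "release X.Y" wording seen in my output.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) from the 'release X.Y' part of nvcc -V, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        # Usually means PATH does not include the CUDA bin directory yet.
        print("nvcc not found on PATH")
    else:
        out = subprocess.run(
            ["nvcc", "-V"], capture_output=True, text=True
        ).stdout
        print(parse_nvcc_release(out))
```

If nvcc is not found, the PATH line in .bashrc has probably not been sourced in the current shell.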
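III. Installing cuDNN
1. Download cuDNN
Get it from the cuDNN page on the NVIDIA website. If you don't have an account, register one as prompted, then download the matching cuDNN version.
2. Extract in the current directory
Do not extract inside the CUDA 11.1 directory! The cuDNN archive also unpacks into a directory named cuda.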
tar -zxvf cudnn-11.1-linux-x64-v8.0.5.39.tgz
3. Copy the extracted files into the corresponding CUDA directories
sudo cp -P cuda/lib64/* /usr/local/cuda/lib64/
sudo cp cuda/include/* /usr/local/cuda/include/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
Note! /usr/local/ contains both cuda and cuda-11.1, and copying into either one works. cuda is probably a symlink to cuda-11.1.
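To confirm the symlink guess rather than assume it, a few lines are enough. This is only a sketch; describe_path is a hypothetical helper name, and /usr/local/cuda is the path from this guide.

```python
import os


def describe_path(path):
    """Return (is_symlink, fully_resolved_path) for the given path."""
    return os.path.islink(path), os.path.realpath(path)


if __name__ == "__main__":
    # On the setup described here, the second element should end in cuda-11.1.
    print(describe_path("/usr/local/cuda"))
```

If the first value is True and the resolved path ends in cuda-11.1, both directories are the same place and it does not matter which one you copy into.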
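4. Verify the installation
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
(I found this step online, but newer cuDNN versions may no longer have this entry in the file. It did not work for me, even though cuDNN was in fact installed correctly.)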
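As an alternative to the grep, here is a sketch that reads the version macros from the header directly. It assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain #define lines, which is what the grep above looks for. If your header is laid out differently, the function returns None. The name parse_cudnn_version is my own.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn_version.h text, or None."""
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


if __name__ == "__main__":
    header = Path("/usr/local/cuda/include/cudnn_version.h")
    if header.is_file():
        print(parse_cudnn_version(header.read_text()))
    else:
        print("header not found:", header)
```

A missing header points to the include copy step; a None result means the macros are not in the expected form.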
IV. Installing tensorflow 2.4.0
tf 2.4.0 runs on the GPU as is, so don't worry; I did not download a separate GPU build of tensorflow. I installed from the Tsinghua mirror with:
pip install tensorflow==2.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
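Once the install finishes, it is worth checking that tensorflow actually sees the GPU before starting a long training run. A minimal sketch follows; visible_gpus is a name I made up, while tf.config.list_physical_devices is the TensorFlow 2.x call for listing devices.

```python
def visible_gpus():
    """Return the names of GPUs tensorflow can see, or None if tf is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not installed in this environment")
    elif not gpus:
        print("tensorflow imported, but no GPU is visible")
    else:
        print("GPUs:", gpus)
```

An empty list usually means a driver, CUDA, or cuDNN library failed to load; the warnings printed during the import say which one.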
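V. Installing deeplabcutcore (skip this if you only want an RTX 3090 training environment)
DLC does not support tf 2.x, so I used the workaround from the DLC GitHub and ran training with deeplabcutcore. Install: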
pip install deeplabcutcore
pip install tf_slim
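Remember!!! Install and run with python 3.7!!! Other versions throw all kinds of fatal errors. I created a python 3.7 virtual environment with conda and installed everything with that environment's pip. If you run into other problems, ask in the DLC community on GitHub; everyone there is very patient.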
—————————————————————————
Result:
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2021-03-19 13:33:18.111767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
>>> import deeplabcutcore
>>>
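Training was a bit slow at first, but later it outpaced the 2080 Ti, at roughly 60 iterations per second. GPU utilization hovers around 95%, so I'd call the installation a success!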
—————————————————————————
Some pitfalls:
1.Not creating XLA devices, tf_xla_enable_xla_devices not set
Initializing ResNet
/home/hjh/miniconda3/envs/tf2.4.1/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer_v1.py:1719: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
warnings.warn('`layer.apply` is deprecated and '
WARNING:tensorflow:From /home/hjh/miniconda3/envs/tf2.4.1/lib/python3.7/site-packages/deeplabcutcore/pose_estimation_tensorflow/nnet/losses.py:38: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2021-03-19 09:57:14.579849: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-19 09:57:14.580951: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64:
2021-03-19 09:57:14.580964: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-19 09:57:14.580975: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hjhPC): /proc/driver/nvidia/version does not exist
2021-03-19 09:57:14.582528: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Loading ImageNet-pretrained resnet_50
2021-03-19 09:57:14.881725: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-19 09:57:15.070222: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-03-19 09:57:15.119667: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2899885000 Hz
INFO:tensorflow:Restoring parameters from /home/hjh/miniconda3/envs/test2/lib/python3.7/site-packages/deeplabcut/pose_estimation_tensorflow/models/pretrained/resnet_v1_50.ckpt
Training parameter:
Starting training....
iteration: 50 loss: 0.1026 lr: 0.005
iteration: 100 loss: 0.0309 lr: 0.005
iteration: 150 loss: 0.0269 lr: 0.005
iteration: 200 loss: 0.0265 lr: 0.005
2.Please also try adding directory that contains libnvidia-ml.so to your system PATH
Please also try adding directory that contains libnvidia-ml.so to your system PATH
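Both of the above mean the GPU driver was not installed properly. In the log under pitfall 1, the telling lines are libcuda.so.1 failing to load and /proc/driver/nvidia/version not existing. Training then falls back to the CPU and is very slow: CPU usage of the process reached 1000% while the GPU stayed at 0%. (Run top to check CPU usage.)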
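The two symptoms from that log can be checked directly before launching training. This is a rough sketch with hypothetical helper names (driver_loaded, can_load). It tests whether the kernel driver's version file exists and whether libcuda.so.1 can be loaded.

```python
import ctypes
from pathlib import Path


def driver_loaded(version_file="/proc/driver/nvidia/version"):
    """True if the NVIDIA kernel driver exposes its version file."""
    return Path(version_file).is_file()


def can_load(library_name="libcuda.so.1"):
    """True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    print("kernel driver loaded:", driver_loaded())
    print("libcuda.so.1 loadable:", can_load())
```

If either line prints False, reinstall the driver before looking for problems in tensorflow.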
3. If it reports that library libcudart.so.11.0 could not be opened, check your environment variables.
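One way to check the environment variables is to look for the library in every directory listed in LD_LIBRARY_PATH. The sketch below does that; find_library is a helper name of my own, not a system utility.

```python
import os
from pathlib import Path


def find_library(prefix, search_path):
    """List files starting with `prefix` in each directory of a path string."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and Path(directory).is_dir():
            hits.extend(sorted(str(p) for p in Path(directory).glob(prefix + "*")))
    return hits


if __name__ == "__main__":
    # An empty list means no directory in LD_LIBRARY_PATH contains libcudart.
    print(find_library("libcudart.so", os.environ.get("LD_LIBRARY_PATH", "")))
```

An empty result means the export lines from section II are missing from .bashrc or have not been sourced in the shell you launch python from.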
4. The following message appears over and over:
successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
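It does not cause an error. As the message itself says, the NUMA node could not be read from SysFS, so node zero is used instead. At first I felt the iterations weren't much faster than on the 2080 Ti and mistook this for an error, when the GPU was in fact already running at full load. I recommend monitoring GPU usage in real time: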
watch -n 2 nvidia-smi
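If you would rather log the numbers than watch the screen, nvidia-smi can print them as CSV with its --query-gpu option. The sketch below parses that output; parse_gpu_rows and the dictionary keys are my own naming. It assumes every field is a plain integer, so a GPU that reports [N/A] for a field would need extra handling.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Turn 'util, used, total' lines into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        rows.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found")
    else:
        out = subprocess.run(QUERY, capture_output=True, text=True).stdout
        print(parse_gpu_rows(out))
```

A utilization near 0 during training means the job is running on the CPU, as in pitfalls 1 and 2.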
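I actually ran into many other pitfalls, but I didn't write down the errors and fixes at the time. I'll add them later when I find them or hit them again.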