PyTorch with RTX 3090 support

Reposted

mob6454cc6f6c1c 2024-09-13 01:02:03


RTX 3090 environment setup on Ubuntu 18.04: CUDA 11.1 + cuDNN 8.0.5 + TF 2.4.0

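I used to run animal trajectory tracking with DeepLabCut on a 2080 Ti, and recently couldn't resist upgrading to a 3090. Below is how I set up the environment for the RTX 3090. (There really are a lot of pitfalls!)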

OS: Ubuntu 18.04
GPU: RTX 3090
CUDA: 11.1
cuDNN: the build for CUDA 11.1 (8.0.5)
TensorFlow: 2.4.0 (2.4.1 also works)
Python: 3.7

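I'm used to installing software with conda, but CUDA 11.1 could not be installed through conda at the time, so it has to be downloaded from the NVIDIA website. We still want a conda virtual environment, though, because it makes it easy to try out different package versions.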

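I. Installing the NVIDIA driver

1. I recommend downloading and installing the driver yourself. The driver I use for my 3090 is 460.56:

Official 460.56 driver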

chmod +x NVIDIA-Linux-x86_64-460.56.run
sudo ./NVIDIA-Linux-x86_64-460.56.run

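* If you install the driver here, then in the CUDA installer later on, change the 455 driver entry from [X] to [ ], which means the driver bundled with the CUDA package is not installed.
For other GPUs, search the NVIDIA website for a supported driver version.
Check that the installation succeeded: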

nvidia-smi

Fri Mar 19 13:30:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:01:00.0 Off |                  N/A |
| 39%   53C    P0    72W / 370W |      0MiB / 24245MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
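The banner line of this output can also be checked from a script. Below is a minimal sketch, assuming the banner keeps the layout shown above; the function name `parse_smi_header` is made up for illustration. Note that the "CUDA Version" printed by nvidia-smi is the highest CUDA version the driver supports, not the toolkit that is installed, which is why it reads 11.2 here even though CUDA 11.1 is installed in the next section.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, max_cuda_version) from nvidia-smi's banner, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    return match.groups() if match else None


# Banner line copied from the output above.
sample = "| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |"
print(parse_smi_header(sample))  # ('460.56', '11.2')
```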

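II. Installing CUDA 11.1

1. As above, first make the file executable with chmod, then install it with sudo

CUDA 11.1, Ubuntu 18.04, runfile: this is the download link for the runfile I used; pick a different version if you need one.
The CUDA file name mentions driver 455.23, but we have already installed a newer driver, so choose not to install the bundled one.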

sudo ./cuda_11.1.0_455.23.05_linux.run

CUDA is installed under /usr/local/ in cuda-11.1; you can go in and have a look.

2. Add the environment variables

Open .bashrc:

vim ~/.bashrc

Append to the end of the file:

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
export PATH=${CUDA_HOME}/bin:${PATH}

Or use the versioned directory explicitly:

export PATH="/usr/local/cuda-11.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"

Then be sure to source the file so the changes take effect in the current shell:

source ~/.bashrc
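To confirm that the variables really reached the current environment, a small check like the one below can help. It is only a sketch: `missing_cuda_entries` is a hypothetical helper, and the default `/usr/local/cuda` is assumed to match the CUDA_HOME used above (pass `/usr/local/cuda-11.1` if you exported the versioned paths instead).

```python
import os


def missing_cuda_entries(environ, cuda_home="/usr/local/cuda"):
    """List (variable, directory) pairs that are expected but absent."""
    wanted = {
        "PATH": os.path.join(cuda_home, "bin"),
        "LD_LIBRARY_PATH": os.path.join(cuda_home, "lib64"),
    }
    missing = []
    for var, directory in wanted.items():
        # Split the variable into its entries and ignore trailing slashes.
        entries = [e.rstrip("/") for e in environ.get(var, "").split(os.pathsep) if e]
        if directory not in entries:
            missing.append((var, directory))
    return missing


# An empty list means both directories are present in this process's environment.
print(missing_cuda_entries(os.environ))
```

Run it from the same shell (and the same conda environment) in which you intend to start training, since that is the environment TensorFlow will inherit.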

Check that the installation succeeded:

nvcc -V

(tf2.4.1) hjh@hjhPC:/usr/local$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
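If you switch between several toolkits, it can be handy to check which nvcc is first on PATH and which release it reports. The sketch below assumes the "release X.Y" wording shown in the output above; both function names are invented for this example.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_nvcc_release():
    """Return the release of the nvcc found on PATH, or None if there is none."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


# Line copied from the output above.
sample = "Cuda compilation tools, release 11.1, V11.1.74"
print(parse_nvcc_release(sample))  # (11, 1)
print(installed_nvcc_release())    # None when nvcc is not on PATH
```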

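III. Installing cuDNN

1. Download cuDNN

cuDNN on the NVIDIA website: if you don't have an account, register one as prompted, then download the matching cuDNN version.

2. Extract in the current directory

Do not extract inside the CUDA 11.1 directory! The cuDNN archive also extracts to a directory named cuda.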

tar -zxvf cudnn-11.1-linux-x64-v8.0.5.39.tgz

3. Copy the extracted files into the corresponding CUDA directories

sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

Note! /usr/local/ contains both a cuda and a cuda-11.1 directory, and either one works as the target. cuda is probably a symlink to cuda-11.1.
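Whether /usr/local/cuda really is a symlink to the versioned directory can be verified instead of guessed. A minimal sketch, with `resolve_cuda_link` being a hypothetical helper name:

```python
import os


def resolve_cuda_link(path="/usr/local/cuda"):
    """Return the real directory behind `path`, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    # realpath follows symlinks, so a link to cuda-11.1 resolves to that directory.
    return os.path.realpath(path)


print(resolve_cuda_link())
```

If the printed path ends in cuda-11.1, copying the cuDNN files into either directory puts them in the same place.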

4. Verify the installation

cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

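(I found this step online. It produced nothing for me, even though cuDNN was in fact installed correctly. Since cuDNN 8 the version macros live in cudnn_version.h rather than cudnn.h, so older variants of this command that grep cudnn.h print nothing.)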
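The same check can be done in Python by reading the three version macros from the header. This is a sketch under the assumption that the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL on their own `#define` lines, as cuDNN 8 does; `parse_cudnn_version` is an illustrative name, and the sample text stands in for the real file.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn_version.h contents, or None."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            rf"^#define\s+CUDNN_{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        fields[name] = int(match.group(1))
    return (fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"])


# Stand-in for the relevant lines of /usr/local/cuda/include/cudnn_version.h.
sample = """#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""
print(parse_cudnn_version(sample))  # (8, 0, 5)
```

To run it against the real installation, read the header with `open("/usr/local/cuda/include/cudnn_version.h").read()` and pass the result to the function.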

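IV. Installing TensorFlow 2.4.0

The plain tensorflow 2.4.0 package runs on the GPU without any problem, so don't worry; I did not download a separate GPU build of TensorFlow. I used the Tsinghua mirror: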

pip install tensorflow==2.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

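V. Installing deeplabcutcore (skip this if you only need an RTX 3090 training environment)

DLC does not support TF 2.x, so I use the workaround from the DLC GitHub repository and run training with deeplabcutcore. Install: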

pip install deeplabcutcore
pip install tf_slim

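Remember: install and run everything with Python 3.7! Every other version gave me all sorts of fatal errors. I created a Python 3.7 virtual environment with conda and installed the packages with that environment's pip. If you run into other problems, ask the DLC community on GitHub; people there are very patient.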
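A quick guard at the top of a training script catches the wrong interpreter before any imports fail. This is just a sketch; `is_python_37` is a hypothetical helper, and the 3.7 requirement is the one stated above.

```python
import sys


def is_python_37(version_info=sys.version_info):
    """True when the interpreter is some 3.7.x release."""
    return tuple(version_info[:2]) == (3, 7)


if not is_python_37():
    # Warn instead of exiting so the check can be dropped into any script.
    print("Warning: expected Python 3.7, running", sys.version.split()[0])
```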
—————————————————————————
Result:

Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2021-03-19 13:33:18.111767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
>>> import deeplabcutcore
>>>

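Training is a little slow at first, but it later outpaces the 2080 Ti at roughly 60 iterations per second. GPU utilization hovers around 95%, so I'd call the installation a success!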
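A successful import does not by itself prove that TensorFlow can see the GPU. The sketch below asks TensorFlow directly through `tf.config.list_physical_devices`, which is available in TF 2.1 and later; the wrapper name `visible_gpus` is made up here, and the import is guarded so the snippet also runs where TensorFlow is not installed.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


# An empty list means TensorFlow loaded but will train on the CPU.
print(visible_gpus())
```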

—————————————————————————
Some pitfalls:
1.Not creating XLA devices, tf_xla_enable_xla_devices not set

Initializing ResNet
/home/hjh/miniconda3/envs/tf2.4.1/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer_v1.py:1719: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
WARNING:tensorflow:From /home/hjh/miniconda3/envs/tf2.4.1/lib/python3.7/site-packages/deeplabcutcore/pose_estimation_tensorflow/nnet/losses.py:38: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2021-03-19 09:57:14.579849: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-19 09:57:14.580951: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64:
2021-03-19 09:57:14.580964: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-19 09:57:14.580975: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hjhPC): /proc/driver/nvidia/version does not exist
2021-03-19 09:57:14.582528: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Loading ImageNet-pretrained resnet_50
2021-03-19 09:57:14.881725: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-19 09:57:15.070222: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-03-19 09:57:15.119667: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2899885000 Hz
INFO:tensorflow:Restoring parameters from /home/hjh/miniconda3/envs/test2/lib/python3.7/site-packages/deeplabcut/pose_estimation_tensorflow/models/pretrained/resnet_v1_50.ckpt
Training parameter:
Starting training....
iteration: 50 loss: 0.1026 lr: 0.005
iteration: 100 loss: 0.0309 lr: 0.005
iteration: 150 loss: 0.0269 lr: 0.005
iteration: 200 loss: 0.0265 lr: 0.005

2.Please also try adding directory that contains libnvidia-ml.so to your system PATH

Please also try adding directory that contains libnvidia-ml.so to your system PATH

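Both of the above mean the GPU driver was not installed properly. Training falls back to the CPU and is very slow: top showed the process at about 1000% CPU while the GPU stayed at 0%. (Run top to check CPU usage.) In the first log the telltale lines are the libcuda.so.1 and cuInit failures; the "Not creating XLA devices" line on its own is only informational.

3. If you are told that the library libcudart.so.11.0 cannot be opened, check your environment variables.
4. The following message appears repeatedly: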
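Because these warnings are easy to miss among the other log lines, a saved training log can be scanned for them. This is a rough sketch: the marker strings are taken from the two logs above, the list is certainly not exhaustive, and `gpu_failure_lines` is a name invented for this example.

```python
# Substrings copied from the failing logs above (not an exhaustive list).
FAILURE_MARKERS = (
    "Could not load dynamic library",
    "failed call to cuInit",
    "kernel driver does not appear to be running",
    "libnvidia-ml.so",
)


def gpu_failure_lines(log_text):
    """Return the log lines that suggest TensorFlow fell back to the CPU."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


sample_log = """I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'
W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
iteration: 50 loss: 0.1026 lr: 0.005
"""
for line in gpu_failure_lines(sample_log):
    print(line)
```

On the sample, only the libcuda.so.1 and cuInit lines are reported; the XLA line and the iteration line are ignored.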

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

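This does not cause an error. As the message itself says, TensorFlow read a negative NUMA node value from SysFS and simply falls back to NUMA node zero. At first I felt the iterations were not much faster than on the 2080 Ti and mistook this message for an error, when in fact the GPU was already running at full load. I recommend watching GPU usage in real time: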

watch -n 2 nvidia-smi
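For logging utilization from a script rather than watching the table, nvidia-smi's CSV query mode is easier to parse. The sketch below relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the function names are illustrative, and the parser assumes plain integer fields (some GPUs report values such as "[N/A]", which this sketch does not handle).

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn 'util, mem' CSV lines into a list of (util_percent, mem_mib) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def read_gpu_stats():
    """Query nvidia-smi once; return None if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_gpu_stats(result.stdout)


# Made-up sample line in the CSV layout, one line per GPU.
print(parse_gpu_stats("95, 21503\n"))  # [(95, 21503)]
```

Calling `read_gpu_stats()` in a loop with `time.sleep(2)` gives roughly the same refresh rate as the watch command above.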

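I actually ran into many other pitfalls, but I didn't write down the errors and fixes at the time. I'll add them later if I find them or hit them again.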
