Contents

  • Preface
  • 1. Installing CUDA
  • 2. TensorFlow (GPU version)
  • 3. PyTorch (GPU version)



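Preface

This post is for anyone who needs to set up the GPU versions of TensorFlow and PyTorch.

I wrote up an environment setup walkthrough before, but I still ran into plenty of version mismatches myself. So this time I have put the CUDA files on Baidu Cloud for you; no need to pick through the official site yourself!

Get the environment up and running first! There is no point losing time on this kind of thing; setting up environments is truly infuriating.

I don't explain much of the underlying theory here, and some steps are only mentioned in passing. Beginners can find the details in the post 《深度学习TensorFlow开发环境搭建教程》 (a tutorial on setting up a TensorFlow deep learning development environment), or search online.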
 

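1. Installing CUDA

Install Anaconda first, then install CUDA 10.1, which you can download straight from Baidu Cloud. (Just go with this version; don't get adventurous, play it safe.)

Link: https://pan.baidu.com/s/19Sr66HybqudCbJ6Ela750Q  Extraction code: mzes

Double-click the CUDA installer and accept the defaults all the way through; no need to study each screen. Then copy the cuDNN files into the corresponding locations (the files under cuDNN's bin, include and lib folders go into the folders of the same name in the CUDA install directory) and configure the environment variables, following the screenshot below.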

[Screenshot: environment variable settings for CUDA]
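
Before moving on, it can help to confirm that the environment variables actually took effect. The sketch below is only an illustration and not part of the original setup. It assumes the CUDA installer set a `CUDA_PATH` variable and that the DLL names for CUDA 10.1 and cuDNN 7 are the ones TensorFlow 2.3 looks for; the helper name `dirs_containing` is made up for this example.

```python
import os


def dirs_containing(filename, search_path=None):
    """Return every directory on search_path that contains filename."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Assumption: the CUDA installer created CUDA_PATH on this machine.
    print("CUDA_PATH =", os.environ.get("CUDA_PATH", "<not set>"))
    # cudart64_101.dll belongs to CUDA 10.1, cudnn64_7.dll to cuDNN 7.
    for dll in ("cudart64_101.dll", "cudnn64_7.dll"):
        found = dirs_containing(dll)
        print(dll, "->", found if found else "not found on PATH")
```

Open a new Prompt before running it, since a Prompt that was already open will not see newly added variables.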


 

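2. TensorFlow (GPU version)

  1. Create the environment: conda create --name tf-gpu python=3.7.1
  2. Activate the environment: activate tf-gpu (on newer conda versions: conda activate tf-gpu)
  3. Install the GPU version of TensorFlow: pip install tensorflow-gpu==2.3.0 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
  4. Type python, then enter the following code to check whether the GPU is available: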
# Test whether the GPU is available
import tensorflow as tf
tf.test.is_gpu_available()

# Output
True

You may get an error at this point. If you don't, skip the next part.
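
`tf.test.is_gpu_available()` still works in TensorFlow 2.3 but is marked as deprecated; the suggested replacement is `tf.config.list_physical_devices('GPU')`. Here is a minimal sketch of the same check using the newer call. The function name `tf_gpu_names` is just for illustration, and it returns None when TensorFlow is not installed in the active environment.

```python
import importlib.util


def tf_gpu_names():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = tf_gpu_names()
    if names is None:
        print("TensorFlow is not installed in this environment")
    elif names:
        print("GPUs visible to TensorFlow:", names)
    else:
        print("TensorFlow is installed but sees no GPU")
```

An empty list here corresponds to `False` from the older call.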

# So close: some DLLs cannot be loaded, even though every file is present under the CUDA path
>>> tf.test.is_gpu_available()
...
2021-03-16 19:49:25.587973: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2021-03-16 19:49:25.638884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 3GB computeCapability: 6.1
coreClock: 1.7335GHz coreCount: 9 deviceMemorySize: 3.00GiB deviceMemoryBandwidth: 178.99GiB/s
2021-03-16 19:49:25.654811: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-03-16 19:49:25.663812: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
2021-03-16 19:49:25.674031: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2021-03-16 19:49:25.683725: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2021-03-16 19:49:25.693968: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
2021-03-16 19:49:25.703211: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
2021-03-16 19:49:25.791687: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2021-03-16 19:49:25.799832: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-03-16 19:49:25.915500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-16 19:49:25.923543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2021-03-16 19:49:25.927894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2021-03-16 19:49:25.934997: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2544290c000 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-16 19:49:25.944446: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1060 3GB, Compute Capability 6.1
False

Check the environment variables first. If it still fails after you fix them, read on.
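
One way to check the environment variables is to pull the DLL names out of the warnings and see whether any directory on PATH contains them. The following is a rough sketch, not the procedure I actually followed: `missing_dlls` and `locate` are hypothetical helper names, and the sample log is abbreviated from the warnings in the log.

```python
import os
import re

# Matches the dso_loader warning format seen in the TensorFlow 2.3 log.
MISSING_RE = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_dlls(log_text):
    """Extract the DLL names reported as not loadable, in order, without duplicates."""
    seen = []
    for name in MISSING_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def locate(filename, search_path):
    """Return the first directory in search_path that holds filename, else None."""
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


if __name__ == "__main__":
    # Abbreviated sample; paste your own console output here instead.
    sample_log = """
W ... dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
W ... dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
"""
    for dll in missing_dlls(sample_log):
        where = locate(dll, os.environ.get("PATH", ""))
        print(dll, "->", where or "not on PATH")
```

If every DLL comes back as "not on PATH" even though the files exist in the CUDA install directory, the PATH entry is the likely culprit.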

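Copying the DLLs named in the warnings into a system directory makes them loadable. The path is C:/Windows/System, as in the screenshot below.

[Screenshot: the DLLs copied into C:/Windows/System]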

# Another error follows; restarting the Prompt cleared it!
2021-03-16 20:26:27.550551: E tensorflow/stream_executor/cuda/cuda_driver.cc:1398] failed to query total available memory: CUDA_ERROR_UNKNOWN: unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Luo\AppData\Local\conda\conda\envs\lww-tf-gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Luo\AppData\Local\conda\conda\envs\lww-tf-gpu\lib\site-packages\tensorflow\python\framework\test_util.py", line 1563, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "C:\Users\Luo\AppData\Local\conda\conda\envs\lww-tf-gpu\lib\site-packages\tensorflow\python\client\device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: cudaGetDevice() failed. Status: cudaGetErrorString symbol not found.

 

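3. PyTorch (GPU version)

  1. Create the environment: conda create --name torch-gpu python=3.7.1
  2. Activate the environment: activate torch-gpu (on newer conda versions: conda activate torch-gpu)
  3. Install the GPU version of PyTorch:
    conda install pytorch cudatoolkit=10.1 -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
    (Note: -i and --trusted-host are pip options rather than conda ones. If conda rejects them, drop both flags and add -c pytorch to pull from the pytorch channel, or point conda at a mirror channel you trust.)

Whatever you do, don't download from the official site: the package is about 700 MB, it takes a whole day, and the connection tends to drop and fail the download...

If the install succeeds, type python and check whether the GPU is available.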

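# Test statements
import torch
torch.cuda.is_available()

# Output
True

If it prints True, you're done! If it is unfortunately False, well, that's a hassle; go search for it. It may be an NVIDIA driver version problem.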
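
When the result is False, it helps to collect a few version facts before searching. Besides the driver, a common cause is that a CPU-only build of PyTorch got installed. Below is a small diagnostic sketch under that assumption; `torch_cuda_report` is a name invented for this example, and it only reports that torch is missing if the package is not installed.

```python
import importlib.util


def torch_cuda_report():
    """Collect the version facts that usually matter when CUDA is not detected."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch
    available = torch.cuda.is_available()
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build of PyTorch was installed.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": available,
    }
    if available:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` shows a CUDA version but `cuda_available` is False, running nvidia-smi in the Prompt shows the installed driver version, which is the next thing to compare.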

(I can vouch that the cuda, cudnn and torch versions here match, since this is the exact set I use myself. Good luck!)