Kubeflow GPU pooling: Kubernetes GPU

Reposted

架构魔法之光 2024-03-27 11:54:10


Installing the GPU Support Plugin for Kubernetes


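Kubernetes 1.10.x can schedule and run GPU containers directly once this plugin is installed.
The method described here is based on the NVIDIA device plugin and supports only NVIDIA graphics cards and Tesla compute cards.
Main steps: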

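Install the NVIDIA drivers for the graphics card.
Install the nvidia-docker2 container runtime.
Set nvidia-docker2 as the default runtime of the container engine.
Enable GPU acceleration support in the Docker service.
Install the NVIDIA device plugin.
Enable GPU support in Kubernetes.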

1. Installing a Docker Engine with NVIDIA Support

Once a Docker engine with NVIDIA support is installed, containers can use the GPU. The steps are as follows:

# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

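Note that Docker, when run as shown above, now supports GPUs directly, so the separate nvidia-docker command is no longer needed. This greatly improves compatibility with container orchestration systems, and Kubernetes can now run GPU workloads in Docker containers as well.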
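Before moving on, it can help to confirm that Docker actually registered the new runtime. The following is a minimal sketch, not part of the original guide: it assumes `docker info --format '{{json .}}'` is available and that its output contains the `Runtimes` and `DefaultRuntime` fields. The function names are hypothetical.

```python
import json
import subprocess


def nvidia_runtime_status(info):
    """Return (registered, is_default) for the 'nvidia' runtime.

    `info` is the parsed JSON produced by `docker info --format '{{json .}}'`.
    """
    runtimes = info.get("Runtimes") or {}
    registered = "nvidia" in runtimes
    is_default = info.get("DefaultRuntime") == "nvidia"
    return registered, is_default


def read_docker_info():
    """Run `docker info` and parse its JSON output (requires a running Docker daemon)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Sample data only; on a real host, call read_docker_info() instead.
    sample = {"Runtimes": {"runc": {}, "nvidia": {}}, "DefaultRuntime": "runc"}
    print(nvidia_runtime_status(sample))
```

If the first value is false, nvidia-docker2 is probably not installed correctly; if only the second is false, the runtime still has to be made the default, which Section 2.2 covers.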


The current version depends on Docker 18.03. If a different version is already installed, you can pin the version to install, as follows:

sudo apt install docker-ce=18.03.1~ce-0~ubuntu

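2. Installing NVIDIA's Kubernetes GPU Plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that automatically:

Exposes the number of GPUs on each node of the cluster.
Keeps track of the health of the GPUs.
Runs GPU-enabled containers in Kubernetes.

The Kubernetes device plugin project contains NVIDIA's official implementation.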

2.1 Prerequisites

The environment requirements for running the NVIDIA device plugin are:

NVIDIA drivers ~= 361.93
nvidia-docker version > 2.0 (see Part 1 above)
Docker configured with nvidia as the default runtime
Kubernetes version = 1.10
The DevicePlugins feature gate enabled

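2.2 Preparing the GPU Nodes

The following steps must be executed on every GPU node. The NVIDIA drivers and nvidia-docker must already be installed successfully before you start.

First, on each node, make the nvidia runtime the default container runtime by editing the Docker daemon config file, located at /etc/docker/daemon.json: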

{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

The "exec-opts": ["native.cgroupdriver=systemd"] line above is required on Ubuntu 16.04 with Docker CE; without it, kubelet fails to start.

If the runtimes entry is missing, refer to the nvidia-docker project and install it first.
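Editing daemon.json by hand risks dropping settings that are already there. The sketch below, which is an illustration rather than part of the original procedure, merges the nvidia runtime into an existing configuration instead of overwriting it. The helper names `with_nvidia_default` and `update_daemon_json` and the `systemd_cgroup` parameter are hypothetical.

```python
import json
from pathlib import Path


def with_nvidia_default(config, systemd_cgroup=False):
    """Return a copy of a daemon.json dict with nvidia as the default runtime."""
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    # Keep an existing nvidia entry if one is already defined.
    runtimes.setdefault(
        "nvidia",
        {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []},
    )
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    if systemd_cgroup:
        # The article requires this option on Ubuntu 16.04 with Docker CE.
        opts = list(merged.get("exec-opts", []))
        if "native.cgroupdriver=systemd" not in opts:
            opts.append("native.cgroupdriver=systemd")
        merged["exec-opts"] = opts
    return merged


def update_daemon_json(path, systemd_cgroup=False):
    """Read, merge, and rewrite the daemon config file at `path`."""
    target = Path(path)
    current = json.loads(target.read_text()) if target.exists() else {}
    updated = with_nvidia_default(current, systemd_cgroup)
    target.write_text(json.dumps(updated, indent=4) + "\n")
    return updated


if __name__ == "__main__":
    # In-memory demonstration; no file is touched here.
    print(json.dumps(with_nvidia_default({}, systemd_cgroup=True), indent=4))
```

On a real node you would call `update_daemon_json("/etc/docker/daemon.json", systemd_cgroup=True)` with root privileges and then reload or restart the Docker daemon so the change takes effect.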

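Second, enable the DevicePlugins feature gate. This must be set on every GPU node.

If your Kubernetes cluster was deployed with kubeadm and the nodes run systemd, open the kubeadm systemd unit file at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and add the following argument as an environment variable: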

Environment="KUBELET_GPU_ARGS=--feature-gates=DevicePlugins=true"

If you spot the Accelerators feature gate, you should remove it, as it might interfere with the DevicePlugins feature gate.
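When a node already passes other feature gates to kubelet, the new value has to be combined with them. A minimal sketch of that merge, assuming the gates arrive as the comma-separated `name=value` string accepted by `--feature-gates`; the function name `merge_feature_gates` is hypothetical.

```python
def merge_feature_gates(existing):
    """Build a --feature-gates value that enables DevicePlugins.

    Any Accelerators gate is dropped, as the note above recommends;
    all other gates are preserved in their original order.
    """
    gates = {}
    for item in existing.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        gates[name.strip()] = value.strip() or "true"
    gates.pop("Accelerators", None)
    gates["DevicePlugins"] = "true"
    return ",".join(f"{name}={value}" for name, value in gates.items())


if __name__ == "__main__":
    current = "Accelerators=true,RotateKubeletServerCertificate=true"
    print("--feature-gates=" + merge_feature_gates(current))
```

The resulting string can then be placed in the KUBELET_GPU_ARGS environment line of the unit file.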

The complete file (/etc/systemd/system/kubelet.service.d/10-kubeadm.conf) looks like this:

[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=systemd"
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
Environment="KUBELET_GPU_ARGS=--feature-gates=DevicePlugins=true"

ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_CGROUP_ARGS $KUBELET_EXTRA_ARGS $KUBELET_GPU_ARGS

Reload the configuration, then restart the service:

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet

In this guide we used kubeadm and kubectl as the method for setting up and administering the Kubernetes cluster, but there are many ways to deploy a Kubernetes cluster. To enable the DevicePlugins feature gate if you are not using the kubeadm + systemd configuration, you will need to make sure that the arguments that are passed to Kubelet include the following --feature-gates=DevicePlugins=true.

2.3 Enabling GPU Support in Kubernetes

Once the options above have been enabled on all GPU nodes, you can enable GPU support in Kubernetes by installing the DaemonSet provided by NVIDIA:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
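After the DaemonSet is running, each GPU node should advertise the `nvidia.com/gpu` resource. The sketch below is one way to check this; it assumes the output of `kubectl get nodes -o json` is fed to it, and the function name `gpus_per_node` is hypothetical.

```python
import json
import sys


def gpus_per_node(node_list):
    """Map each node name to its allocatable nvidia.com/gpu count.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    Nodes that do not advertise the resource are reported as 0.
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


if __name__ == "__main__":
    # Usage: kubectl get nodes -o json | python gpus_per_node.py
    for node_name, count in gpus_per_node(json.load(sys.stdin)).items():
        print(f"{node_name}: {count} GPU(s)")
```

A GPU node that reports 0 usually means the plugin pod is not running there, or that the default runtime or feature gate from Section 2.2 was not applied on that node.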

2.4 Running GPU Workloads

NVIDIA GPUs can be consumed through container-level resource requirements, using the resource name nvidia.com/gpu:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

⚠️ Warning: if you do not request GPUs when using the device plugin with NVIDIA images, all the GPUs on the machine will be exposed inside the container.
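Because a missing limit exposes every GPU, it can be safer to generate the manifest with an explicit GPU count for each container. The following is an illustrative sketch only: the helpers `gpu_container`, `gpu_pod`, and `total_gpus` are hypothetical, and the output is JSON, which kubectl accepts in place of YAML.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_container(name, image, gpus):
    """Build a container spec with an explicit, whole-number GPU limit."""
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPU limits must be positive whole numbers")
    return {
        "name": name,
        "image": image,
        "resources": {"limits": {GPU_RESOURCE: gpus}},
    }


def gpu_pod(pod_name, containers):
    """Wrap container specs in a minimal v1 Pod manifest."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {"containers": containers},
    }


def total_gpus(pod):
    """Sum the GPU limits of all containers in the pod."""
    return sum(
        c.get("resources", {}).get("limits", {}).get(GPU_RESOURCE, 0)
        for c in pod["spec"]["containers"]
    )


if __name__ == "__main__":
    pod = gpu_pod("gpu-pod", [
        gpu_container("cuda-container", "nvidia/cuda:9.0-devel", 2),
        gpu_container("digits-container", "nvidia/digits:6.0", 2),
    ])
    print(json.dumps(pod, indent=2))
```

This reproduces the pod above, whose two containers together need 4 GPUs, so it can only be scheduled on a node with at least that many GPUs free.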

2.5 Documentation

⚠️ Please note:

The device plugin feature is still alpha, which is why it requires the feature gate to be enabled.
The NVIDIA device plugin is still considered alpha and is missing:

    Security features
    More comprehensive GPU health checking features
    GPU cleanup features
    ...

Support will only be provided for the official NVIDIA device plugin.

The following sections focus on how to build and run the device plugin.

2.6 With Docker

Build

Option 1: pull the prebuilt container image from Docker Hub:

$ docker pull nvidia/k8s-device-plugin:1.10

Option 2: build the container image yourself without cloning the repository:

$ docker build -t nvidia/k8s-device-plugin:1.10 https://github.com/NVIDIA/k8s-device-plugin.git#v1.10

Option 3: clone the repository and build the image yourself, which lets you modify the code:

$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ docker build -t nvidia/k8s-device-plugin:1.10 .

Run locally

$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10

Deploy as a DaemonSet:

$ kubectl create -f nvidia-device-plugin.yml

2.7 Without Docker

Build

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

Run locally

$ ./k8s-device-plugin
