Overview

Tencent's GPU virtualization solution is built on the Kubernetes device plugin mechanism. It registers two extended resource types on each Kubernetes node, tencent.com/vcuda-core and tencent.com/vcuda-memory, and allocates GPU resources to Pods dynamically, making GPUs in Kubernetes much easier to consume. Beyond the GPU resource management offered by NVIDIA's official device plugin, it adds fragment-aware scheduling, GPU topology-aware scheduling, and GPU resource quotas. At the container level it splits whole GPUs into fractional units, and under the hood it needs nothing more than a wrapper library plus Linux dynamic-library linking to enforce hard upper limits on both GPU compute and GPU memory.

Note that the solution is designed specifically for NVIDIA devices and therefore works only with NVIDIA GPUs; like most GPU virtualization techniques, it is tied to one vendor's GPU architecture.

Architecturally, the GPUManager solution consists of three components: the CUDA wrapper library vcuda, the Kubernetes device plugin gpu-manager-daemonset, and the Kubernetes scheduler extender gpu-quota-admission.

  • The vcuda library wraps nvidia-ml and libcuda; by hijacking the CUDA calls made by user programs inside the container, it limits how much GPU compute and memory the container's processes can use.
  • gpu-manager-daemonset is a standard Kubernetes device plugin implementing GPU topology awareness and device/driver mapping. GPUManager supports two modes, shared and exclusive (see the request sketch after this list). When a workload requests tencent.com/vcuda-core between 0 and 100, shared-mode scheduling is used and fragmented capacity is packed onto a single card first; when the request is a multiple of 100, exclusive-mode scheduling is used: gpu-manager-daemonset builds a topology tree of the GPU cards and picks the optimal placement (the leaf nodes with the shortest distance). Note that GPUManager only supports requests in the range 0–100 or integer multiples of 100; values such as 150 or 220 are not supported. Each GPU card exposes 100 units of vcuda-core, so you can request either a fraction of one card (0–1) or a whole number of cards. GPU memory is allocated in units of 256 MiB.
  • gpu-quota-admission is a Kubernetes scheduler extender implementing the scheduler's predicates interface. When kube-scheduler schedules a Pod that requests tencent.com/vcuda-core, the predicates phase calls gpu-quota-admission's predicates endpoint to filter (and bind) nodes. It also provides GPU resource-pool scheduling, which handles per-namespace quotas for different GPU models.
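
To make the resource units concrete, here is a minimal sketch of a container's resources block under each mode; the tencent.com/vcuda-* names come from the components above, and the unit math follows the rules just described:

# Shared mode: 0 < vcuda-core < 100, i.e. a fraction of one card.
# 50 vcuda-core = half a card; 32 vcuda-memory = 32 x 256 MiB = 8 GiB.
resources:
  requests:
    tencent.com/vcuda-core: "50"
    tencent.com/vcuda-memory: "32"
  limits:
    tencent.com/vcuda-core: "50"
    tencent.com/vcuda-memory: "32"

# Exclusive mode: vcuda-core is a whole multiple of 100, i.e. full cards.
# 200 vcuda-core = two entire GPUs, placed according to GPU topology.
resources:
  requests:
    tencent.com/vcuda-core: "200"
  limits:
    tencent.com/vcuda-core: "200"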

Deployment

Installing the GPU driver

This article uses an NVIDIA Tesla T4 GPU on Ubuntu 16.04.

1. Disable nouveau

Ubuntu 16.04 ships with the third-party open-source nouveau driver by default. Before installing the NVIDIA driver you must disable nouveau, otherwise the two conflict and the NVIDIA driver cannot be installed.

# Open /etc/modprobe.d/blacklist.conf and append the following two lines
blacklist nouveau
options nouveau modeset=0

# Rebuild the initramfs so the blacklist takes effect
update-initramfs -u

# Reboot; this step is mandatory
reboot

# After rebooting, check whether nouveau is disabled
lsmod | grep nouveau
# No output means nouveau was disabled successfully
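
If you prefer a single command to editing the file by hand, appending the two lines with a heredoc also works (a sketch; run as root):

# Append the blacklist entries in one step
cat <<'EOF' >> /etc/modprobe.d/blacklist.conf
blacklist nouveau
options nouveau modeset=0
EOF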

2. Download and install the driver

Go to the NVIDIA driver download site, find the driver matching your GPU, and download it; the download is a .run file.

Upload the file (NVIDIA-Linux-x86_64-525.105.17.run) to the server and run the following steps:

# Stop the graphical interface; this is mandatory
service lightdm stop

# Remove any NVIDIA drivers already on the system; strongly recommended
apt-get remove nvidia-*

# Make the installer executable
chmod +x NVIDIA-Linux-x86_64-525.105.17.run

# Run the installer
# --no-x-check        skip the X server check during installation
# --no-nouveau-check  skip the nouveau check during installation
# --no-opengl-files   install only the driver files, not the OpenGL files
./NVIDIA-Linux-x86_64-525.105.17.run --no-x-check --no-nouveau-check --no-opengl-files

# After completing the guided installation, load the NVIDIA kernel module
modprobe nvidia

# Verify the driver installation; output like the following means success
nvidia-smi

Fri Jun  2 14:51:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   49C    P0    27W /  70W |      2MiB / 15360MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# The following command is another way to check; you should normally see the five device nodes below
ls /dev | grep nvidia

nvidia0
nvidia-caps
nvidiactl
nvidia-uvm
nvidia-uvm-tools

Deploying gpu-quota-admission

Source: https://github.com/tkestack/gpu-admission

# Create the ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-quota-admission
  namespace: kube-system
data:
  gpu-quota-admission.config: |
    {
        "QuotaConfigMapName": "gpuquota",
        "QuotaConfigMapNamespace": "kube-system",
        "GPUModelLabel": "gaia.tencent.com/gpu-model",
        "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
    }

# Create the Service
apiVersion: v1
kind: Service
metadata:
  name: gpu-quota-admission
  namespace: kube-system
spec:
  ports:
  - port: 3456
    protocol: TCP
    targetPort: 3456
  selector:
    k8s-app: gpu-quota-admission
  type: ClusterIP

# Create the Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: gpu-quota-admission
  name: gpu-quota-admission
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: gpu-quota-admission
  template:
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      namespace: kube-system
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
            weight: 1
      containers:
      - env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS
          value: --incluster-mode=true
        image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
        imagePullPolicy: IfNotPresent
        name: gpu-quota-admission
        ports:
        - containerPort: 3456
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      dnsPolicy: ClusterFirstWithHostNet
      initContainers:
      - command:
        - sh
        - -c
        - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config
          /etc/kubernetes/'
        image: busybox
        imagePullPolicy: Always
        name: init-kube-config
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      priority: 2000000000
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: gpu-manager
      serviceAccountName: gpu-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: gpu-quota-admission
        name: config

Deploying gpu-manager

Source: https://github.com/tkestack/gpu-manager/tree/master

The repository ships YAML manifests; the gpu-manager manifest needs a few changes.

Since Kubernetes 1.13, Pods are not scheduled onto master nodes by default. If you need the DaemonSet to run on a master node, add a toleration under spec.template.spec.tolerations:

- key: node-role.kubernetes.io/master  # toleration for the master taint
  effect: NoSchedule

# Create the ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system

# Create the ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system

# Create the Service
apiVersion: v1
kind: Service
metadata:
  name: gpu-manager-metric
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    name: gpu-manager-ds

# Create the DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master  # added toleration
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:v1.1.5  # pin the image version
          imagePullPolicy: IfNotPresent       # use the locally cached image if present
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root                # added mount
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS              # added startup flags
              value: "--logtostderr=false --cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root                   # added mount
          hostPath:
            type: Directory
            path: /root/.kube


# Label the GPU node with nvidia-device-enable=enable
# *.*.*.* is the node IP or name
kubectl label node *.*.*.* nvidia-device-enable=enable

# Verify that gpu-manager-daemonset was scheduled onto the GPU node
kubectl get pods -n kube-system
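
Once the DaemonSet Pod is running on the GPU node, the node should advertise the extended resources; a quick check (a sketch):

# The node description should now list tencent.com/vcuda-core and tencent.com/vcuda-memory
kubectl describe node <gpu-node-name> | grep vcuda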

Custom scheduler

Prepare the custom scheduler policy file /etc/kubernetes/scheduler-policy-config.json with the following content:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "PodFitsHostPorts"
    },
    {
      "name": "PodFitsResources"
    },
    {
      "name": "NoDiskConflict"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "HostName"
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ],
  "hardPodAffinitySymmetricWeight": 10,
  "alwaysCheckAllPredicates": false
}

// The host and port in "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler"
// point at the gpu-quota-admission Service in kube-system; change them only if your
// setup differs. Otherwise this value works as-is.
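
Before touching the scheduler it is worth confirming the extender is reachable; the endpoint path is urlPrefix plus the filterVerb from the policy above. A minimal sketch, probing the Service's ClusterIP from a control-plane node (any HTTP response, even an error for this empty GET, shows connectivity):

# Look up the Service ClusterIP, then probe the extender endpoint
kubectl -n kube-system get svc gpu-quota-admission
curl -s http://<cluster-ip>:3456/scheduler/predicates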

If the cluster was deployed with kubeadm, the scheduler runs as a static Pod. The kubelet watches the manifest files and automatically restarts the Pod with the new configuration whenever a file changes, so all we need to do is edit the scheduler's manifest.

# Edit /etc/kubernetes/manifests/kube-scheduler.yaml (back it up first)
# The file after modification:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json  # added
    - --use-legacy-policy-config=true                                    # added
    image: registry.aliyuncs.com/google_containers/kube-scheduler:v1.23.17
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.json  # mount the policy file
      name: policyconfig
      readOnly: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet                        # changed DNS policy so the extender's in-cluster DNS name resolves
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:                                             # mounted into the scheduler container
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: policyconfig
status: {}

Once saved, the change takes effect automatically; wait for the Pod to come back up.
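
One way to watch the restart (a sketch):

# Watch the static scheduler Pod get recreated with the new flags
kubectl -n kube-system get pod -l component=kube-scheduler -w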

Inspecting the GPU node

# Describe the node
kubectl describe node  <gpu-node-name>

Name:               senter-nf5270m6
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=senter-nf5270m6
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nvidia-device-enable=enable
                    nvidia.com/gpu=true
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"96:58:ae:04:f9:d1"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.80.80
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 01 Jun 2023 11:17:18 +0800
Taints:             node-role.kubernetes.io/master:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  senter-nf5270m6
  AcquireTime:     <unset>
  RenewTime:       Fri, 02 Jun 2023 17:12:09 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 01 Jun 2023 17:43:22 +0800   Thu, 01 Jun 2023 17:43:22 +0800   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Fri, 02 Jun 2023 17:07:23 +0800   Thu, 01 Jun 2023 11:17:17 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 02 Jun 2023 17:07:23 +0800   Thu, 01 Jun 2023 11:17:17 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 02 Jun 2023 17:07:23 +0800   Thu, 01 Jun 2023 11:17:17 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 02 Jun 2023 17:07:23 +0800   Thu, 01 Jun 2023 16:42:20 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.80.80
  Hostname:    senter-nf5270m6
Capacity:
  cpu:                       24
  ephemeral-storage:         238732896Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    32584332Ki
  pods:                      110
  tencent.com/vcuda-core:    100    ##### vcuda-core capacity: 100 units = 1 GPU card
  tencent.com/vcuda-memory:  60     ##### vcuda-memory capacity: 1 unit = 256 MiB
Allocatable:
  cpu:                       24
  ephemeral-storage:         220016236590
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    32481932Ki
  pods:                      110
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  60
System Info:
  Machine ID:                 cdc701b19ee14765b34e2685d788ed94
  System UUID:                14C4C29C-5485-03ED-0010-DEBF808F0473
  Boot ID:                    5adf52b1-b384-46bd-b9a0-716cac22ae2e
  Kernel Version:             4.15.0-142-generic
  OS Image:                   Ubuntu 16.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.23.17
  Kube-Proxy Version:         v1.23.17
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
Non-terminated Pods:          (13 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  default                     vcuda-test-655895cd9c-hghk6                   4 (16%)       4 (16%)     8Gi (25%)        8Gi (25%)      7h8m
  kube-flannel                kube-flannel-ds-4g2gh                         100m (0%)     0 (0%)      50Mi (0%)        0 (0%)         29h
  kube-system                 coredns-6d8c4cb4d-sxgb8                       100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     23h
  kube-system                 coredns-6d8c4cb4d-zk48c                       100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     23h
  kube-system                 etcd-senter-nf5270m6                          100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         29h
  kube-system                 gpu-manager-daemonset-7rzvs                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         23h
  kube-system                 gpu-quota-admission-7697596f4-lcxn8           1 (4%)        2 (8%)      1Gi (3%)         2Gi (6%)       25h
  kube-system                 kube-apiserver-senter-nf5270m6                250m (1%)     0 (0%)      0 (0%)           0 (0%)         29h
  kube-system                 kube-controller-manager-senter-nf5270m6       200m (0%)     0 (0%)      0 (0%)           0 (0%)         29h
  kube-system                 kube-proxy-r8j2t                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  kube-system                 kube-scheduler-senter-nf5270m6                100m (0%)     0 (0%)      0 (0%)           0 (0%)         104s
  kubernetes-dashboard        dashboard-metrics-scraper-799d786dbf-6rbbj    0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  kubernetes-dashboard        kubernetes-dashboard-6b6b86c4c5-grvcj         0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests      Limits
  --------                  --------      ------
  cpu                       5950m (24%)   6 (25%)
  memory                    9506Mi (29%)  10580Mi (33%)
  ephemeral-storage         0 (0%)        0 (0%)
  hugepages-1Gi             0 (0%)        0 (0%)
  hugepages-2Mi             0 (0%)        0 (0%)
  tencent.com/vcuda-core    20            20           ### vcuda-core units in use
  tencent.com/vcuda-memory  25            25           ### vcuda-memory units in use
Events:                     <none>

Testing

We verify the setup with the TensorFlow framework and the MNIST dataset.

Image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2

The image pull is slow (roughly 20 minutes), so Pod creation may error out, but the Pod restarts automatically and resumes pulling.
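
To sidestep the pull timeout, you can pre-pull the image on the GPU node; the node runs Docker, as the node description above shows (a sketch):

# Pre-pull the test image on the GPU node
docker pull ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2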

# Create the test workload
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master  # toleration for the master taint
        effect: NoSchedule
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "35"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "35"
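
Save the manifest and apply it; vcuda-test.yaml below stands for whatever filename you saved it under (a sketch):

# Create the Deployment and check the Pod
kubectl apply -f vcuda-test.yaml
kubectl get pod -l k8s-app=vcuda-test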

The Pod is then created and runs in the default namespace.

Enter the Pod:

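A sketch, assuming the image provides bash:

# Grab the Pod name and open a shell inside it
POD=$(kubectl get pod -l k8s-app=vcuda-test -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$POD" -- /bin/bash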

Run the following commands to start training:

cd /data/tensorflow/mnist

time python convolutional.py

On the host, you can see the GPU device in use.
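
For example, on the GPU node:

# The training process should now appear in the nvidia-smi process list
nvidia-smi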
