Overview
Tencent's GPU virtualization solution (GPUManager) is built on the device plugin mechanism introduced by Kubernetes. It registers two resource types on each Kubernetes node, tencent.com/vcuda-core and tencent.com/vcuda-memory, and allocates GPU resources to Pods dynamically, which makes GPUs in Kubernetes much easier to consume. Beyond the GPU resource management offered by NVIDIA's official device plugin, it adds fractional-resource scheduling, topology-aware GPU scheduling, and GPU resource quotas, so a physical GPU can be split into pieces at the container level. Internally it relies only on a wrapper library and Linux dynamic-library linking to enforce upper limits on GPU compute and GPU memory.
Note that it is designed specifically for NVIDIA devices and therefore only works with NVIDIA GPUs; most GPU virtualization techniques are built around a particular vendor's GPU architecture.
In terms of engineering design, the GPUManager solution consists of three parts: the CUDA wrapper library vcuda, the Kubernetes device plugin gpu-manager-daemonset, and the Kubernetes scheduler extender gpu-quota-admission.
- The vcuda library is a wrapper around nvidia-ml (NVML) and libcuda. By intercepting the CUDA calls made by user programs inside a container, it limits how much GPU compute and GPU memory the processes in that container can use.
- gpu-manager-daemonset is a standard Kubernetes device plugin that implements GPU topology awareness and device/driver mapping. GPUManager supports two modes, shared and exclusive. When a workload requests tencent.com/vcuda-core in the range 0-100, shared scheduling is used and fragments are packed onto a single card first; when the request is a multiple of 100, exclusive scheduling is used: gpu-manager-daemonset builds a topology tree of the GPU cards and picks the optimal placement (the leaf nodes with the shortest distance). Note that GPUManager only supports requests in the range 0-100 or exact multiples of 100; values such as 150 or 220 cannot be scheduled. Each GPU card provides 100 units of resource, so you can request either a fraction of one card (0-1 card) or a whole number of cards. GPU memory is allocated in units of 256 MiB (see the request snippet after this list).
- gpu-quota-admission is a Kubernetes scheduler extender that implements the scheduler's predicates interface. When kube-scheduler schedules a Pod requesting tencent.com/vcuda-core, the predicates phase calls gpu-quota-admission's predicates endpoint to filter and bind nodes. gpu-quota-admission also provides GPU resource-pool scheduling, which solves the problem of per-namespace quotas for different GPU types.
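To make the resource units concrete, here is a minimal, purely illustrative request snippet (the numbers are examples, not taken from the deployment later in this article): 50 vcuda-core asks for half of one physical card, and 16 vcuda-memory asks for 16 x 256 MiB = 4 GiB of GPU memory.
# Illustrative only: half of one GPU card plus 4 GiB (16 x 256 MiB) of GPU memory
resources:
  requests:
    tencent.com/vcuda-core: "50"
    tencent.com/vcuda-memory: "16"
  limits:
    tencent.com/vcuda-core: "50"
    tencent.com/vcuda-memory: "16"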
Deployment
Installing the GPU driver
This article uses an NVIDIA Tesla T4 GPU on Ubuntu 16.04.
1. Disable nouveau
Ubuntu 16.04 ships with the third-party open-source nouveau driver by default. Before installing the NVIDIA driver you must disable nouveau, otherwise the two will conflict and the NVIDIA driver installation will fail.
# Open /etc/modprobe.d/blacklist.conf and append the following two lines
blacklist nouveau
options nouveau modeset=0
# Regenerate the initramfs
update-initramfs -u
# Reboot the system (required)
reboot
# After rebooting, verify that nouveau is disabled
lsmod | grep nouveau
# No output means nouveau was disabled successfully
2. Download and install the driver
Go to the official NVIDIA driver download site, find and download the driver that matches your GPU; the download is a .run file.
Upload the file (NVIDIA-Linux-x86_64-525.105.17.run) to the server and run the following steps:
# Stop the display manager (required)
service lightdm stop
# Remove any NVIDIA drivers already on the system (recommended)
apt-get remove nvidia-*
# Make the installer executable
chmod +x NVIDIA-Linux-x86_64-525.105.17.run
# Run the installer
# -no-x-check        skip the check for a running X server
# -no-nouveau-check  skip the nouveau check
# -no-opengl-files   install only the driver files, not the OpenGL files
./NVIDIA-Linux-x86_64-525.105.17.run -no-x-check -no-nouveau-check -no-opengl-files
# Follow the prompts; after installation completes, load the NVIDIA kernel module
modprobe nvidia
# Verify the installation; output like the following means the driver works
nvidia-smi
Fri Jun  2 14:51:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   49C    P0    27W /  70W |      2MiB / 15360MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
# The following command is another check; you should normally see the five NVIDIA device nodes below
ls /dev | grep nvidia
nvidia0
nvidia-caps
nvidiactl
nvidia-uvm
nvidia-uvm-tools
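As an optional extra check, the version reported by the loaded kernel module should match the installer you ran:
# Optional: the loaded kernel module should report the same driver version
cat /proc/driver/nvidia/version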
Deploying gpu-quota-admission
Source code: https://github.com/tkestack/gpu-admission
# Create the ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-quota-admission
  namespace: kube-system
data:
  gpu-quota-admission.config: |
    {
         "QuotaConfigMapName": "gpuquota",
         "QuotaConfigMapNamespace": "kube-system",
         "GPUModelLabel": "gaia.tencent.com/gpu-model",
         "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
    }
# Create the Service
apiVersion: v1
kind: Service
metadata:
  name: gpu-quota-admission
  namespace: kube-system
spec:
  ports:
  - port: 3456
    protocol: TCP
    targetPort: 3456
  selector:
    k8s-app: gpu-quota-admission
  type: ClusterIP
# Create the Deployment (controller)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: gpu-quota-admission
  name: gpu-quota-admission
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: gpu-quota-admission
  template:
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      namespace: kube-system
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
            weight: 1
      containers:
      - env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS
          value: --incluster-mode=true
        image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
        imagePullPolicy: IfNotPresent
        name: gpu-quota-admission
        ports:
        - containerPort: 3456
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      dnsPolicy: ClusterFirstWithHostNet
      initContainers:
      - command:
        - sh
        - -c
        - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config /etc/kubernetes/'
        image: busybox
        imagePullPolicy: Always
        name: init-kube-config
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      priority: 2000000000
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: gpu-manager
      serviceAccountName: gpu-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: gpu-quota-admission
        name: config
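The three manifests above can then be applied and checked in the usual way; a minimal sketch, assuming you saved them to a file named gpu-quota-admission.yaml (the file name is arbitrary):
kubectl apply -f gpu-quota-admission.yaml
# The admission Pod should be Running and the Service should expose port 3456
kubectl -n kube-system get pods -l k8s-app=gpu-quota-admission
kubectl -n kube-system get svc gpu-quota-admission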
Deploying gpu-manager
Source code: https://github.com/tkestack/gpu-manager/tree/master
The repository ships with YAML manifests; the gpu-manager manifest needs a few changes.
By default kubeadm taints master nodes so ordinary Pods are not scheduled there. If you need the Pod to run on a master node, add the following toleration under spec.template.spec.tolerations:
- key: node-role.kubernetes.io/master   ## toleration for the master taint
  effect: NoSchedule
# Create the ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
# Create the ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
# Create the metrics Service
apiVersion: v1
kind: Service
metadata:
  name: gpu-manager-metric
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 5678
    protocol: TCP
    targetPort: 5678
  selector:
    name: gpu-manager-ds
# Create the DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: tencent.com/vcuda-core
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master   ## added toleration for the master taint
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
      - image: tkestack/gpu-manager:v1.1.5   # pin the image version
        imagePullPolicy: IfNotPresent        # prefer the locally cached image
        name: gpu-manager
        securityContext:
          privileged: true
        ports:
        - containerPort: 5678
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: vdriver
          mountPath: /etc/gpu-manager/vdriver
        - name: vmdata
          mountPath: /etc/gpu-manager/vm
        - name: log
          mountPath: /var/log/gpu-manager
        - name: checkpoint
          mountPath: /etc/gpu-manager/checkpoint
        - name: run-dir
          mountPath: /var/run
        - name: cgroup
          mountPath: /sys/fs/cgroup
          readOnly: true
        - name: usr-directory
          mountPath: /usr/local/host
          readOnly: true
        - name: kube-root   # added mount
          mountPath: /root/.kube
          readOnly: true
        env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS   # added startup flags
          value: "--logtostderr=false --cgroup-driver=systemd"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumes:
      - name: device-plugin
        hostPath:
          type: Directory
          path: /var/lib/kubelet/device-plugins
      - name: vmdata
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/vm
      - name: vdriver
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/vdriver
      - name: log
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/log
      - name: checkpoint
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/checkpoint
      # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
      # inode change after host docker is restarted
      - name: run-dir
        hostPath:
          type: Directory
          path: /var/run
      - name: cgroup
        hostPath:
          type: Directory
          path: /sys/fs/cgroup
      # We have to mount /usr directory instead of specified library path, because of non-existing
      # problem for different distro
      - name: usr-directory
        hostPath:
          type: Directory
          path: /usr
      - name: kube-root   # added mount
        hostPath:
          type: Directory
          path: /root/.kube
# Label the GPU node with nvidia-device-enable=enable
# *.*.*.* is the node IP or name
kubectl label node *.*.*.* nvidia-device-enable=enable
# Verify that gpu-manager-daemonset has been scheduled onto the GPU node
kubectl get pods -n kube-system
Custom scheduler configuration
Prepare the scheduler policy file /etc/kubernetes/scheduler-policy-config.json with the following content:
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ],
  "hardPodAffinitySymmetricWeight": 10,
  "alwaysCheckAllPredicates": false
}
// If needed, change the address and port in "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler";
// for most setups the value above can be used as-is.
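As an optional sanity check before restarting the scheduler, you can verify that the extender endpoint is reachable. A sketch, assuming you run it from a Pod that has curl and can resolve cluster DNS (any HTTP status code, even 404 or 405, proves the Service answers on port 3456):
# Run from inside a Pod in the cluster; cluster DNS is needed to resolve the Service name
curl -s -o /dev/null -w "%{http_code}\n" http://gpu-quota-admission.kube-system:3456/scheduler/predicates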
If the cluster was set up with kubeadm, the scheduler runs as a static Pod and the kubelet watches its manifest file; once the file changes, the kubelet automatically restarts the Pod with the new configuration. So all we need to do is edit the scheduler's manifest file.
# Edit /etc/kubernetes/manifests/kube-scheduler.yaml (back up the file first)
# The file after the changes:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json   # added
    - --use-legacy-policy-config=true                                     # added
    image: registry.aliyuncs.com/google_containers/kube-scheduler:v1.23.17
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.json   # mount the policy file
      name: policyconfig
      readOnly: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # changed DNS policy so the extender Service name resolves
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:   # expose the policy file to the scheduler container
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: policyconfig
status: {}
The change takes effect automatically once the file is saved; wait for the new scheduler Pod to come up.
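To confirm that the scheduler restarted and picked up the policy file, you can, for example, check the static Pod and skim its logs (the exact log wording differs between versions; the Pod name suffix is your master node's name):
kubectl -n kube-system get pods | grep kube-scheduler
# Look for errors mentioning the policy file or the extender
kubectl -n kube-system logs kube-scheduler-<master-node-name> | grep -iE "policy|extender"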
Inspecting the GPU node
# View node details
kubectl describe node <gpu-node-name>
Name: senter-nf5270m6
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=senter-nf5270m6
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node.kubernetes.io/exclude-from-external-load-balancers=
nvidia-device-enable=enable
nvidia.com/gpu=true
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"96:58:ae:04:f9:d1"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.80.80
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 01 Jun 2023 11:17:18 +0800
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: senter-nf5270m6
AcquireTime: <unset>
RenewTime: Fri, 02 Jun 2023 17:12:09 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 01 Jun 2023 17:43:22 +0800 Thu, 01 Jun 2023 17:43:22 +0800 FlannelIsUp Flannel is running on this node
MemoryPressure False Fri, 02 Jun 2023 17:07:23 +0800 Thu, 01 Jun 2023 11:17:17 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 02 Jun 2023 17:07:23 +0800 Thu, 01 Jun 2023 11:17:17 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 02 Jun 2023 17:07:23 +0800 Thu, 01 Jun 2023 11:17:17 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 02 Jun 2023 17:07:23 +0800 Thu, 01 Jun 2023 16:42:20 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.80.80
Hostname: senter-nf5270m6
Capacity:
cpu: 24
ephemeral-storage: 238732896Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32584332Ki
pods: 110
tencent.com/vcuda-core: 100 ##### upper bound of vcuda-core units; 100 = one full GPU card
tencent.com/vcuda-memory: 60 ##### GPU memory units; one unit = 256 MiB
Allocatable:
cpu: 24
ephemeral-storage: 220016236590
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32481932Ki
pods: 110
tencent.com/vcuda-core: 100
tencent.com/vcuda-memory: 60
System Info:
Machine ID: cdc701b19ee14765b34e2685d788ed94
System UUID: 14C4C29C-5485-03ED-0010-DEBF808F0473
Boot ID: 5adf52b1-b384-46bd-b9a0-716cac22ae2e
Kernel Version: 4.15.0-142-generic
OS Image: Ubuntu 16.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.7
Kubelet Version: v1.23.17
Kube-Proxy Version: v1.23.17
PodCIDR: 10.244.0.0/24
PodCIDRs: 10.244.0.0/24
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default vcuda-test-655895cd9c-hghk6 4 (16%) 4 (16%) 8Gi (25%) 8Gi (25%) 7h8m
kube-flannel kube-flannel-ds-4g2gh 100m (0%) 0 (0%) 50Mi (0%) 0 (0%) 29h
kube-system coredns-6d8c4cb4d-sxgb8 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 23h
kube-system coredns-6d8c4cb4d-zk48c 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 23h
kube-system etcd-senter-nf5270m6 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 29h
kube-system gpu-manager-daemonset-7rzvs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23h
kube-system gpu-quota-admission-7697596f4-lcxn8 1 (4%) 2 (8%) 1Gi (3%) 2Gi (6%) 25h
kube-system kube-apiserver-senter-nf5270m6 250m (1%) 0 (0%) 0 (0%) 0 (0%) 29h
kube-system kube-controller-manager-senter-nf5270m6 200m (0%) 0 (0%) 0 (0%) 0 (0%) 29h
kube-system kube-proxy-r8j2t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
kube-system kube-scheduler-senter-nf5270m6 100m (0%) 0 (0%) 0 (0%) 0 (0%) 104s
kubernetes-dashboard dashboard-metrics-scraper-799d786dbf-6rbbj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
kubernetes-dashboard kubernetes-dashboard-6b6b86c4c5-grvcj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 5950m (24%) 6 (25%)
memory 9506Mi (29%) 10580Mi (33%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
tencent.com/vcuda-core 20 20 ### vcuda-core units already in use
tencent.com/vcuda-memory 25 25 ### vcuda-memory units already in use
Events: <none>
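If you only want the vcuda capacity figures, a custom-columns query prints them for every node (the column names here are arbitrary):
# Print the extended resources registered by gpu-manager for all nodes
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,VCUDA_CORE:.status.allocatable.tencent\.com/vcuda-core,VCUDA_MEMORY:.status.allocatable.tencent\.com/vcuda-memory"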
Testing
Verify the setup with a TensorFlow + MNIST training job.
Image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
The image pull is slow (roughly 20 minutes), so the Pod creation may fail at first, but it restarts automatically and continues pulling.
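Optionally, since the node in this article uses Docker as the container runtime, you can pre-pull the image on the GPU node so the test Pod does not have to wait (use the equivalent command for other runtimes):
# Pre-pull the test image on the GPU node
docker pull ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2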
# Create the test workload
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master   ## toleration for the master taint
        effect: NoSchedule
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "35"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "35"
After applying the manifest, the Pod is created in the default namespace and starts running.
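For example, assuming the Deployment above is saved to a file named vcuda-test.yaml (the file name is arbitrary):
kubectl apply -f vcuda-test.yaml
# Wait for the Pod to reach Running
kubectl get pods -n default -l k8s-app=vcuda-test -w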
Enter the Pod:
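One way to do this (replace the placeholder with the actual Pod name from kubectl get pods):
# Open an interactive shell inside the test container
kubectl exec -it <vcuda-test-pod-name> -n default -- /bin/bash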
Run the following commands to start training:
cd /data/tensorflow/mnist
time python convolutional.py
Querying on the host shows that the GPU device is in use:
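For instance, watching nvidia-smi on the GPU node while the training runs should show the python process together with its GPU memory usage (the 2-second refresh interval is arbitrary):
# Refresh nvidia-smi every 2 seconds on the GPU node to watch utilization and memory
watch -n 2 nvidia-smi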