References: https://cloud.tencent.com/developer/article/1685122
https://www.jianshu.com/p/7d795bc226c7
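I. Introduction to GPU Virtualization
A GPU is a PCIe device built for matrix computation. It is typically used for parallel workloads such as video decoding, rendering, and scientific computing. Each scenario uses the GPU differently and relies on a different acceleration library. The GPU virtualization discussed in this article targets scientific computing, where the acceleration library is NVIDIA CUDA.
From the user's perspective, GPU virtualization falls into two broad types: virtualization at the virtual machine level and virtualization at the container level. VM-level virtualization exposes the physical GPU to multiple KVM virtual machines, each of which installs its own driver. This keeps GPU functionality complete inside each VM while isolating and sharing GPU resources; its one drawback is relatively high resource overhead. Container-level virtualization has two approaches. The first is to bring the GPU under cgroup management, but there is no mature proposal yet, so it is unlikely to land in the short term. The second wraps the GPU driver: key driver interfaces (such as GPU memory allocation and CUDA thread creation) are wrapped and intercepted, and the interception layer limits how much compute a user process may consume. The drawback is that compatibility depends on the vendor driver, but the approach is lightweight and its performance overhead is very small. GPUManager belongs to this second type of container-level virtualization. This article covers how GPUManager works and how to deploy it.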
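II. GPUManager Architecture
GPUManager is a GPU virtualization solution that runs on Kubernetes (k8s). Before looking at its architecture, consider how k8s supports heterogeneous resources. Starting with version 1.6, NVIDIA GPU-related code was introduced into the k8s in-tree code, but it did not support GPU scheduling and could not be used in real production environments. To meet the growing demand for heterogeneous resources (such as GPU, InfiniBand, and FPGA), the community proposed the Extended Resource and Device Plugin mechanisms in version 1.8, which support scheduling and mapping of heterogeneous resources out of tree.
GPUManager is a container-level GPU virtualization solution developed in-house at Tencent. Besides being compatible with the GPU resource management features of the official NVIDIA plugin, it adds fragmented-resource scheduling, GPU topology-aware scheduling, and GPU resource quotas, so that a GPU can be split into fractions at the container level. In principle it relies only on a wrapper library and Linux dynamic library linking to enforce upper limits on GPU compute and GPU memory.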
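To make the wrapper-library idea concrete, here is a toy sketch in Python. It is not the vcuda code (which wraps the CUDA libraries themselves); it only illustrates the pattern of putting a thin layer in front of an allocation call and refusing requests that would exceed a cap. The names QuotaAllocator, QuotaExceeded, and fake_driver_alloc are hypothetical and exist only for this illustration.

```python
class QuotaExceeded(MemoryError):
    """Raised when an allocation would push the caller past its memory cap."""


class QuotaAllocator:
    """Wraps an allocation function and enforces an upper limit on total bytes."""

    def __init__(self, real_alloc, limit_bytes):
        self._real_alloc = real_alloc  # the underlying call being wrapped
        self._limit = limit_bytes      # upper bound for this "container"
        self._used = 0                 # bytes handed out so far

    def alloc(self, size):
        # Check the quota before forwarding the call to the real allocator.
        if self._used + size > self._limit:
            raise QuotaExceeded(
                f"request of {size} bytes exceeds the {self._limit}-byte cap"
            )
        buf = self._real_alloc(size)
        self._used += size
        return buf


def fake_driver_alloc(size):
    """Hypothetical stand-in for a real driver allocation call."""
    return bytearray(size)


allocator = QuotaAllocator(fake_driver_alloc, limit_bytes=1024)
allocator.alloc(512)  # forwarded to the wrapped call
allocator.alloc(512)  # forwarded; the cap is now fully used
# A further allocator.alloc(1) would raise QuotaExceeded.
```

The caller keeps using the same alloc interface and never sees the limit until it is exceeded, which is why this style of isolation is lightweight.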
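In terms of engineering design, GPUManager has three parts: the CUDA wrapper library vcuda, the k8s device plugin gpu-manager-daemonset, and the k8s scheduling plugin gpu-quota-admission.
vcuda is a wrapper around the nvidia-ml and libcuda libraries. By intercepting the CUDA calls of user programs inside a container, it limits how much GPU compute and GPU memory the processes in that container can use.
gpu-manager-daemonset is a standard k8s device plugin that implements GPU topology awareness and device and driver mapping. GPUManager supports a shared mode and an exclusive mode. When a workload's tencent.com/vcuda-core request is greater than 0 and less than 100, it is scheduled in shared mode, which prefers to pack fragmented resources onto a single card. When the tencent.com/vcuda-core request is a multiple of 100, it is scheduled in exclusive mode: gpu-manager-daemonset builds a topology tree of the GPU cards from the GPU topology and picks the best layout (the leaf nodes with the shortest distance) for allocation. Note that GPUManager only supports requests below 100 or exact multiples of 100; it cannot schedule requests such as 150 or 220. Each GPU card has 100 units of compute, so a request is either a fraction of one card (between 0 and 1) or a whole number of cards. GPU memory is allocated in units of 256 MiB.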
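The request rule above can be written down as a small check. This is a minimal sketch of the rule as stated in this article, not code from gpu-manager; the function name classify_vcuda_core is hypothetical.

```python
def classify_vcuda_core(request):
    """Return the scheduling mode for a tencent.com/vcuda-core request.

    Rule from the text: values greater than 0 and below 100 share a card,
    exact multiples of 100 get whole cards, anything else is unsupported.
    """
    if request <= 0:
        raise ValueError("vcuda-core request must be positive")
    if request < 100:
        return "shared"
    if request % 100 == 0:
        return "exclusive"
    raise ValueError(
        f"vcuda-core request {request} is neither below 100 nor a multiple of 100"
    )


for value in (30, 100, 200):
    print(value, classify_vcuda_core(value))
```

A request of 150 raises ValueError here, mirroring the statement that such values cannot be scheduled.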
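gpu-quota-admission is a k8s scheduler extender that implements the scheduler's predicates interface. When kube-scheduler schedules a Pod that requests tencent.com/vcuda-core, it calls the predicates interface of gpu-quota-admission during the predicates phase to filter and bind nodes. gpu-quota-admission also provides GPU resource pool scheduling, which handles quotas for different GPU types within a namespace.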
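As a rough illustration of what a predicates-style filter does, the sketch below splits candidate nodes into those that can host a request and those that cannot. It is a simplification under stated assumptions, not the gpu-quota-admission implementation: the per-card free-unit lists and the function filter_nodes are hypothetical, GPU topology and quotas are ignored, and the request is assumed to be valid (below 100 or a multiple of 100).

```python
def filter_nodes(requested_core, free_core_by_node):
    """Split nodes into those that fit the request and those that do not.

    free_core_by_node maps a node name to a list holding the free
    vcuda-core units (0 to 100) of each GPU card on that node.
    """
    fits, failed = [], {}
    for node, free_per_card in free_core_by_node.items():
        if requested_core < 100:
            # Shared mode: a single card must have enough free units.
            ok = any(free >= requested_core for free in free_per_card)
        else:
            # Exclusive mode: enough completely free cards are required.
            cards_needed = requested_core // 100
            free_cards = sum(1 for free in free_per_card if free == 100)
            ok = free_cards >= cards_needed
        if ok:
            fits.append(node)
        else:
            failed[node] = "insufficient free vcuda-core"
    return fits, failed


nodes = {"node-a": [40, 100], "node-b": [20]}
print(filter_nodes(50, nodes))
print(filter_nodes(200, nodes))
```

With this input, a request for 50 units fits only node-a, and a request for 200 units fits no node because neither has two completely free cards.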
The overall GPUManager design is as follows:
[Figure: overall GPUManager architecture]
III. Deploying GPUManager
## GitHub
gpu-admission: https://github.com/tkestack/gpu-admission
gpu-manager: https://github.com/tkestack/gpu-manager
1. Driver installation
2. Deployment
1) Deploy the gpu-quota-admission service
kubectl apply -f gpu-admission.yaml
The file contents are as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-quota-admission
  namespace: kube-system
data:
  gpu-quota-admission.config: |
    {
      "QuotaConfigMapName": "gpuquota",
      "QuotaConfigMapNamespace": "kube-system",
      "GPUModelLabel": "gaia.tencent.com/gpu-model",
      "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
    }
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-quota-admission
  namespace: kube-system
spec:
  ports:
  - port: 3456
    protocol: TCP
    targetPort: 3456
  selector:
    k8s-app: gpu-quota-admission
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: gpu-quota-admission
  name: gpu-quota-admission
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: gpu-quota-admission
  template:
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      namespace: kube-system
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
            weight: 1
      containers:
      - env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS
          value: --incluster-mode=true
        image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
        imagePullPolicy: IfNotPresent
        name: gpu-quota-admission
        ports:
        - containerPort: 3456
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      dnsPolicy: ClusterFirstWithHostNet
      initContainers:
      - command:
        - sh
        - -c
        - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config
          /etc/kubernetes/'
        image: busybox
        imagePullPolicy: Always
        name: init-kube-config
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      priority: 2000000000
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: gpu-manager
      serviceAccountName: gpu-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: gpu-quota-admission
        name: config
2) Deploy gpu-manager-daemonset
kubectl apply -f gpu-manager.yaml
The file contents are as follows:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-manager-metric
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 5678
    protocol: TCP
    targetPort: 5678
  selector:
    name: gpu-manager-ds
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: tencent.com/vcuda-core
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # Only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
      - image: tkestack/gpu-manager:v1.1.5
        imagePullPolicy: IfNotPresent
        name: gpu-manager
        securityContext:
          privileged: true
        ports:
        - containerPort: 5678
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: vdriver
          mountPath: /etc/gpu-manager/vdriver
        - name: vmdata
          mountPath: /etc/gpu-manager/vm
        - name: log
          mountPath: /var/log/gpu-manager
        - name: checkpoint
          mountPath: /etc/gpu-manager/checkpoint
        - name: run-dir
          mountPath: /var/run
        - name: cgroup
          mountPath: /sys/fs/cgroup
          readOnly: true
        - name: usr-directory
          mountPath: /usr/local/host
          readOnly: true
        env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS
          value: "--logtostderr=false"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumes:
      - name: device-plugin
        hostPath:
          type: Directory
          path: /var/lib/kubelet/device-plugins
      - name: vmdata
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/vm
      - name: vdriver
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/vdriver
      - name: log
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/log
      - name: checkpoint
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/checkpoint
      # We have to mount the whole /var/run directory into the container, because the inode of a
      # bind-mounted docker.sock changes after the host docker is restarted
      - name: run-dir
        hostPath:
          type: Directory
          path: /var/run
      - name: cgroup
        hostPath:
          type: Directory
          path: /sys/fs/cgroup
      # We have to mount the /usr directory instead of a specific library path, because the
      # library path may not exist on some distros
      - name: usr-directory
        hostPath:
          type: Directory
          path: /usr
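3) Label the GPU nodes with nvidia-device-enable=enable (replace *.*.*.* below with the name of each GPU node)
kubectl label node *.*.*.* nvidia-device-enable=enable
4) Verify that gpu-manager-daemonset has been scheduled onto the GPU nodes
kubectl get pods -n kube-system
3. Custom scheduler
1) Prepare the custom scheduler policy file /etc/kubernetes/scheduler-policy-config.json with the following contents: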
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "PodFitsHostPorts"
    },
    {
      "name": "PodFitsResources"
    },
    {
      "name": "NoDiskConflict"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "HostName"
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ],
  "hardPodAffinitySymmetricWeight": 10,
  "alwaysCheckAllPredicates": false
}
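In "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler", the host is the in-cluster DNS name of the gpu-quota-admission Service and 3456 is its port. Change the address and port only if you have special requirements; otherwise the value works as written.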
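For a quick sanity check of the policy file, the sketch below parses the extender section and prints the URL the scheduler will call for filtering, which is urlPrefix followed by "/" and filterVerb. The helper name extender_filter_urls is hypothetical, and the embedded JSON is a trimmed copy of the policy above.

```python
import json

# Trimmed copy of the policy file; only the extender section matters here.
POLICY_TEXT = """
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ]
}
"""


def extender_filter_urls(policy_text):
    """Return the filter endpoint of every extender that defines a filterVerb."""
    policy = json.loads(policy_text)
    urls = []
    for extender in policy.get("extenders", []):
        verb = extender.get("filterVerb")
        if verb:
            urls.append(extender["urlPrefix"].rstrip("/") + "/" + verb)
    return urls


print(extender_filter_urls(POLICY_TEXT))
```

Running json.loads on the real file in the same way is also a cheap way to catch syntax errors before the scheduler restarts with it.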
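2) Modify the kube-scheduler manifest
If k8s was deployed with kubeadm, the scheduler runs as a Pod. kubelet continuously watches the manifest files and, when one changes, automatically restarts the Pod to load the new configuration. So here we only need to edit the scheduler's manifest file. Back it up first, outside the manifests directory, because kubelet may try to load every file in that directory as a static Pod:
cp /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/kube-scheduler.yaml.bak
Add the following two lines under the command key. Both flags belong to the legacy scheduler Policy API, which the v1.19.8 scheduler used here still supports but newer Kubernetes releases have removed:
--policy-config-file=/etc/kubernetes/scheduler-policy-config.json
--use-legacy-policy-config=true
The modified file is as follows. Besides the two flags, it also mounts the policy file and changes dnsPolicy, as marked: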
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json #### added
    - --use-legacy-policy-config=true #### added
    image: 10.2.57.16:5000/kubernetes/kube-scheduler:v1.19.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.json #### mount the policy file
      name: policyconfig
      readOnly: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet #### changed DNS policy
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: policyconfig
status: {}
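After you save and exit, the change takes effect automatically.
You can confirm it with the following command:
[root@cri3dp1 manifests]# kubectl -n kube-system get pod | grep sch
kube-scheduler-cri3dp1 1/1 Running 0 141m
In the output, find the Pod named kube-scheduler-XXX and check its AGE column to see whether it has just started. If it has, the scheduler configuration has been updated.
4. View GPU node information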
[root@cri3dp1 manifests]# kubectl describe node k8s-node3
.........
Capacity:
  cpu: 20
  ephemeral-storage: 958487280Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 65492456Ki
  pods: 110
  tencent.com/vcuda-core: 100
  tencent.com/vcuda-memory: 96
Allocatable:
  cpu: 20
  ephemeral-storage: 883341875786
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 65390056Ki
  pods: 110
  tencent.com/vcuda-core: 100
  tencent.com/vcuda-memory: 96
.........
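IV. Testing
The tests use the TensorFlow framework, with built-in MNIST, CIFAR-10, and AlexNet benchmark test sets, so you can choose a test plan as needed.
Test steps:
1. Use TensorFlow with the MNIST dataset for verification. TensorFlow image:
ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
2. Create a test workload with the following YAML file: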
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "32"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "32"
3. Enter the test container (it runs in the default namespace; if you changed the test YAML, specify the namespace accordingly)
kubectl exec -it `kubectl get pods -o name | cut -d '/' -f2` -- bash
4. Run a test command; choose the training framework/dataset as needed
a. MNIST
cd /data/tensorflow/mnist && time python convolutional.py
b. AlexNet
cd /data/tensorflow/alexnet && time python alexnet_benchmark.py
c. CIFAR-10
cd /data/tensorflow/cifar10 && time python cifar10_train.py
5. On the physical host, run nvidia-smi pmon -s u -d 1 to view GPU resource usage
V. Using GPUs in Pods
YAML examples are given below:
1) An application using one full P4 card:
apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: gpu
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "100"
      requests:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "100"
2) An application using 0.3 of a card and 5 GiB of GPU memory (20 units of 256 MiB):
apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: gpu
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "30"
        tencent.com/vcuda-memory: "20"
      requests:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "30"
        tencent.com/vcuda-memory: "20"
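The resource values in these examples follow from the units described earlier: one card is 100 vcuda-core units and one vcuda-memory unit is 256 MiB. The helper below is a small sketch of that arithmetic; the function name vcuda_resources is hypothetical, and rounding the memory up to a whole unit is an assumption made for this illustration.

```python
import math

CORE_UNITS_PER_CARD = 100   # one GPU card is 100 vcuda-core units
MEMORY_UNIT_MIB = 256       # one vcuda-memory unit is 256 MiB


def vcuda_resources(gpu_fraction, memory_gib):
    """Convert a card fraction and a memory size in GiB into resource values."""
    core = round(gpu_fraction * CORE_UNITS_PER_CARD)
    # Round up so the container gets at least the requested memory (assumption).
    memory_units = math.ceil(memory_gib * 1024 / MEMORY_UNIT_MIB)
    return {
        "tencent.com/vcuda-core": str(core),
        "tencent.com/vcuda-memory": str(memory_units),
    }


print(vcuda_resources(0.3, 5))
print(vcuda_resources(0.5, 8))
```

The first call reproduces the values of example 2) above ("30" and "20"), and the second reproduces the test workload from section IV ("50" and "32").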