Prerequisites
Install Golang
(1) Install with snap
snap install go --classic
(2) Install from the binary tarball
For the download link and full steps, see the linked page. The main steps are:
rm -rf /usr/local/go && tar -C /usr/local -xzf go1.21.6.linux-amd64.tar.gz
## Append the following line to ~/.bashrc
export PATH=$PATH:/usr/local/go/bin
## Reload the shell configuration, then verify the installation
source ~/.bashrc
go version
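As a quick sanity check after installation, the sketch below compares the installed Go version against a minimum. The helper name `version_ge` and the `1.21` threshold are assumptions for illustration (1.21 simply matches the tarball used above); check the `go.mod` file in the dcgm-exporter repository for the real requirement. It relies on GNU `sort -V`.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical minimum; adjust to what dcgm-exporter's go.mod requires.
MIN_GO="1.21"

if command -v go >/dev/null 2>&1; then
  # "go version" prints a line such as "go version go1.21.6 linux/amd64".
  installed=$(go version | awk '{print $3}' | sed 's/^go//')
  if version_ge "$installed" "$MIN_GO"; then
    echo "Go $installed is new enough (>= $MIN_GO)"
  else
    echo "Go $installed is older than $MIN_GO"
  fi
else
  echo "go not found in PATH; re-check the PATH export above"
fi
```

If the last message appears, the `PATH` line has most likely not been loaded into the current shell yet.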
Install DCGM
For the download link and full steps, see the linked page. The main steps are:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
apt-get update
apt-get install -y datacenter-gpu-manager
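Build from Source
DCGM-Exporter can be deployed in other ways as well; see its GitHub repository for those. This section focuses on building from source and configuring the result.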
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
## Work around go mod download timeouts by using a proxy
go env -w GOPROXY=https://goproxy.cn,direct
make binary
make install
...
cd cmd/dcgm-exporter
cp dcgm-exporter /usr/bin
## For temporary testing only
dcgm-exporter &
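To confirm that the exporter started in the background is actually serving metrics, a small filter like the one below can summarize its output. This is only a sketch: `summarize_dcgm_metrics` is a hypothetical helper name, and the address in the usage line assumes the exporter listens on port 9400, the same port used in the Prometheus configuration later in this article.

```shell
#!/bin/sh
# Read Prometheus exposition text on stdin and print one line per
# DCGM metric name, followed by the number of samples seen for it.
summarize_dcgm_metrics() {
  # Keep sample lines only ("# HELP" / "# TYPE" lines start with "#"),
  # cut everything from the first "{" or space, then count per name.
  grep '^DCGM_' | sed 's/[{ ].*$//' | sort | uniq -c | awk '{print $2, $1}'
}
```

Typical usage would be `curl -s http://localhost:9400/metrics | summarize_dcgm_metrics`. Empty output suggests that the exporter is not running or that no counters are enabled.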
Metrics File
cd yourpath/dcgm-exporter/etc
cp default-counters.csv custom-counters.csv
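The copied file is where individual metrics get switched on or off. In the default file, each active line has the form `DCGM field, Prometheus metric type, help message`, and lines starting with `#` are treated as comments. As a sketch, the hypothetical helper `enable_counter` below uncomments one field in place; it assumes GNU `sed`, and the field name in the usage note is only an example that may not be present (or commented out) in your copy of the file.

```shell
#!/bin/sh
# enable_counter FILE FIELD: remove the leading "#" from the line that
# defines FIELD, so that the exporter starts collecting it.
enable_counter() {
  file=$1
  field=$2
  # Match "#", optional spaces, the exact field name, then a comma.
  sed -i "s/^#[[:space:]]*\(${field}[[:space:]]*,\)/\1/" "$file"
}
```

For example, `enable_counter custom-counters.csv DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` would activate that field if it is listed as a comment. The exporter has to be restarted to pick up the change.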
Autostart Service
cd /etc/systemd/system
cat > dcgm_exporter.service << EOF
[Unit]
Description=Prometheus dcgm Exporter
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/dcgm-exporter -f yourpath/dcgm-exporter/etc/custom-counters.csv
[Install]
WantedBy=multi-user.target
EOF
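systemd requires the executable in `ExecStart` to be given as an absolute path, and a relative counters path would be opened relative to the service's working directory, so it is worth checking the real path before substituting it for `yourpath`. A minimal sketch, assuming a hypothetical helper name `check_counters_path`:

```shell
#!/bin/sh
# check_counters_path PATH: succeed only if PATH is absolute and names
# a readable file, i.e. it is safe to put after "-f" in ExecStart.
check_counters_path() {
  case $1 in
    /*) ;;  # absolute path, continue with the readability check
    *)  echo "not an absolute path: $1" >&2; return 1 ;;
  esac
  if [ ! -r "$1" ]; then
    echo "file missing or unreadable: $1" >&2
    return 1
  fi
  echo "ok: $1"
}
```

Calling it with the full path to `custom-counters.csv` before starting the service catches the most common cause of an immediate exit, a counters file that cannot be found.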
Common commands:
## Reload unit files after creating or editing the service file
systemctl daemon-reload
systemctl start dcgm_exporter
systemctl restart dcgm_exporter
systemctl status dcgm_exporter
systemctl enable dcgm_exporter
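Integrate with Prometheus
My Prometheus is deployed on k8s, so here is a ServiceMonitor, together with the selector-less Service and manually defined Endpoints it relies on, for reference: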
## cat gpu-monitoring.yaml
---
apiVersion: v1
kind: Endpoints
metadata:
  name: gpu-server
  namespace: kubesphere-monitoring-system
  labels:
    k8s-app: gpu-server
subsets:
  - addresses:
      - ip: 10.x.x.111
      - ip: 10.x.x.133
    ports:
      - name: metrics
        port: 9400
        protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-server
  namespace: kubesphere-monitoring-system
  labels:
    k8s-app: gpu-server
spec:
  ports:
    - name: metrics
      port: 9400
      protocol: TCP
      targetPort: 9400
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.2.0"
  name: gpu-server
  namespace: kubesphere-monitoring-system
spec:
  endpoints:
    - interval: 10s
      path: /metrics
      targetPort: 9400
      port: metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
      - kubesphere-monitoring-system
  selector:
    matchLabels:
      k8s-app: gpu-server
Apply it with:
kubectl apply -f gpu-monitoring.yaml
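Integrate with Grafana
We can use a template from the official Grafana dashboards site. Here we use dashboard 12639, whose variables need to be adjusted to match the actual environment:
After importing the dashboard and adjusting the variables correctly, the final GPU monitoring view looks like this:
Monitoring Alerts
Deploying DCGM-exporter on k8s
For details, see my other article, "Monitoring GPUs on K8S with DCGM and Prometheus".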