Prerequisites

Installing Golang

(1) Install with snap

snap install go --classic

(2) Install from the binary package

For the download link and full instructions, see the linked page. The main steps are:

rm -rf /usr/local/go && tar -C /usr/local -xzf go1.21.6.linux-amd64.tar.gz
## Add the following line to ~/.bashrc
export PATH=$PATH:/usr/local/go/bin
## Then reload it with: source ~/.bashrc
go version
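
Because `make binary` later compiles DCGM-Exporter with this toolchain, it can help to confirm that the installed Go is recent enough before continuing. The following is a minimal sketch, not part of the original steps: the helper name `go_version_at_least` and the threshold `1.21` in the usage comment are assumptions for illustration, so check the `go.mod` in the dcgm-exporter repository for the version it actually requires. It relies on GNU `sort -V`.

```shell
# go_version_at_least OUTPUT MIN
# Succeeds if the version found in a `go version` output line
# (e.g. "go version go1.21.6 linux/amd64") is greater than or equal to MIN.
go_version_at_least() {
  # The third field looks like "go1.21.6"; strip the leading "go".
  have=$(printf '%s\n' "$1" | awk '{ sub(/^go/, "", $3); print $3 }')
  # With a version-aware sort, MIN comes first only when MIN <= installed version.
  lowest=$(printf '%s\n%s\n' "$have" "$2" | sort -V | head -n 1)
  [ -n "$have" ] && [ "$lowest" = "$2" ]
}

# Usage (the 1.21 threshold is a hypothetical value):
#   go_version_at_least "$(go version)" 1.21 || echo "Go is too old"
```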

Installing DCGM

For the download link and full instructions, see the linked page. The main steps are:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
apt-get update
apt-get install -y datacenter-gpu-manager

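Building from source

For other ways to deploy DCGM-Exporter, see the project's GitHub page. This section focuses on building from source and configuring the result.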

git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
## Work around go mod download timeouts by using a module proxy
go env -w GOPROXY=https://goproxy.cn,direct
make binary
make install
...
cd cmd/dcgm-exporter
cp dcgm-exporter /usr/bin
## For a quick temporary test only
dcgm-exporter &
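
Once the exporter is running in the background, a quick sanity check is to confirm that it actually serves DCGM metrics. The sketch below assumes the exporter listens on port 9400 with the path `/metrics` (the same values used in the ServiceMonitor later in this article); the helper name `count_dcgm_samples` is hypothetical. It only counts sample lines whose metric name starts with `DCGM_`, so a result of 0 suggests that the exporter is up but collecting nothing.

```shell
# count_dcgm_samples
# Reads Prometheus text-format output on stdin and prints the number of
# sample lines whose metric name starts with "DCGM_".
# "# HELP" and "# TYPE" lines are not counted because they start with "#".
count_dcgm_samples() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the helper from failing.
  grep -c '^DCGM_' || true
}

# Usage (assumes the exporter is reachable on localhost:9400):
#   curl -s http://localhost:9400/metrics | count_dcgm_samples
```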

Metrics file

cd yourpath/dcgm-exporter/etc
cp default-counters.csv custom-counters.csv
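
Working on a copy lets you tune the collected metrics without touching the defaults. In the default file shipped with the repository, lines starting with `#` are comments and every other line has the form `DCGM field, Prometheus metric type, help message`, so a metric is enabled or disabled by uncommenting or commenting its line. As a rough sketch (the helper name `list_enabled_counters` is hypothetical), the following prints the fields that are currently active in a counters file:

```shell
# list_enabled_counters FILE
# Prints the DCGM field name (first CSV column) of every active line in FILE,
# skipping blank lines and lines commented out with "#".
list_enabled_counters() {
  awk -F',' '
    /^[ \t]*#/ { next }   # comment line: metric disabled
    /^[ \t]*$/ { next }   # blank line
    {
      name = $1
      sub(/^[ \t]+/, "", name)   # trim leading whitespace
      sub(/[ \t]+$/, "", name)   # trim trailing whitespace
      print name
    }
  ' "$1"
}

# Usage:
#   list_enabled_counters custom-counters.csv
```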

Autostart script

cd /etc/systemd/system
cat > dcgm_exporter.service << EOF
[Unit]
Description=Prometheus dcgm Exporter
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/dcgm-exporter -f yourpath/dcgm-exporter/etc/custom-counters.csv

[Install]
WantedBy=multi-user.target
EOF

Common commands:

## Reload systemd so that it picks up the new unit file
systemctl daemon-reload
systemctl start dcgm_exporter
systemctl restart dcgm_exporter
systemctl status dcgm_exporter
systemctl enable dcgm_exporter

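Integrating with Prometheus

My Prometheus is deployed on k8s, while the exporters run on GPU servers outside the cluster. The Service below therefore has no selector and is backed by a manually defined Endpoints object. Here is a ServiceMonitor setup for reference: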

## cat gpu-monitoring.yaml
---
apiVersion: v1
kind: Endpoints
metadata:
  name: gpu-server
  namespace: kubesphere-monitoring-system
  labels:
    k8s-app: gpu-server
subsets:
- addresses:
  - ip: 10.x.x.111
  - ip: 10.x.x.133
  ports:
  - name: metrics
    port: 9400
    protocol: TCP

---
apiVersion: v1
kind: Service
metadata:
  name: gpu-server
  namespace: kubesphere-monitoring-system
  labels:
    k8s-app: gpu-server
spec:
  ports:
  - name: metrics
    port: 9400
    protocol: TCP
    targetPort: 9400

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.2.0"
  name: gpu-server
  namespace: kubesphere-monitoring-system
spec:
  endpoints:
  - interval: 10s
    path: /metrics
    port: metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kubesphere-monitoring-system
  selector:
    matchLabels:
      k8s-app: gpu-server

Apply it: kubectl apply -f gpu-monitoring.yaml

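Integrating with Grafana

We can use a template from the official Grafana dashboards site. Here we use dashboard ID 12639, whose variables need to be adjusted to match your environment:

After importing the dashboard and adjusting the variables, the resulting GPU monitoring view looks like this: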

Monitoring alerts

Deploying DCGM-Exporter on k8s

See my other article, "Monitoring GPUs with DCGM and Prometheus on K8S".
