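Preface
Note that "monitoring Dell server hardware" in the title means monitoring hardware health (the status of disks, memory, power supplies and so on). It does not mean monitoring performance or usage such as disk space or memory consumption. It is comparable to having Zabbix read hardware status from iDRAC over SNMP.
Many companies today use Prometheus to monitor containers and services and Zabbix to monitor hardware and ports, and other monitoring architectures exist as well. This document does not compare them; it only records one setup. Basic concepts are not explained in detail, so it suits readers who already have some Prometheus experience and is not aimed at newcomers.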
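Prerequisites
<1> Enable SNMP on the iDRAC of every server to be monitored and set a community string, which works like a password (the default is public).
Remember the community string you set; it is needed later.
<2> For security reasons the network is usually restricted. Pick a server that can ping the iDRAC IP address of every monitored server and install the SNMP monitoring components on it.
<3> The Prometheus server must be able to reach snmp_exporter.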
Component installation
Installing dependencies
yum -y install gcc gcc-c++ make net-snmp net-snmp-utils net-snmp-libs net-snmp-devel golang git
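With net-snmp-utils installed, it may be worth confirming that this host can actually query an iDRAC over SNMP before going further. The following is a minimal sketch and not part of the original setup: the function name check_idrac_snmp and the example address are hypothetical, and it assumes SNMP v2c and queries the standard sysDescr.0 OID (1.3.6.1.2.1.1.1.0).

```shell
#!/bin/bash
# Hypothetical helper: ask one iDRAC for sysDescr.0 over SNMP v2c.
# $1 = iDRAC IP, $2 = community string (defaults to "public").
check_idrac_snmp() {
    local host="$1"
    local community="${2:-public}"
    # -t 5: wait 5 seconds per attempt; -r 1: retry once
    if snmpget -v2c -c "$community" -t 5 -r 1 "$host" 1.3.6.1.2.1.1.1.0 >/dev/null 2>&1; then
        echo "OK $host"
    else
        echo "FAIL $host"
        return 1
    fi
}

# Example (placeholder address):
#   check_idrac_snmp 123.123.123.123 public
```

A FAIL result usually points at one of the prerequisites above: SNMP not enabled on the iDRAC, a wrong community string, or a network restriction.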
Installing snmp_exporter
<1> Download snmp_exporter
https://github.com/prometheus/snmp_exporter/releases
cd /data
wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
tar xf snmp_exporter-0.20.0.linux-amd64.tar.gz
mv snmp_exporter-0.20.0.linux-amd64 snmp_exporter
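<2> Configure the startup method
Configure the service according to your OS version. Do not start it yet, because snmp.yml has not been generated.
CentOS 7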
cat /usr/lib/systemd/system/snmp-exporter.service
[Unit]
Description=SNMP exporter
Documentation=https://github.com/prometheus/snmp_exporter
[Service]
ExecStart=/data/snmp_exporter/snmp_exporter \
--config.file=/data/snmp_exporter/snmp.yml \
--web.listen-address=:9116 \
--snmp.wrap-large-counters
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
Management commands:
systemctl daemon-reload
systemctl enable snmp-exporter
systemctl restart snmp-exporter
systemctl status snmp-exporter
systemctl stop snmp-exporter
CentOS 6
cat /etc/init.d/snmp_exporter
#!/bin/bash
# chkconfig: 2345 80 80
# description: Start and Stop snmp_exporter
# Source function library.
. /etc/init.d/functions
prog_name="snmp_exporter"
# Full path of the binary, not just its directory
prog_path="/data/${prog_name}/${prog_name}"
pidfile="/var/run/${prog_name}.pid"
prog_logs="/data/${prog_name}/${prog_name}.log"
options="--config.file=/data/snmp_exporter/snmp.yml --web.listen-address=:9116 --snmp.wrap-large-counters"
DESC="snmp_exporter"
[ -x "${prog_path}" ] || exit 1
RETVAL=0
start(){
    action $"Starting $DESC..." su -s /bin/sh -c "nohup $prog_path $options >> $prog_logs 2>&1 &" 2> /dev/null
    RETVAL=$?
    PID=$(pidof ${prog_path})
    [ ! -z "${PID}" ] && echo ${PID} > ${pidfile}
    echo
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
    return $RETVAL
}
stop(){
    echo -n $"Shutting down $prog_name: "
    killproc -p ${pidfile}
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
    return $RETVAL
}
restart() {
    stop
    start
}
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        restart
        ;;
    status)
        status $prog_path
        RETVAL=$?
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|status}"
        RETVAL=1
        ;;
esac
exit $RETVAL
------------------------------------------------------------
cat /etc/sysconfig/snmp_exporter
ARGS=""
------------------------------------------------------------
Management commands:
chmod +x /etc/init.d/snmp_exporter
chkconfig snmp_exporter on
/etc/init.d/snmp_exporter restart
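Downloading MIBs and generating snmp.yml
MIB and OID
An OID is an identifier exposed by the SNMP agent that uniquely identifies one object or piece of information. It is a dotted string of numbers such as 1.3.6.1.4.1.4413.1.3.2.1.17.
A MIB is a database that stores, in a tree structure, the information corresponding to each OID.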
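Because OIDs form a tree, one OID sits under another exactly when the second is a dotted prefix of the first. A small sketch of that relationship follows. The function name oid_in_subtree is made up for illustration, and 1.3.6.1.4.1.674 is used as an example subtree on the assumption that it is Dell's enterprise branch; verify this against the MIB files you download.

```shell
#!/bin/bash
# Hypothetical helper: succeed if OID $1 lies inside the subtree rooted at $2.
oid_in_subtree() {
    local oid="${1#.}"      # tolerate a leading dot, e.g. ".1.3.6.1"
    local root="${2#.}"
    case "$oid" in
        "$root"|"$root".*) return 0 ;;   # the root itself, or root followed by ".<more>"
        *) return 1 ;;
    esac
}

# Example: the OID quoted above is under 1.3.6.1.4.1 (enterprises),
# but not under 1.3.6.1.4.1.674 (assumed here to be Dell's branch).
#   oid_in_subtree 1.3.6.1.4.1.4413.1.3.2.1.17 1.3.6.1.4.1      # succeeds
#   oid_in_subtree 1.3.6.1.4.1.4413.1.3.2.1.17 1.3.6.1.4.1.674  # fails
```

The same prefix idea is what the walk entry in generator.yml relies on: walking an OID collects everything in the subtree beneath it.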
<1> Download the MIBs that match your server model, and check which systems they are compatible with
https://www.dell.com/support/search/zh-cn#q=mibs&sort=relevancy&f:langFacet=[zh]
wget https://dl.dell.com/FOLDER06009600M/1/Dell-OM-MIBS-940_A00.zip
unzip Dell-OM-MIBS-940_A00.zip
<2> Inspect the OIDs
snmptranslate -Tz -m /root/support/station/mibs/iDRAC-SMIv2.mib
cp /usr/share/snmp/mibs/SNMPv2-SMI.txt /root/support/station/mibs/
<3> Generate snmp.yml
Official documentation:
https://github.com/prometheus/snmp_exporter/tree/main/generator#file-format
# Set environment variables
export GO111MODULE=on
export GOPROXY=https://mirrors.aliyun.com/goproxy/
export MIBDIRS=/root/support/station/mibs/
# Fetch the generator
go get github.com/prometheus/snmp_exporter/generator
cd ${GOPATH-$HOME/go}/pkg/mod/github.com/prometheus/snmp_exporter@v0.20.0/generator
go build
# Edit generator.yml
(set community to the SNMP community string of your iDRAC)
vim generator.yml
modules:
  idrac:
    walk:
      - 1.3.6.1
    version: 2
    timeout: 30s
    auth:
      community: public
# Generate the metric definitions
./generator generate
cp -r snmp.yml /data/snmp_exporter/
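<4> Start snmp_exporter
systemctl restart snmp-exporter        # CentOS 7
/etc/init.d/snmp_exporter restart      # CentOS 6
<5> Test whether metrics can be scraped
Open the snmp_exporter web page on port 9116 and fill in the form.
Notes:
Target: enter the IP of the remote management card (iDRAC) of the server to scrape. The IP of a NIC configured inside the server's operating system does not work.
Module: enter the module name from snmp.yml, that is, the key directly above walk.
If some of your servers use a different SNMP community string, copy snmp.yml to a new file and change the community: xxx entry at the very end of it.
cat snmp.yml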
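The same check can be scripted instead of using the web form, since snmp_exporter serves probes at /snmp with target and module query parameters. The helper below is only a sketch: the name snmp_probe_url and the addresses are placeholders, and the module is assumed to be the idrac module defined in generator.yml.

```shell
#!/bin/bash
# Hypothetical helper: print the URL that asks snmp_exporter to probe one iDRAC.
# $1 = snmp_exporter host:port, $2 = iDRAC IP, $3 = module name in snmp.yml.
snmp_probe_url() {
    local exporter="$1"
    local target="$2"
    local module="${3:-idrac}"
    printf 'http://%s/snmp?target=%s&module=%s\n' "$exporter" "$target" "$module"
}

# Example (placeholder addresses); a working probe returns metric lines:
#   curl -s "$(snmp_probe_url 127.0.0.1:9116 123.123.123.123 idrac)" | head
```

This is the same URL Prometheus ends up requesting once the relabel rules in the job configuration have moved the iDRAC address into the target parameter.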
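Prometheus configuration
However Prometheus was installed, an existing installation does not need to be reinstalled. What matters is adding an iDRAC section to prometheus.yml.
A separate document on Prometheus monitoring and alerting may follow later.
prometheus.yml configuration
<1> Configure where alerting rules are loaded from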
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rule/*.yml"
# - "second_rules.yml"
Create the directory for alerting rules and write the rule file inside it
mkdir rule
vim rule/idrac.yml
<2> Configure the job and set the metrics to collect or exclude
Method 1
static_configs
  - job_name: 'IDRAC'
    scrape_interval: 180s   # interval between scrapes
    scrape_timeout: 180s    # scrape timeout
    static_configs:
      - targets:
          - 123.123.123.123          # iDRAC IP to monitor; the default SNMP port is 161
          # - 123.123.123.123:161    # a port can be appended if it is not the default
        # labels:                    # optional labels, e.g. the internal IP for this iDRAC or the data center
        #   IP: 'xxx'
        #   project: 'xxx'
    metrics_path: /snmp
    params:
      module: [idrac]       # module name defined in generator.yml and snmp.yml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: xxxxx:9116   # your snmp_exporter server
With this method, every monitored server has to be listed under targets, so a few hundred servers make prometheus.yml very long.
Method 2
file_sd_configs
  - job_name: "IDRAC"
    params:
      module:
        - idrac
    scrape_interval: 180s
    scrape_timeout: 180s
    metrics_path: /snmp
    file_sd_configs:
      - files:
          - targets/*.json      # JSON files to read; the directory name is arbitrary but the directory must exist
        refresh_interval: 5m    # how often the files are re-read
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: xxxx:9116  # your snmp_exporter server
With this method the targets are written into JSON files in the following format:
cat targets/idrac.json
[
  {
    "targets": [
      "123.123.123.123:161"
    ],
    "labels": {
      "IP": "xxxx",
      "Project": "xxx"
    }
  },
  {
    "targets": [
      "123.123.123.124:161"
    ],
    "labels": {
      "IP": "xxx",
      "Project": "xxx"
    }
  }
]
or
[
  {
    "targets": [
      "123.123.123.123:161",
      "123.123.123.124:161"
    ],
    "labels": {
      "IP": "xxxx",
      "Project": "xxx"
    }
  }
]
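With many servers, a targets file in the second (grouped) form can be generated rather than written by hand. The following is a rough sketch under stated assumptions: the function name make_idrac_targets and the input file name in the usage line are hypothetical, every target shares one Project label, port 161 is appended to each address, and values are not JSON-escaped, so it only suits plain IPs and simple label text.

```shell
#!/bin/bash
# Hypothetical helper: read iDRAC IPs (one per line) on stdin and print a
# file_sd JSON document in which all targets share a single label set.
# $1 = value for the "Project" label (a placeholder label, as in the samples above).
make_idrac_targets() {
    local project="$1"
    local ip sep=""
    printf '[\n  {\n    "targets": ['
    # "|| [ -n "$ip" ]" keeps a last line that has no trailing newline
    while IFS= read -r ip || [ -n "$ip" ]; do
        [ -n "$ip" ] || continue          # skip blank lines
        printf '%s\n      "%s:161"' "$sep" "$ip"
        sep=","
    done
    printf '\n    ],\n    "labels": {\n      "Project": "%s"\n    }\n  }\n]\n' "$project"
}

# Example (idrac_ips.txt is a hypothetical list of addresses):
#   make_idrac_targets xxx < idrac_ips.txt > targets/idrac.json
```

Prometheus re-reads the files on the configured refresh_interval, so regenerating the file is enough to add or remove targets.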
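Method 3
consul_sd_configs
With this method the targets are registered in Consul, and Prometheus discovers them automatically through Consul.
Consul itself is not covered in detail here. If you have not used Consul or configured Prometheus alerting before, skip this method for now, as it is not easy to follow.
  - job_name: 'IDRAC'
    params:
      module:
        - idrac
    scrape_interval: 180s
    scrape_timeout: 180s
    metrics_path: /snmp
    consul_sd_configs:
      - server: 'monitor-consul.com:8500'   # domain name of your Consul service; an IP also works
        tag_separator: ','
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*idrac.*     # keep only services whose Consul tags match this regex
        action: keep
      - source_labels: ['__meta_consul_service_address']   # the iDRAC IP registered as the service Address; shown under Prometheus -> Targets -> IDRAC -> Endpoint
        target_label: __param_target
      - source_labels: ['__param_target']
        target_label: instance
      - source_labels: ['__meta_consul_service_metadata_eth_ip']   # Consul Meta key eth-ip; Prometheus replaces "-" with "_" in label names
        target_label: IP
      - target_label: __address__
        replacement: xxx:9116
With this method the service has to be registered in Consul. There are two ways to register: statically and from a file.
A JSON example follows; adapt it to your needs. The labels are up to you, but they must match the keywords of your DingTalk alert group and your Alertmanager configuration.
cat consul-idrac.json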
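{
  "ID": "IDRAC-xxx",
  "Name": "IDRAC-xxx",
  "Tags": [
    "idrac"
  ],
  "Address": "xxx",
  "Meta": {
    "eth-ip": "xxx",
    "project": "beijing"
  },
  "EnableTagOverride": false,
  "Check": {
    "HTTP": "http://xxxx:9116/metrics",
    "Interval": "10s"
  },
  "Weights": {
    "Passing": 10,
    "Warning": 1
  }
}
Field notes (kept outside the file because JSON does not allow comments): Address is the iDRAC IP. Meta holds Consul metadata that is later rewritten into Prometheus labels; eth-ip is the server's business IP and project is its location. Check.HTTP is a health check against the IP and port of your snmp_exporter server.
Note: the health check targets snmp_exporter itself, so Consul reports the service as healthy even if the Address or other fields above are wrong. This does not affect Prometheus monitoring: once the service is registered, Prometheus only reads the service's values and labels from Consul and then scrapes according to its own configuration. For SNMP the second JSON form below is the better fit.
or
cat consul-idrac2.json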
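{
  "ID": "IDRAC-xxx",
  "Name": "IDRAC-xxx",
  "Tags": [
    "idrac"
  ],
  "Address": "xxx:161",
  "Meta": {
    "eth-ip": "xxx",
    "project": "beijing"
  }
}
Here Meta again holds the Consul metadata that is later rewritten into Prometheus labels (eth-ip: the server's business IP, project: its location).
Register
curl --request PUT --data @consul-idrac.json "http://monitor-consul.com:8500/v1/agent/service/register?replace-existing-checks=1"
Deregister
curl -X PUT http://monitor-consul.com:8500/v1/agent/service/deregister/IDRAC-xxx
Result: the registered services appear as targets of the IDRAC job on the Prometheus Targets page.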
Alerting rule configuration
Check which metrics are present in your snmp.yml. Not every metric listed there is actually usable, so search for each one in Prometheus first.
cat rule/idrac.yml
groups:
  - name: IDRAC-physical-server-hardware-status
    rules:
      - alert: IDRAC status
        expr: up{job=~"IDRAC.*"} == 0
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} IDRAC is abnormal"
      - alert: Chassis overall status
        expr: chassisStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall chassis component status is abnormal, please check promptly!"
          description: "{{$labels.instance}} chassis components are abnormal"
      - alert: Chassis CMOS battery overall status
        expr: systemBatteryStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall chassis CMOS battery status is abnormal, please check promptly!"
          description: "{{$labels.instance}} chassis CMOS battery status is abnormal"
      - alert: Memory module status
        expr: memoryDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Memory module status is abnormal, please check promptly!"
          description: "{{$labels.instance}} memory module {{$labels.memoryDeviceIndex}} is abnormal"
      - alert: Processor (CPU) overall status
        expr: processorDeviceStatusStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall processor (CPU) status is abnormal, please check promptly!"
          description: "{{$labels.instance}} CPU {{$labels.processorDeviceStatusIndex}} is abnormal"
      - alert: NIC status
        expr: networkDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} NIC {{$labels.networkDeviceIndex}} is abnormal"
      - alert: Power supply overall status
        expr: powerSupplyStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall power supply status is abnormal, please check promptly!"
          description: "{{$labels.instance}} power supply {{$labels.powerSupplyIndex}} status is abnormal"
      - alert: Storage controller overall status
        expr: globalStorageStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Storage controller status is abnormal, please check promptly!"
          description: "{{$labels.instance}} storage controller is abnormal"
      - alert: Physical system overall status
        expr: globalSystemStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall status of the physical system components is abnormal, please check promptly!"
          description: "{{$labels.instance}} physical system components are abnormal"
      - alert: Physical disk status
        expr: physicalDiskState != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Physical disk status is abnormal, please check promptly!"
          description: "{{$labels.instance}} physical disk {{$labels.physicalDiskNumber}} is abnormal"
      - alert: Virtual disk status
        expr: virtualDiskState != 2
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Virtual disk status is abnormal, please check promptly!"
          description: "{{$labels.instance}} virtual disk {{$labels.virtualDiskNumber}} is abnormal"
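Reload Prometheus
curl -X POST http://xxxx:9090/-/reload   # IP of your Prometheus server; requires Prometheus to be started with --web.enable-lifecycle
To actually receive alerts you also need to configure Alertmanager and the DingTalk plugin prometheus-webhook-dingtalk, and add a robot to your DingTalk group. The alerting flow is not demonstrated here.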
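Since a broken prometheus.yml or rule file makes the reload fail, it can help to validate the configuration first. The sketch below wraps that in one function. The name check_and_reload is hypothetical, and it assumes promtool (shipped with Prometheus) is on the PATH and that Prometheus was started with --web.enable-lifecycle.

```shell
#!/bin/bash
# Hypothetical helper: validate prometheus.yml (promtool also checks the rule
# files it references) and only then ask Prometheus to reload.
# $1 = path to prometheus.yml, $2 = Prometheus host:port.
check_and_reload() {
    local config="$1"
    local prom="$2"
    if ! promtool check config "$config"; then
        echo "config check failed, not reloading" >&2
        return 1
    fi
    # -f makes curl return a non-zero status on an HTTP error response
    curl -fsS -X POST "http://${prom}/-/reload"
}

# Example (placeholder address):
#   check_and_reload prometheus.yml xxxx:9090
```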
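Additional notes
If SNMP scraping fails, check the following points:
<1> Whether the snmp_exporter server can reach the remote management card of the monitored server
<2> The IP in the registered JSON must be the iDRAC IP, not the server's internal IP
<3> Whether the SNMP community string of the remote management card matches the community in your snmp.yml
<4> The scrape timeout has to be tuned to your network conditions, otherwise timeouts will trigger frequent alerts
<5> If a large number of iDRACs need monitoring, split the load across several job_name entries
<6> Alert messages can be adjusted as needed. If alerts do not arrive, check the Alertmanager and prometheus-webhook-dingtalk configuration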
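For point <4>, one way to pick a sensible scrape_timeout is to measure how long a probe really takes from the snmp_exporter host. This is a minimal sketch with placeholder addresses; the function name time_snmp_probe is hypothetical and the module is assumed to be idrac.

```shell
#!/bin/bash
# Hypothetical helper: print how many seconds one probe through snmp_exporter takes.
# $1 = snmp_exporter host:port, $2 = iDRAC IP, $3 = module name (defaults to idrac).
time_snmp_probe() {
    local exporter="$1"
    local target="$2"
    local module="${3:-idrac}"
    # -o /dev/null discards the metrics; -w prints curl's total request time
    curl -s -o /dev/null -w '%{time_total}\n' \
        "http://${exporter}/snmp?target=${target}&module=${module}"
}

# Example (placeholder addresses):
#   time_snmp_probe 127.0.0.1:9116 123.123.123.123
```

If the measured time is close to the configured scrape_timeout, either raise the timeout (keeping it no larger than scrape_interval) or narrow the walk in generator.yml.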