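In this issue, our technical friends at New Oriental, Dong Zhaoning (董召宁) and Qi Chen (齐晨), share a hardware monitoring solution that uses Telegraf for data collection, Loki for log storage, and Nightingale for alert rule configuration. It is a clever combination, so let's walk through it together. We also welcome submissions from other engineers who want to share interesting monitoring and observability solutions.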

Solution Overview

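Physical servers can enable SNMP through their out-of-band management interface, and SNMP exposes the health status of each hardware module. Telegraf ships with many input and output plugins, so we can use its snmp input plugin to collect the hardware status and its loki output plugin to write the results to Loki. Nightingale is then used to configure alert rules (Loki is compatible with the Prometheus querying API) and generate alert events. Those events can be sent to DingTalk or WeCom, or forwarded to FlashDuty for follow-up handling such as alert aggregation and noise reduction, on-call scheduling, acknowledgement, and escalation.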

[Image 1: New Oriental engineers show you how to use Telegraf + Loki + Nightingale for hardware monitoring]

Hands-on Steps

1. Preparing the snmp plugin

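SNMP monitoring retrieves information from OID nodes, using either get (single value) or walk (multiple values). By default, Telegraf's snmp plugin fetches a single value; if you need multiple values, find the corresponding table-type node and collect that instead.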

# Single-value node
[[inputs.snmp.field]]
name = "uptime"
oid = ".1.3.6.1.2.1.1.3.0"

# Multi-value node, table type
[[inputs.snmp.table]]
oid = ".1.3.6.1.2.1.31.1.1"
name = "interface"
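
To make the get/walk distinction concrete, here is a minimal Python sketch that imitates both operations on a toy in-memory OID tree. It does not talk to a real SNMP agent: the `MIB` dictionary, its values, and the helper names `snmp_get` and `snmp_walk` are made up for illustration only.

```python
# Toy OID tree: OID string -> value. All values are invented for illustration.
MIB = {
    ".1.3.6.1.2.1.1.3.0": 123456,          # sysUpTime.0 (a scalar instance)
    ".1.3.6.1.2.1.31.1.1.1.1.1": "eth0",   # ifName.1     (ifXTable column 1, row 1)
    ".1.3.6.1.2.1.31.1.1.1.1.2": "eth1",   # ifName.2     (ifXTable column 1, row 2)
    ".1.3.6.1.2.1.31.1.1.1.6.1": 1000,     # ifHCInOctets.1
    ".1.3.6.1.2.1.31.1.1.1.6.2": 2000,     # ifHCInOctets.2
}


def snmp_get(mib, oid):
    """GET: exact-match lookup of a single OID instance."""
    return mib.get(oid)


def snmp_walk(mib, prefix):
    """WALK: return every instance that lives under the given subtree."""
    root = prefix.rstrip(".") + "."
    return {oid: value for oid, value in mib.items() if oid.startswith(root)}


if __name__ == "__main__":
    print(snmp_get(MIB, ".1.3.6.1.2.1.1.3.0"))
    for oid, value in snmp_walk(MIB, ".1.3.6.1.2.1.31.1.1").items():
        print(oid, value)
```

A field entry corresponds to the single lookup, while a table entry corresponds to walking the subtree and grouping the results into rows.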

2. Finding OID nodes

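OID nodes differ between server models, so look them up in the MIB file for your specific model. Besides the status OID of each individual hardware component, the MIB file usually also contains OID nodes for the overall server status, which you can collect directly. The following example is from an Inspur server: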

[Image 2: New Oriental engineers show you how to use Telegraf + Loki + Nightingale for hardware monitoring]

Healthy values are "OK" or "Normal"; when there is an alarm, the value is "WARNING" or "CRITICAL".

The OID to collect is 'INSPUR-MIB::serverSystemHealthTable' (numeric form '.1.3.6.1.4.1.37945.2.1.2.13.1'). You can first check the values with the snmpwalk command:

# Healthy node
snmpwalk -v3  1.1.1.1 INSPUR-MIB::serverSystemHealthTable
INSPUR-MIB::serverCurPowerState."" = STRING: "Power On"
INSPUR-MIB::serverUIDState."" = STRING: "UID Off"
INSPUR-MIB::serverCPUState."" = STRING: "OK"
INSPUR-MIB::serverMemoryState."" = STRING: "OK"
INSPUR-MIB::serverHDDState."" = STRING: "OK"
INSPUR-MIB::serverFANState."" = STRING: "OK"
INSPUR-MIB::serverPSUState."" = STRING: "OK"
INSPUR-MIB::serverRAIDState."" = STRING: "OK"
INSPUR-MIB::serverTempState."" = STRING: "OK"
INSPUR-MIB::serverHealthState."" = STRING: "OK"

# Faulty node
snmpwalk -v3  2.2.2.2 INSPUR-MIB::serverSystemHealthTable
INSPUR-MIB::serverCurPowerState."" = STRING: "Power On"
INSPUR-MIB::serverUIDState."" = STRING: "UID Off"
INSPUR-MIB::serverCPUState."" = STRING: "OK"
INSPUR-MIB::serverMemoryState."" = STRING: "WARNING"
INSPUR-MIB::serverHDDState."" = STRING: "OK"
INSPUR-MIB::serverFANState."" = STRING: "OK"
INSPUR-MIB::serverPSUState."" = STRING: "OK"
INSPUR-MIB::serverRAIDState."" = STRING: "OK"
INSPUR-MIB::serverTempState."" = STRING: "OK"
INSPUR-MIB::serverHealthState."" = STRING: "WARNING"
INSPUR-MIB::serverCPUStandardStatus."" = STRING: "Normal"
INSPUR-MIB::serverMemoryStandardStatus."" = STRING: "Warning"
INSPUR-MIB::serverHDDStandardStatus."" = STRING: "Normal"
INSPUR-MIB::serverFANStandardStatus."" = STRING: "Normal"
INSPUR-MIB::serverPSUStandardStatus."" = STRING: "Normal"
INSPUR-MIB::serverRAIDStandardStatus."" = STRING: "Normal"
INSPUR-MIB::serverTempStandardStatus."" = STRING: "Normal"
INSPUR-MIB::serverHealthStandardStatus."" = STRING: "Warning"

The collected status shows that the memory on host 2.2.2.2 is in an alarm state.
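
If you want to check this kind of snmpwalk output in a script rather than by eye, a small parser is enough. The sketch below is only an illustration: the function name `unhealthy_states` is hypothetical, the healthy values "OK" and "Normal" come from the description above, and skipping the power and UID columns is an assumption based on the sample output, where they report "Power On" and "UID Off" rather than a health value.

```python
import re

# Values treated as healthy, compared case-insensitively.
HEALTHY = {"ok", "normal"}

# Columns that carry power/LED information rather than a health status
# (an assumption drawn from the sample output).
NON_HEALTH_COLUMNS = {"serverCurPowerState", "serverUIDState"}

# Matches lines such as: INSPUR-MIB::serverCPUState."" = STRING: "OK"
LINE_RE = re.compile(r'^INSPUR-MIB::(\w+)\.\S*\s*=\s*STRING:\s*"([^"]*)"$')


def unhealthy_states(walk_output):
    """Return {column: value} for every health column that is not OK/Normal."""
    bad = {}
    for line in walk_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip the command line itself, blank lines, etc.
        name, value = match.groups()
        if name in NON_HEALTH_COLUMNS:
            continue
        if value.lower() not in HEALTHY:
            bad[name] = value
    return bad


SAMPLE = '''snmpwalk -v3  2.2.2.2 INSPUR-MIB::serverSystemHealthTable
INSPUR-MIB::serverCurPowerState."" = STRING: "Power On"
INSPUR-MIB::serverCPUState."" = STRING: "OK"
INSPUR-MIB::serverMemoryState."" = STRING: "WARNING"
INSPUR-MIB::serverHealthState."" = STRING: "WARNING"
INSPUR-MIB::serverMemoryStandardStatus."" = STRING: "Warning"'''

if __name__ == "__main__":
    print(unhealthy_states(SAMPLE))
```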

3. Telegraf configuration

Edit the Telegraf configuration file for the OID node identified above:

[[inputs.snmp]]
agents = ["udp://1.1.1.1:161","udp://2.2.2.2:161"]
timeout = "20s"
version = 3
sec_name = "xxx"
auth_protocol = "MD5"
auth_password = "xxx"
sec_level = "authPriv"
priv_protocol = "DES"
priv_password = "xxx"

[[inputs.snmp.table]]
oid = "INSPUR-MIB::serverSystemHealthTable"
name = "lc_syshealth"  # custom measurement name

Test the Telegraf plugin:

[root]# /opt/telegraf/telegraf --config ./telegraf-lctest.conf  --test  --input-filter snmp
2023-03-16T01:58:55Z I! Starting Telegraf 1.20.4
> lc_syshealth,agent_host=1.1.1.1,serverCurPowerState=Power\ On serverCPUState="OK",serverFANState="OK",serverHDDState="OK",serverHealthState="OK",serverMemoryState="OK",serverPSUState="OK",serverRAIDState="OK",serverTempState="OK",serverUIDState="UID Off" 1678931943000000000
> lc_syshealth,agent_host=2.2.2.2,serverCurPowerState=Power\ On serverCPUStandardStatus="Normal",serverCPUState="OK",serverFANStandardStatus="Normal",serverFANState="OK",serverHDDStandardStatus="Normal",serverHDDState="OK",serverHealthStandardStatus="Warning",serverHealthState="WARNING",serverMemoryStandardStatus="Warning",serverMemoryState="WARNING",serverPSUStandardStatus="Normal",serverPSUState="OK",serverRAIDStandardStatus="Normal",serverRAIDState="OK",serverTempStandardStatus="Normal",serverTempState="OK",serverUIDState="UID Off" 1678931955000000000
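
The test output above is in InfluxDB line protocol: the measurement name and tags, then the fields, then a nanosecond timestamp. The following Python sketch splits one such line into those parts. It is a simplified parser written for this output only: `parse_test_line` is a hypothetical helper, and it assumes every field is a double-quoted string and that tag values escape spaces as `\ `, which is what the sample shows.

```python
import re


def parse_test_line(line):
    """Split one `telegraf --test` output line into (measurement, tags, fields, ts)."""
    line = line.lstrip("> ").strip()
    # The first unescaped space separates "measurement,tags" from the fields.
    head, rest = re.split(r"(?<!\\) ", line, maxsplit=1)
    # The timestamp follows the last space; field values may contain spaces.
    fields_part, timestamp = rest.rsplit(" ", 1)
    measurement, *tag_pairs = re.split(r"(?<!\\),", head)
    tags = {}
    for pair in tag_pairs:
        key, value = pair.split("=", 1)
        tags[key] = value.replace("\\ ", " ")  # undo the escaped space
    # Only string fields (key="value") are handled in this sketch.
    fields = dict(re.findall(r'(\w+)="([^"]*)"', fields_part))
    return measurement, tags, fields, int(timestamp)


SAMPLE_LINE = (
    r'> lc_syshealth,agent_host=2.2.2.2,serverCurPowerState=Power\ On '
    r'serverHealthState="WARNING",serverMemoryState="WARNING",'
    r'serverUIDState="UID Off" 1678931955000000000'
)

if __name__ == "__main__":
    measurement, tags, fields, ts = parse_test_line(SAMPLE_LINE)
    print(measurement, tags, fields, ts)
```

Note that in this output serverCurPowerState ends up as a tag, while the health columns are string fields.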

4. Storing the data and configuring alert rules

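As the output shows, the collected values are strings, so they cannot be turned into numeric metrics. Instead, the results can be written to a log system such as ES or Loki for analysis; this article uses Loki. The Telegraf loki output plugin is configured as follows: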

[[outputs.loki]]
  domain = "http://xxx"
  endpoint = "/loki/api/v1/push"
  data_format = "json"
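
To give a rough idea of what ends up in Loki, the Python sketch below builds a JSON body in the shape accepted by Loki's push endpoint (/loki/api/v1/push): a list of streams, each with a label set and a list of [timestamp, log line] pairs. How Telegraf maps a metric onto this shape is an assumption here: the sketch puts the measurement name into a `__name` label, turns tags into labels, and joins fields as key="value" pairs in the log line. The helper name `loki_push_body` is hypothetical, and the real plugin's serialization may differ in detail.

```python
import json


def loki_push_body(measurement, tags, fields, ts_ns, name_label="__name"):
    """Build a Loki push-API body for one metric (assumed mapping, see lead-in)."""
    # Labels: the measurement name plus every tag.
    labels = {name_label: measurement, **tags}
    # Log line: all fields as key="value" pairs separated by spaces.
    line = " ".join(f'{key}="{value}"' for key, value in fields.items())
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    return {"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]}


if __name__ == "__main__":
    body = loki_push_body(
        "lc_syshealth",
        {"agent_host": "2.2.2.2"},
        {"serverHealthState": "WARNING", "serverMemoryState": "WARNING"},
        1678931955000000000,
    )
    print(json.dumps(body, indent=2))
```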

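Next, add Loki as a data source in n9e. In v6, add it directly in the web UI (System Configuration - Data Sources). In v5, it is done in the configuration files, and both webapi.conf and server.conf must be modified. Below is a sample webapi.conf configuration: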

[[Clusters]]
Name = "loki"
Prom = "http://xxx/loki"
BasicAuthUser = ""
BasicAuthPass = ""
Timeout = 30000
DialTimeout = 10000
TLSHandshakeTimeout = 30000
ExpectContinueTimeout = 1000
IdleConnTimeout = 90000
KeepAlive = 30000
MaxConnsPerHost = 0
MaxIdleConns = 100
MaxIdleConnsPerHost = 100

Configure the alert rule in n9e:

[Image 3: New Oriental engineers show you how to use Telegraf + Loki + Nightingale for hardware monitoring]
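
The exact rule expression is only visible in the screenshot, so the following is not the authors' rule. It is a hedged Python sketch of what a LogQL metric query for this data could look like, and of the instant-query URL it would be sent to. It assumes the measurement name is stored in a `__name` label, that `agent_host` is a label, and that log lines contain fields as key="value" pairs; `build_logql` and `query_url` are hypothetical helper names.

```python
from urllib.parse import urlencode


def build_logql(measurement, field, value, window="5m", name_label="__name"):
    """Count, per host, the log lines in the window where field="value" appears."""
    needle = f'{field}="{value}"'
    # Backticks are LogQL raw strings, so the inner double quotes need no escaping.
    return (
        f'sum by (agent_host) (count_over_time('
        f'{{{name_label}="{measurement}"}} |= `{needle}` [{window}]))'
    )


def query_url(base, logql):
    """Build an instant-query URL; base is the data source address, e.g. http://xxx/loki."""
    return f"{base}/api/v1/query?" + urlencode({"query": logql})


if __name__ == "__main__":
    q = build_logql("lc_syshealth", "serverHealthState", "WARNING")
    print(q)
    print(query_url("http://xxx/loki", q))
```

A rule built on such a query would fire when the result is greater than 0 for a host; adjust the label names and the window to match what your Loki instance actually stores.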

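Once the alerts are routed to a unified alert channel, they can be displayed and sent as notifications. The figure below is a screenshot of our alert events. Of course, you can also send them directly to DingTalk, WeCom, or an event center such as FlashDuty.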

[Image 4: New Oriental engineers show you how to use Telegraf + Loki + Nightingale for hardware monitoring]