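Overview

For the basic test environment, see: Prometheus+Alertmanager+PrometheusAlert+Feishu Containerized Deployment in Practice

For using Feishu robots, see: Custom Robot Guide

For PrometheusAlert custom templates, see: Custom Alert Message Template Usage Guide

For deploying PrometheusAlert, see: Deploying PrometheusAlert in Kubernetes with MySQL as Backend Storage. For testing only, sqlite3 can be used directly.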

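PrometheusAlert custom templates

This feature requires some basic familiarity with Go template syntax; the default templates are a useful syntax reference when writing your own.

Templates and related data are stored in db/PrometheusAlertDB.db under the program directory.

Get the Feishu robot webhook URL

Create your own notification robot in Feishu; see: PrometheusAlert Feishu Configuration

Verify that PrometheusAlert can send messages

Enable the Feishu alert channel by editing the configuration in the PrometheusAlert YAML file: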

# Enable the Feishu alert channel (several channels can be enabled at once): 0 = off, 1 = on
open-feishu=1
# Default Feishu robot URL
fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/1234-xxxx-xxxx-xxxx
# Re-apply the Deployment
kubectl delete -f PrometheusAlert-Deployment.yaml
kubectl apply -f PrometheusAlert-Deployment.yaml

Open in a browser: http://your-IP:30036/test

The test message should now appear in Feishu.

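Push custom data to pushgateway

For testing, we define a few simple metrics and push them to pushgateway so that the custom template feature can be verified.

# Define the data file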
# cat pgdata.txt 
# TYPE http_request_total counter
# HELP http_request_total get interface request count with different code.
cat_bad{cat_bad="1900",interface="/v1/save"} 1900
cmd_original_dire_num{cmd_original_dire_num="200",interface="/v1/delete"} 200
# TYPE http_request_time gauge
# HELP http_request_time get core interface http request time.
# title
mark{cat_bad="1900",cmd_original_dire_num="200"} 1
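
The labels on the mark sample are what later show up as alert labels, so it can help to see how a sample line breaks down. The following is a minimal sketch, not a full exposition-format parser: parse_sample and both regular expressions are hypothetical helpers written for this illustration, and they ignore escaping and timestamps.

```python
import re

# Hypothetical helpers for illustration only; the real text format has more
# rules (escaped quotes, timestamps) than these regular expressions cover.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


pgdata = """\
# TYPE http_request_total counter
cat_bad{cat_bad="1900",interface="/v1/save"} 1900
cmd_original_dire_num{cmd_original_dire_num="200",interface="/v1/delete"} 200
# title
mark{cat_bad="1900",cmd_original_dire_num="200"} 1
"""

# Keep only real samples; comment lines parse to None and are dropped.
samples = [s for s in map(parse_sample, pgdata.splitlines()) if s is not None]
for name, labels, value in samples:
    print(name, labels, value)
```

Only the mark series is referenced by the alert rule later on, so its two labels are the values the template can read.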

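For more on using pushgateway, see: Pushgateway in Practice with Prometheus

Custom template

Open the PrometheusAlert template page in a browser: http://your-IP:30036/template. Click "Add Template" (添加模板), create a template named ai-01-fs, then click "Save Template" (保存模板).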

Update the Alertmanager configuration

Change your Alertmanager configuration so that all alerts are forwarded to the PrometheusAlert custom-template endpoint, for example:

# cat /data/test/alertmanager/etc/alertmanager.yml

global:
  resolve_timeout: 2m
  smtp_from: '319981932@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '319981932@qq.com'
  # Note: use the QQ Mail authorization code here, not the login password; the code is shown in the account settings
  smtp_auth_password: 'abcdefghijklmmop'
  smtp_require_tls: false

route:
  #group_by: ['alert_node']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'prometheusalert'

receivers:
- name: 'prometheusalert'
  webhook_configs:
  - url: 'http://your-ip:30036/prometheusalert?type=fs&tpl=ai-01-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/1234-xxxx-xxxx-xxxx'
# Restart the alertmanager service
docker restart alertmanager
# Check the alertmanager service logs
docker logs alertmanager
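
The webhook URL above carries three query parameters: type, tpl and fsurl. As a minimal sketch of how that URL is assembled, the snippet below builds it with the standard library and parses it back. build_webhook_url is a hypothetical helper, not part of PrometheusAlert; the choice to leave ":" and "/" unescaped is an assumption made only to reproduce the form used in the configuration above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_webhook_url(base, tpl, fsurl):
    """Assemble the PrometheusAlert webhook URL used in webhook_configs."""
    # safe=":/" keeps the nested Feishu URL readable, as in the config above.
    query = urlencode({"type": "fs", "tpl": tpl, "fsurl": fsurl}, safe=":/")
    return f"{base}/prometheusalert?{query}"


url = build_webhook_url(
    "http://your-ip:30036",
    "ai-01-fs",
    "https://open.feishu.cn/open-apis/bot/v2/hook/1234-xxxx-xxxx-xxxx",
)
print(url)

# Parsing the query back shows what each parameter carries.
params = parse_qs(urlsplit(url).query)
print(params["tpl"], params["fsurl"])
```

If a robot URL ever contained "&" or "?", it would need to be percent-encoded, otherwise it would be split into separate parameters.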

Update the Prometheus rule file and add a pushgateway alert rule:

# cat /data/test/prometheus/rules/alert-node-rules.yml

# Add the following:
    - alert: test
      expr: mark{ job="pushgateway"} == 1
      for: 1m
      labels:
        test: mv
      annotations:
        description: "this is a test02"

Restart the Prometheus container:

docker restart prometheus

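Verify the custom-data alert

With the basic alerting setup in place, we now POST data from a client to pushgateway. Prometheus scrapes the data from pushgateway and evaluates the alert rule; once the rule fires, Alertmanager forwards the alert to PrometheusAlert, which renders it with the custom ai-01-fs template, and the Feishu robot finally delivers the message.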

Send data from the client

# The defined job_name: 'pushgateway'
curl -XPOST --data-binary @pgdata.txt http://your-IP:30037/metrics/job/pushgateway
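
For clients that cannot shell out to curl, the same push can be sketched with the Python standard library. build_push_request is a hypothetical helper; the snippet only builds the request so that it runs without a reachable pushgateway, and the actual send is left commented out.

```python
import urllib.request


def build_push_request(base_url, job, payload):
    """Build (but do not send) the POST request that the curl command issues."""
    if not payload.endswith("\n"):
        payload += "\n"  # the text exposition format requires a trailing line feed
    return urllib.request.Request(
        url=f"{base_url}/metrics/job/{job}",
        data=payload.encode("utf-8"),
        method="POST",
    )


request = build_push_request(
    "http://your-IP:30037",
    "pushgateway",
    'mark{cat_bad="1900",cmd_original_dire_num="200"} 1',
)
print(request.get_method(), request.full_url)

# To actually push, run the next line on a host that can reach pushgateway:
# urllib.request.urlopen(request, timeout=5)
```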

View pushgateway

Open in a browser: http://your-IP:30037/

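To delete these metrics, click "Delete Group" in the upper-right corner of the page, or use the command line:

# Delete all metrics in the job=pushgateway group: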
curl -X DELETE http://your-ip:30037/metrics/job/pushgateway

View Prometheus

Open in a browser: http://your-IP:9090/

View Alertmanager

Open in a browser: http://your-IP:9093/#/alerts

View the Feishu alert message

The alert rendered with the ai-01-fs template should now appear in the Feishu group.

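About custom template variables

The PrometheusAlert documentation covers this as well; here is a brief recap.

Check the log messages received by the PrometheusAlert container: while POSTing data from the client, follow the prometheusalert logs: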

kubectl logs prometheus-alert-center-6966747c-b48f7 -n kube-mon -f

The log contains data such as:

2021/11/05 03:35:34.364 [D] [value.go:476]  [1636083334364061744] {"receiver":"prometheusalert","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"test","cat_bad":"1900","cmd_original_dire_num":"200","exported_job":"pushgateway","instance":"your-IP:30037","job":"pushgateway","test":"mv"},"annotations":{"description":"this is a test02"},"startsAt":"2021-11-05T03:34:39.341Z","endsAt":"2021-11-05T03:35:24.341Z","generatorURL":"http://xxxxx:9090/graph?g0.expr=mark%7Bjob%3D%22pushgateway%22%7D+%3D%3D+1\u0026g0.tab=1","fingerprint":"xxxxxxx"}],"groupLabels":{},"commonLabels":{"alertname":"test","cat_bad":"1900","cmd_original_dire_num":"200","exported_job":"pushgateway","instance":"yourIP:30037","job":"pushgateway","test":"mv"},"commonAnnotations":{"description":"this is a test02"},"externalURL":"http://xxxxx:9093","version":"4","groupKey":"{}:{}","truncatedAlerts":0}

Extract the JSON from the log line and format it with any JSON formatter:

{
    "receiver": "prometheusalert",
    "status": "resolved",
    "alerts": [{
        "status": "resolved",
        "labels": {
            "alertname": "test",
            "cat_bad": "1900",
            "cmd_original_dire_num": "200",
            "exported_job": "pushgateway",
            "instance": "your-IP:30037",
            "job": "pushgateway",
            "test": "mv"
        },
        "annotations": {
            "description": "this is a test02"
        },
        "startsAt": "2021-11-05T03:34:39.341Z",
        "endsAt": "2021-11-05T03:35:24.341Z",
        "generatorURL": "http://xxxxx:9090/graph?g0.expr=mark%7Bjob%3D%22pushgateway%22%7D+%3D%3D+1\u0026g0.tab=1",
        "fingerprint": "xxxxxxx"
    }],
    "groupLabels": {},
    "commonLabels": {
        "alertname": "test",
        "cat_bad": "1900",
        "cmd_original_dire_num": "200",
        "exported_job": "pushgateway",
        "instance": "yourIP:30037",
        "job": "pushgateway",
        "test": "mv"
    },
    "commonAnnotations": {
        "description": "this is a test02"
    },
    "externalURL": "http://xxxxx:9093",
    "version": "4",
    "groupKey": "{}:{}",
    "truncatedAlerts": 0
}
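
If no formatter is at hand, the extraction and formatting step can be sketched in a few lines of Python. extract_payload is a hypothetical helper; it assumes the log prefix contains no "{" character, which holds for the log line shown above. The sample log line here is a shortened stand-in that keeps only a few fields.

```python
import json


def extract_payload(log_line):
    """Parse the JSON object that follows the log prefix."""
    start = log_line.index("{")  # the first brace marks the start of the payload
    return json.loads(log_line[start:])


# Shortened stand-in for the real log line; only a few fields are kept.
log_line = (
    '2021/11/05 03:35:34.364 [D] [value.go:476]  [1636083334364061744] '
    '{"receiver":"prometheusalert","status":"resolved",'
    '"alerts":[{"labels":{"alertname":"test","cat_bad":"1900"}}]}'
)

payload = extract_payload(log_line)
print(json.dumps(payload, indent=4))
```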

Then write the template against this JSON and add it in the Dashboard. An example template:

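ES_1_Standard-English data classification comparison finished! JSON classification: see the log for details
{{ range $k,$v:=.alerts }}
###### JSON classification: {{$v.labels.cat_bad}}
###### Original directory count: {{$v.labels.cmd_original_dire_num}}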
{{ end }}
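
To make the field paths concrete, here is a rough Python emulation of what the range loop in the template reads. It is not the Go template engine and does not reproduce the template's exact text; render_alerts is a hypothetical helper that only walks the same paths (.alerts, then .labels.cat_bad and .labels.cmd_original_dire_num) on a trimmed copy of the payload.

```python
def render_alerts(payload):
    """Collect, per alert, the two labels that the example template prints."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # A missing label renders as an empty string in this sketch.
        lines.append(f"cat_bad: {labels.get('cat_bad', '')}")
        lines.append(f"cmd_original_dire_num: {labels.get('cmd_original_dire_num', '')}")
    return lines


# Trimmed copy of the webhook payload shown earlier.
payload = {
    "status": "resolved",
    "alerts": [
        {
            "status": "resolved",
            "labels": {
                "alertname": "test",
                "cat_bad": "1900",
                "cmd_original_dire_num": "200",
            },
        }
    ],
}

rendered = render_alerts(payload)
for line in rendered:
    print(line)
```

Because alerts is a list, a grouped notification with several alerts produces one pair of lines per alert.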

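The JSON above can also be pasted into the Dashboard's "message protocol JSON content" (消息协议JSON内容) field to simulate a test. After adding a custom template, be sure to click Save.

Label-based routing with Alertmanager and PrometheusAlert templates

We create two robots and add them to different group chats.

Update the Prometheus alert rules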

# cat /data/test/prometheus/rules/alert-node-rules.yml

    - alert: test01
      expr: mark{ job="pushgateway"} == 1
      for: 1m
      labels:
        test: ceshi-01
      annotations:
        description: "this is a test01"

    - alert: test02
      expr: ttmark{ job="pushgateway"} == 1
      for: 1m
      labels:
        test: ceshi-02
      annotations:
        description: "this is a test02"
# Restart the prometheus service
docker restart prometheus
# Check the prometheus service logs
docker logs prometheus

Update the Alertmanager routes

global:
  resolve_timeout: 2m
  smtp_from: '319981932@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '319981932@qq.com'
  # Note: use the QQ Mail authorization code here, not the login password; the code is shown in the account settings
  smtp_auth_password: 'abcdefghijklmmop'
  smtp_require_tls: false

route:
  #group_by: ['alert_node']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'default'
  routes:
  - matchers:
    - cmd_original_dire_num="200"
    receiver: 'prometheusalert-02'
  - matchers:
    - cmd_original_dire_num="300"
    receiver: 'prometheusalert-01'

receivers:
- name: 'default'
  wechat_configs:
  - corp_id: 'your-id'
    to_user: '1234567'
    agent_id: your-id # note: this is an integer
    api_secret: 'your-secret'
    send_resolved: true

- name: 'prometheusalert-01'
  webhook_configs:
  - url: 'http://your-IP:30036/prometheusalert?type=fs&tpl=ai-01-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/xxxx-xxxx'

- name: 'prometheusalert-02'
  webhook_configs:
  - url: 'http://your-IP:30036/prometheusalert?type=fs&tpl=ai-01-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/yyyy-yyyy'
# Restart the alertmanager service
docker restart alertmanager
# Check the alertmanager service logs
docker logs alertmanager
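
The route block above can be read as "first matching child route wins, otherwise fall back to the default receiver". The sketch below is a deliberately simplified model of that behavior, not Alertmanager's implementation: it handles only equality matchers on a single level and ignores options such as continue. ROUTES, DEFAULT_RECEIVER and pick_receiver are hypothetical names.

```python
# Child routes in the order they appear in the configuration above.
ROUTES = [
    ({"cmd_original_dire_num": "200"}, "prometheusalert-02"),
    ({"cmd_original_dire_num": "300"}, "prometheusalert-01"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels):
    """Return the receiver of the first route whose matchers all equal the alert labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "test01", "cmd_original_dire_num": "200"}))
print(pick_receiver({"alertname": "test02", "cmd_original_dire_num": "300"}))
print(pick_receiver({"alertname": "other"}))
```

With this configuration, alerts labeled cmd_original_dire_num="200" reach the second robot and those labeled "300" reach the first; anything else goes to the WeChat default receiver.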

Send data from the client

# cat pgdata.txt 
# TYPE http_request_total counter
# HELP http_request_total get interface request count with different code.
cat_bad{cat_bad="1900",interface="/v1/save"} 1900
cmd_original_dire_num{cmd_original_dire_num="200",interface="/v1/delete"} 200
# TYPE http_request_time gauge
# HELP http_request_time get core interface http request time.
# title
mark{cat_bad="1900",cmd_original_dire_num="200"} 1
# cat test.txt 
# TYPE http_request_total counter
# HELP http_request_total get interface request count with different code.
cat_bad{cat_bad="1800",interface="/v1/save"} 1900
cmd_original_dire_num{cmd_original_dire_num="100",interface="/v1/delete"} 200
# TYPE http_request_time gauge
# HELP http_request_time get core interface http request time.
# title
ttmark{cat_bad="1800",cmd_original_dire_num="300"} 1
# post data
curl -XPOST --data-binary @pgdata.txt http://your-IP:30037/metrics/job/pushgateway
curl -XPOST --data-binary @test.txt http://your-IP:30037/metrics/job/pushgateway
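
Both files are pushed to the same group (job pushgateway), so it is worth noting how POST combines them: metrics whose names appear in the new push are replaced, while other metrics already in the group are kept. The snippet below is a toy model of that behavior using plain dictionaries keyed by metric name; apply_post is a hypothetical helper and nothing here talks to a real pushgateway.

```python
def apply_post(group, pushed):
    """Model POST: replace metrics whose name was pushed, keep the rest of the group."""
    merged = dict(group)
    merged.update(pushed)
    return merged


pgdata = {
    "cat_bad": ['cat_bad{cat_bad="1900",interface="/v1/save"} 1900'],
    "cmd_original_dire_num": [
        'cmd_original_dire_num{cmd_original_dire_num="200",interface="/v1/delete"} 200'
    ],
    "mark": ['mark{cat_bad="1900",cmd_original_dire_num="200"} 1'],
}
test = {
    "cat_bad": ['cat_bad{cat_bad="1800",interface="/v1/save"} 1900'],
    "cmd_original_dire_num": [
        'cmd_original_dire_num{cmd_original_dire_num="100",interface="/v1/delete"} 200'
    ],
    "ttmark": ['ttmark{cat_bad="1800",cmd_original_dire_num="300"} 1'],
}

group = apply_post({}, pgdata)   # first curl
group = apply_post(group, test)  # second curl
for name in sorted(group):
    print(name, group[name])
```

In this model both mark and ttmark survive the second push, which is what lets the two alert rules fire side by side, while cat_bad and cmd_original_dire_num end up holding the values from test.txt.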

Verification

Open pushgateway in a browser: http://your-IP:30037/

Each Feishu group should then receive the alert routed to its robot.
