本节重点介绍 :

  • 启动3个alert_receive接收端
  • 在alertmanager配置文件中编写相关路由
  • prometheus编写rule文件触发告警
  • 观察3个接收端
  • 5001 收到 alert_g_1
  • 5002 收到 alert_g_2
  • 5003 收到 alert_g_1 和 alert_g_2

分组说明

  • alertmanager可以根据设置的路由将告警可以分组处理,发送给对应的接收端
  • 三个接收组
  • sre_system接收机器告警,对应 job=node_exporter
  • sre_dba接收数据库告警,对应 job=mysqld_exporter
  • sre_all接收所有告警,对应 job=~ .*

分组实验

启动多个告警的webhook接收端,对应多个receiver

  • 之前我们写的alert_receive.go,编译成 alert_receive二进制
  • –addr指定 地址启动3个进程
  • ./alert_receive --addr=:5001
  • ./alert_receive --addr=:5002
  • ./alert_receive --addr=:5003

在alertmanager配置文件中编写相关路由

# 写配置文件
cat <<-"EOF" > /opt/app/alertmanager/alertmanager.yml
global:
  resolve_timeout: 30m

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 1h
  receiver: 'sre_all'
  routes:                                       #子路由,父路由的所有属性都会被子路由继承
    - match_re:                                   #此路由在警报标签上执行正则表达式匹配,以捕获与服务列表相关的警报
        job: node_exporter
      receiver: sre_system
      # continue=true 代表继续向下匹配,不然就break了
      continue: true
    - match_re:
        job: mysqld_exporter
      receiver: sre_dba
      continue: true
      # 默认all路由
    - match_re:
        job: .*
      receiver: sre_all
      continue: true


receivers:
- name: 'sre_system'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/alert'
- name: 'sre_dba'
  webhook_configs:
  - url: 'http://127.0.0.1:5002/alert'
- name: 'sre_all'
  webhook_configs:
  - url: 'http://127.0.0.1:5003/alert'
EOF

# reload
curl -X POST -vvv  localhost:9093/-/reload
  • 解读一下
  • job=node_exporter 由 sre_system处理 5001端口
  • job=mysqld_exporter 由 sre_dba处理 5002端口
  • 所有的告警 由 sre_all处理 5003端口
  • 重新加载alertmanager配置文件

准备prometheus 规则文件,触发告警

准备rule文件

cat <<EOF > /opt/app/prometheus/rule.yml
groups:
- name: alert_g_1
  rules:
  - alert: node_load too high
    expr:  node_memory_Active_bytes{instance="192.168.3.200:9100", job="node_exporter"}>0
    labels:
      severity: critical
      node_name: abc
    annotations:
      summary: 机器太累了

- name: alert_g_2
  rules:
  - alert: mysql_qps too high
    expr: mysql_global_status_queries{instance="192.168.3.200:3306", job="mysql_exporter"} >0
    labels:
      severity: warning
      node_name: abc
    annotations:
      summary: mysql太累了


EOF
  • 其中alert_g_1由job=node_exporter触发
  • 其中alert_g_2由job=mysqld_exporter触发

修改prometheus主配置文件,生效rule和alertmanager

# 写配置文件
cat <<EOF > /opt/app/prometheus/prometheus.yml

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.20.70.215:9093

rule_files:
  - /opt/app/prometheus/rule.yml


scrape_configs:


  - job_name: node_exporter
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    #metrics_path: /metrics
    #scheme: http
    static_configs:
    - targets:
      - 172.20.70.205:9100

  - job_name: mysqld_exporter
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    #metrics_path: /metrics
    #scheme: http
    static_configs:

    - targets:
      - 172.20.70.205:9104
EOF

# reload
curl -X POST -vvv  localhost:9090/-/reload

效果展示

期望效果

  • 5001 收到 alert_g_1
  • 5002 收到 alert_g_2
  • 5003 收到 alert_g_1 和 alert_g_2

实际效果

  • 效果图片

本节重点总结 : alertmanager分组

  • 启动3个alert_receive接收端
  • 在alertmanager配置文件中编写相关路由
  • prometheus编写rule文件触发告警
  • 观察3个接收端
  • 5001 收到 alert_g_1
  • 5002 收到 alert_g_2
  • 5003 收到 alert_g_1 和 alert_g_2