AlertManager Installation and Alert Delivery

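How Prometheus triggers an alert
Prometheus -> threshold crossed -> duration exceeded -> AlertManager -> grouping | inhibition | silencing -> notification channel -> email | DingTalk | WeChat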

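Grouping (group): merges alerts of a similar nature into a single notification, for example network, host, or service notifications
Silencing (silences): a simple mechanism for muting alerts during a given time window, for example before server maintenance or during scheduled nightly updates
Inhibition (inhibition): once an alert has been sent, stops sending the other alerts caused by it; merging the multiple alert events that stem from one fault removes redundant alerts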
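
To make grouping concrete, here is a toy sketch that merges sample alerts by name, roughly what group_by: ['alertname'] does before a notification is sent. The sample alerts and the helper name group_by_alertname are invented for illustration; this is not how Alertmanager is implemented.

```shell
#!/bin/bash
# Toy input: one alert per line as "alertname instance" (made-up sample data).
alerts='HighMemory node1
HighMemory node2
DiskFull node1'

# Group by the first column, mimicking group_by: ['alertname'].
# group_by_alertname is a hypothetical helper name.
group_by_alertname() {
  awk '{ n[$1]++; hosts[$1] = hosts[$1] " " $2 }
       END { for (a in n) print a ": " n[a] " alert(s):" hosts[a] }' | sort
}

printf '%s\n' "$alerts" | group_by_alertname
```

Three alerts collapse into two groups, so two notifications would go out instead of three.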

1. Install AlertManager

It can run on the same server as Prometheus or on a different one, as long as the two can reach each other over the network.

# cd /apps
# wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
# tar xf alertmanager-0.24.0.linux-amd64.tar.gz
# ln -sf /apps/alertmanager-0.24.0.linux-amd64 /apps/alertmanager
# ln -sf /apps/alertmanager-0.24.0.linux-amd64/alertmanager /usr/bin/
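
If a different version or architecture is needed, the download URL can be derived from the version number rather than typed by hand. A minimal sketch, assuming the release naming pattern seen in the wget command above; the function name am_release_url is hypothetical.

```shell
#!/bin/bash
# Build the release tarball URL for a given Alertmanager version.
# am_release_url and the amd64 default are illustrative assumptions.
am_release_url() {
  version="$1"
  arch="${2:-amd64}"
  printf 'https://github.com/prometheus/alertmanager/releases/download/v%s/alertmanager-%s.linux-%s.tar.gz\n' \
    "$version" "$version" "$arch"
}

# Prints the same URL as the wget command above.
am_release_url 0.24.0
```

The result can be passed straight to wget, for example wget "$(am_release_url 0.24.0)".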

2. Configure the service file

/etc/systemd/system/alertmanager.service

[Unit]
Description=Prometheus AlertManager
After=network.target

[Service]
Restart=on-failure
ExecStart=/usr/bin/alertmanager --config.file=/apps/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target

Start the service

# systemctl enable --now alertmanager.service
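
After starting the service it can help to wait until Alertmanager answers on its health endpoint before continuing. The sketch below is a generic retry helper; wait_until and RETRY_DELAY are hypothetical names, and the example address assumes a local instance on the default port 9093.

```shell
#!/bin/bash
# Retry a command until it succeeds or the attempt budget is used up.
# wait_until and RETRY_DELAY are hypothetical names for this sketch.
wait_until() {
  attempts="$1"
  shift
  while [ "$attempts" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    attempts=$((attempts - 1))
    # Pause between attempts, but not after the last one.
    [ "$attempts" -gt 0 ] && sleep "${RETRY_DELAY:-1}"
  done
  return 1
}

# Example (needs a running Alertmanager on the default port 9093):
#   wait_until 10 curl -fsS http://127.0.0.1:9093/-/healthy
```

The helper returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can be used in a deployment script with && or ||.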

3. Alert notifications

3.1 Email alerts

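Parameter                   Meaning
smtp_from                   Sender email address
smtp_smarthost              SMTP server address of the mail provider
smtp_auth_username          Sender account
smtp_auth_password          Sender password
smtp_require_tls            Whether TLS is required
wechat_api_url              Enterprise WeChat API URL
wechat_api_secret           Enterprise WeChat API secret
wechat_api_corp_id          Enterprise WeChat corp ID
resolve_timeout: 60s        How long Alertmanager waits without receiving an update for an alert before marking it resolved
group_by: ['alertname']     Labels used to group alerts
group_wait: 10s             Initial wait for a group: after an alert fires, wait 10 seconds so that matching alerts arriving in that window are sent together
group_interval: 1m          Wait before notifying about new alerts added to a group that has already been notified
repeat_interval: 2m         After a notification has been sent successfully, wait 2 minutes and send it again if the alert has not recovered
severity: 'critical'        Alert severity level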

alertmanager.yml
global:
  # Time after which an alert is declared resolved if no further alert is received
  resolve_timeout: 1m
  # Mail delivery settings
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '2841809255@qq.com'
  smtp_auth_username: '2841809255@qq.com'
  smtp_auth_password: 'yujbvgdfhwd0bdij'
  smtp_hello: 'qq.com'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  email_configs:
    - to: '13917099322@139.com'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
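
To check the email route without waiting for a real rule to fire, a test alert can be posted to Alertmanager's v2 API. The sketch below only builds the JSON body; the alert name TestAlert, the instance demo-host, and the helper name test_alert_payload are invented for this test, and the commented curl line assumes the Alertmanager address used elsewhere in this article.

```shell
#!/bin/bash
# Print a one-alert JSON body for Alertmanager's POST /api/v2/alerts endpoint.
# The alert name and label values are made up for this test.
test_alert_payload() {
  cat <<'EOF'
[
  {
    "labels": {"alertname": "TestAlert", "severity": "critical", "instance": "demo-host"},
    "annotations": {"description": "manual test alert"}
  }
]
EOF
}

test_alert_payload

# Send it (needs a reachable Alertmanager; adjust the address to yours):
#   test_alert_payload | curl -fsS -X POST -H 'Content-Type: application/json' \
#     --data-binary @- http://192.168.31.201:9093/api/v2/alerts
```

Because the alert carries severity: critical, it should also act as a source for the inhibit rule above if a matching warning alert exists.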

3.1.1 Configure alert rules

# mkdir /apps/prometheus/rules
# vi /apps/prometheus/rules/pods_rule.yaml

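Rule configuration

groups:
  # Group name (arbitrary; node_rules is a placeholder)
  - name: node_rules
    rules:
      # Alert name
      - alert: node内存可用大小
        # PromQL expression
        expr: node_memory_MemFree_bytes < 2*1024*1024*1024
        # How long the condition must hold before the alert fires
        for: 15s
        # Labels; severity sets the alert level
        labels:
          severity: critical
        # Alert content
        annotations:
          description: node available memory is less than 2G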
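
The threshold in the expression is 2 GiB written out in bytes. As a rough local cross-check, the same comparison can be reproduced from /proc/meminfo, which reports MemFree in kB. This is a sketch under that assumption; memfree_bytes is a hypothetical helper and takes the file path as an argument so it can be tried on a sample file.

```shell
#!/bin/bash
# Threshold from the rule expression: 2 GiB in bytes.
threshold=$((2 * 1024 * 1024 * 1024))
echo "$threshold"

# Read MemFree (reported in kB) from a meminfo-style file and convert to bytes.
# memfree_bytes is a hypothetical helper; the file defaults to /proc/meminfo.
memfree_bytes() {
  kb=$(awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}")
  echo $((kb * 1024))
}

# Usage on a Linux host:
#   [ "$(memfree_bytes)" -lt "$threshold" ] && echo "rule condition is true"
```

If the condition is true on a host for longer than the for: duration, the rule above would be expected to fire for that instance.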

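3.1.2 Configure Prometheus

Edit prometheus.yaml:

  1. Set the Alertmanager address
  2. Set the rule path

3.1.2.1 Set Alertmanager

Alertmanager was installed above on the same server as Prometheus, but it can also run on a different server; only the IP address here needs to change.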

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.31.201:9093

3.1.2.2 Specify the rule location

The rule path can be relative or absolute.

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "/apps/prometheus/rules/pods_rule.yaml"

3.1.2.3 Restart Prometheus

# systemctl restart prometheus.service

Alertmanager now shows that the alert has fired.

In the mailbox, two alert emails from Prometheus have arrived.

The alert also appears on the Alerts page of Prometheus.

3.2 DingTalk alerts

3.2.1 Add a robot

3.2.2 DingTalk script

#!/bin/bash
source /etc/profile
MESSAGE=$1

/usr/bin/curl -X "POST" 'https://oapi.dingtalk.com/robot/send?access_token=1179c64f197a5da70d4b393111dd43298e58f8112e22f3e00d6632591337c43a' \
  -H 'Content-Type: application/json' \
  -d '{"msgtype": "text",
      "text": {
          "content": "'"${MESSAGE}"'"
    }
}'

Send a test alert

root@prometheus-2:/apps/alertmanager# ./dingding.sh "alertname,测试测试"
{"errcode":0,"errmsg":"ok"}
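
The script pastes the message into the JSON body as-is, so a message containing a double quote or a backslash would produce invalid JSON. A minimal sketch of escaping the text first; json_escape and dingtalk_text_payload are hypothetical helper names, and newlines inside the message are not handled here.

```shell
#!/bin/bash
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# json_escape and dingtalk_text_payload are hypothetical helper names.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the same text-message body that the script above sends.
dingtalk_text_payload() {
  printf '{"msgtype": "text", "text": {"content": "%s"}}\n' "$(json_escape "$1")"
}

dingtalk_text_payload 'alertname: disk "sda" is full'
```

The printed body can be passed to the curl command from the script with -d "$(dingtalk_text_payload "$MESSAGE")".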

3.2.3 Deploy webhook-dingtalk

prometheus-webhook-dingtalk

root@prometheus-2:/apps# wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
root@prometheus-2:/apps# tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
root@prometheus-2:/apps# ln -sf /apps/prometheus-webhook-dingtalk-1.4.0.linux-amd64 /apps/prometheus-webhook-dingtalk

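Configure the service file (for example /etc/systemd/system/prometheus_dingtalk.service, matching the unit name used below)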

[Unit]
Description=Prometheus dingtalk
After=network.target

[Service]
Restart=on-failure
ExecStart=/apps/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="alertname=https://oapi.dingtalk.com/robot/send?access_token=1179c64f197a5da70d4b393111dd43298e58f8112e22f3e00d6632591337c43a"

[Install]
WantedBy=multi-user.target

Start dingtalk

# systemctl enable --now prometheus_dingtalk.service
# ss -ntlup |grep 8060
tcp     LISTEN   0        4096                   *:8060                 *:*      users:(("prometheus-webh",pid=25625,fd=3))

3.2.4 Configure Alertmanager

Change the route and receivers in alertmanager.yml as follows

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  ## This line switches the receiver to DingTalk
  receiver: 'dingtalk'
receivers:
- name: 'dingtalk'
  webhook_configs:
  - url: 'http://192.168.31.201:8060/dingtalk/alertname/send'
    send_resolved: true

After Alertmanager is restarted, the messages arrive in DingTalk.

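3.2.5 DingTalk forwarding flow

  1. Prometheus matches an alert through its rules (the rule.yaml files listed under rule_files: in prometheus.yml)
  2. Once the alert fires, Prometheus forwards it to Alertmanager (the Alertmanager address is defined under alerting: / alertmanagers: in prometheus.yml)
  3. Alertmanager routes the alert to the receiver named by route: / receiver: 'dingtalk' in alertmanager.yml, and that receiver's receivers: / webhook_configs: entry points at the dingtalk webhook
  4. The webhook forwards the message to DingTalk using the --ding.profile setting in its service file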
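
The link between steps 3 and 4 is the profile name: the name on the left of --ding.profile ("alertname" in this setup) reappears as a path segment of the webhook URL in alertmanager.yml. A small sketch of that mapping, assuming the /dingtalk/<profile>/send path used in the configuration above; dingtalk_bridge_url is a hypothetical helper name.

```shell
#!/bin/bash
# Derive the URL for webhook_configs from the bridge address and profile name.
# dingtalk_bridge_url is a hypothetical helper name.
dingtalk_bridge_url() {
  printf 'http://%s/dingtalk/%s/send\n' "$1" "$2"
}

# Prints the url value used in alertmanager.yml above.
dingtalk_bridge_url 192.168.31.201:8060 alertname
```

If the profile were renamed in the service file, the url in webhook_configs would need the same change.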

3.3 Enterprise WeChat alerts

3.3.1 Create a robot

Send a test message and make sure the client receives it.

3.3.2 Obtain the three IDs

Corp ID (企业ID)

AgentId

Secret

3.3.3 Alertmanager configuration

  1. Change receiver so that Alertmanager forwards alerts to Enterprise WeChat
  2. Set the Enterprise WeChat corp ID (corp_id), AgentId (agent_id), and Secret (api_secret)
  3. Use to_party to limit who receives the message

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 10m
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'ww11cfebc2eb8be3e2'
    to_party: '2'    # target department; the department ID is shown under the "..." next to the department
    agent_id: '1000003'
    api_secret: '9j4EAng3zKXabEMVkRQemGtBZQVA2728jHATBHgXD3w'
    send_resolved: true
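
Before restarting Alertmanager, the corp ID and secret can be checked by requesting an access token from the Enterprise WeChat API directly. The sketch below only builds the request URL; wechat_token_url is a hypothetical helper name, CORP_ID and API_SECRET are placeholder variables, and the endpoint is assumed to be the public gettoken API under qyapi.weixin.qq.com.

```shell
#!/bin/bash
# Build the gettoken URL from a corp ID and an application secret.
# wechat_token_url is a hypothetical helper name.
wechat_token_url() {
  printf 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s\n' "$1" "$2"
}

# Example (needs network access; CORP_ID and API_SECRET are placeholders):
#   curl -fsS "$(wechat_token_url "$CORP_ID" "$API_SECRET")"
```

A JSON reply with "errcode":0 and an access_token field suggests the two values are correct.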

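3.3.4 Add the server IP to the WeChat allowlist

Log in to the Enterprise WeChat admin console, open the robot under application management, find the trusted enterprise IP setting (企业可信IP) near the bottom, and add the server's public IP.
The public IP can be looked up with the command below.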

# curl ifconfig.me

3.3.5 Restart Alertmanager

# systemctl restart alertmanager

The client has now received the alert message.