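AlertManager Installation and Alert Delivery
How Prometheus triggers an alert
Prometheus -> threshold crossed -> duration exceeded -> AlertManager -> grouping | inhibition | silencing -> media type -> email | DingTalk | WeChat
Grouping (group): merges alerts of a similar nature into a single notification, for example network notifications, host notifications, or service notifications.
Silences (silences): a simple mechanism for muting alerts during a given time window, for example before server maintenance or during scheduled nightly updates.
Inhibition (inhibition): once an alert has fired, notifications for other alerts caused by it are suppressed. Multiple alert events that stem from one failure are collapsed, which removes redundant alerts.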
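As a rough illustration of grouping (not Alertmanager's actual implementation), the sketch below merges sample alerts that share an alertname into one line per group, which is the effect of group_by: ['alertname']. The function name group_alerts and the sample alert names are made up for this example.

```shell
#!/bin/bash
# group_alerts: reads "ALERTNAME INSTANCE" pairs on stdin and prints one
# line per alert name, listing every instance that belongs to that group.
# Hypothetical helper for illustration only.
group_alerts() {
  awk '{
         if ($1 in members) members[$1] = members[$1] "," $2
         else members[$1] = $2
       }
       END { for (name in members) print name ": " members[name] }' | sort
}
```

Feeding it three sample alerts, two of them named HostDown, yields two lines: one notification per group instead of three separate ones.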
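1. Install AlertManager
AlertManager can run on the same server as Prometheus or on a different one, as long as the two can reach each other over the network.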
# cd /apps
# wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
# tar xf alertmanager-0.24.0.linux-amd64.tar.gz
# ln -sf /apps/alertmanager-0.24.0.linux-amd64 /apps/alertmanager
# ln -sf /apps/alertmanager-0.24.0.linux-amd64/alertmanager /usr/bin/
2. Configure the service file
/etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus AlertManager
After=network.target
[Service]
Restart=on-failure
ExecStart=/usr/bin/alertmanager --config.file=/apps/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
Start the service
# systemctl enable --now alertmanager.service
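To confirm that the service actually came up, Alertmanager's /-/healthy and /-/ready HTTP endpoints can be queried. This is a minimal sketch that assumes the default listen port 9093 on the local host; the helper names am_endpoint and check_alertmanager are hypothetical.

```shell
#!/bin/bash
# am_endpoint HOST:PORT NAME
# Prints the URL of one of Alertmanager's management endpoints
# (NAME is "healthy" or "ready").
am_endpoint() {
  printf 'http://%s/-/%s\n' "$1" "$2"
}

# check_alertmanager [HOST:PORT]
# Queries both endpoints; curl -f makes a non-2xx answer count as failure.
# Assumes Alertmanager listens on localhost:9093 unless told otherwise.
check_alertmanager() {
  curl -fsS "$(am_endpoint "${1:-localhost:9093}" healthy)" &&
    curl -fsS "$(am_endpoint "${1:-localhost:9093}" ready)"
}
```

Running check_alertmanager on the Alertmanager host should exit with status 0 once the service is up. A non-zero status suggests checking systemctl status alertmanager.service.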
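3. Alert notifications
3.1 Email alerts
Parameter | Meaning |
smtp_from | Sender email address |
smtp_smarthost | SMTP server address (host:port) |
smtp_auth_username | Sender account |
smtp_auth_password | Sender password |
smtp_require_tls | Whether TLS is required |
wechat_api_url | WeChat Work API URL |
wechat_api_secret | WeChat Work API secret |
wechat_api_corp_id | WeChat Work corp ID |
resolve_timeout: 60s | How long Alertmanager waits without receiving an update for an alert before marking it resolved |
group_by: ['alertname'] | Labels used to group alerts |
group_wait: 10s | Initial wait for a new group: after an alert fires, wait 10 seconds so that alerts of the same group arriving in that window are sent together |
group_interval: 1m | How long to wait before sending a notification about new alerts added to a group that has already been notified |
repeat_interval: 2m | After a notification has been sent successfully, wait 2 minutes and send it again if the alert has not recovered |
severity: 'critical' | Alert severity |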
alertmanager.yml
global:
  # Time after which an alert is declared resolved if it has not been updated
  resolve_timeout: 1m
  # Email delivery settings
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '2841809255@qq.com'
  smtp_auth_username: '2841809255@qq.com'
  smtp_auth_password: 'yujbvgdfhwd0bdij'
  smtp_hello: 'qq.com'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
      - to: '13917099322@139.com'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
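The effect of the three timing settings in route is easier to see on a timeline. The sketch below is a simplified model, not Alertmanager's code. It assumes one alert group that keeps firing unchanged, a first notification after group_wait, a re-evaluation every group_interval, and a repeat at the first evaluation where at least repeat_interval has elapsed since the last send. The function name notification_times is hypothetical, and all values are in seconds.

```shell
#!/bin/bash
# notification_times GROUP_WAIT GROUP_INTERVAL REPEAT_INTERVAL HORIZON
# Prints the times (seconds after the alert first fires) at which a
# notification would go out under the simplified model described above.
# GROUP_INTERVAL must be a positive number of seconds.
notification_times() (
  group_wait=$1; group_interval=$2; repeat_interval=$3; horizon=$4
  t=$group_wait        # the first evaluation happens after group_wait
  last_sent=-1         # -1 means nothing has been sent yet
  out=""
  while [ "$t" -le "$horizon" ]; do
    if [ "$last_sent" -lt 0 ] || [ $((t - last_sent)) -ge "$repeat_interval" ]; then
      out="${out:+$out }$t"
      last_sent=$t
    fi
    t=$((t + group_interval))   # the next evaluation of the group
  done
  echo "$out"
)

# Settings from the route block above: 10s, 10s, 2m, watched for 400 seconds.
notification_times 10 10 120 400
```

Under this model the alert is delivered at 10, 130, 250 and 370 seconds. Real deployments show some jitter around these values.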
3.1.1 Configure alert rules
# mkdir /apps/prometheus/rules
# vi /apps/prometheus/rules/pods_rule.yaml
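Rule configuration
groups:
  # Rule group name; any unique name works
  - name: node_memory
    rules:
      # Alert name
      - alert: NodeFreeMemoryLow
        # PromQL expression
        expr: node_memory_MemFree_bytes < 2*1024*1024*1024
        # How long the condition must hold before the alert fires
        for: 15s
        # Labels; severity sets the alert level
        labels:
          severity: critical
        # Alert content
        annotations:
          description: Free memory on the node is below 2 GiB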
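Before Prometheus loads the rule file, it can be validated with promtool, which ships in the Prometheus tarball. The sketch below assumes promtool is on the PATH; the helper names gib_to_bytes and check_rules are hypothetical. The first helper only spells out the arithmetic behind the 2*1024*1024*1024 threshold.

```shell
#!/bin/bash
# gib_to_bytes N: converts N GiB to bytes, the unit used by
# node_memory_MemFree_bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# check_rules FILE: runs "promtool check rules" on FILE.
# Returns 2 without checking anything if promtool is not installed.
check_rules() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping check of $1" >&2
    return 2
  fi
  promtool check rules "$1"
}

gib_to_bytes 2   # the threshold used in the expr above
```

For the file created in this section, the call would be check_rules /apps/prometheus/rules/pods_rule.yaml. promtool exits with a non-zero status when it finds a syntax error.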
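3.1.2 Configure Prometheus
Edit prometheus.yml:
- Set the Alertmanager address
- Set the rule file path
3.1.2.1 Set the Alertmanager address
In this setup Alertmanager runs on the same server as Prometheus. It can also be installed on another server, in which case only the IP address here needs to change.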
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.31.201:9093
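3.1.2.2 Specify the rule file location
The rule path can be relative or absolute (relative paths are resolved against the directory that contains prometheus.yml).
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/apps/prometheus/rules/pods_rule.yaml"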
3.1.2.3 Restart Prometheus
# systemctl restart prometheus.service
The Alertmanager web UI now shows that the alert has fired.
In the mailbox, two alert emails from Prometheus have arrived.
The alert is also visible on the Alerts page in Prometheus.
3.2 DingTalk alerts
3.2.1 Add a robot
3.2.2 DingTalk script
#!/bin/bash
source /etc/profile
# Message text, passed as the first argument
MESSAGE=$1
# ${MESSAGE} is double-quoted so that a message containing spaces stays one argument
/usr/bin/curl -X "POST" 'https://oapi.dingtalk.com/robot/send?access_token=1179c64f197a5da70d4b393111dd43298e58f8112e22f3e00d6632591337c43a' \
  -H 'Content-Type: application/json' \
  -d '{"msgtype": "text",
       "text": {
         "content": "'"${MESSAGE}"'"
       }
     }'
Send a test alert (the message text 测试测试 means "test test")
root@prometheus-2:/apps/alertmanager# ./dingding.sh "alertname,测试测试"
{"errcode":0,"errmsg":"ok"}
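The script pastes the message straight into the JSON body, so a message containing a double quote, a backslash, or a line break would produce invalid JSON. A minimal sketch of escaping the text first is given here. The helper names json_escape, build_payload and send_dingtalk, and the DINGTALK_WEBHOOK_URL variable, are hypothetical and not part of the original script.

```shell
#!/bin/bash
# json_escape TEXT: escapes backslashes, double quotes and newlines so that
# TEXT can sit inside a JSON string literal.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# build_payload TEXT: prints the DingTalk "text" message body for TEXT.
build_payload() {
  printf '{"msgtype": "text", "text": {"content": "%s"}}' "$(json_escape "$1")"
}

# send_dingtalk TEXT: posts the payload to the robot webhook.
# DINGTALK_WEBHOOK_URL must hold the full robot URL including access_token.
send_dingtalk() {
  curl -sS -X POST "$DINGTALK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1")"
}
```

With DINGTALK_WEBHOOK_URL exported, send_dingtalk 'alertname: disk "full"' would send the quoted text intact.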
3.2.3 Deploy webhook-dingtalk
prometheus-webhook-dingtalk
root@prometheus-2:/apps# wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
root@prometheus-2:/apps# tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
root@prometheus-2:/apps# ln -sf /apps/prometheus-webhook-dingtalk-1.4.0.linux-amd64 /apps/prometheus-webhook-dingtalk
Configure the service file
[Unit]
Description=Prometheus dingtalk
After=network.target
[Service]
Restart=on-failure
ExecStart=/apps/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="alertname=https://oapi.dingtalk.com/robot/send?access_token=1179c64f197a5da70d4b393111dd43298e58f8112e22f3e00d6632591337c43a"
[Install]
WantedBy=multi-user.target
Start dingtalk
# systemctl enable --now prometheus_dingtalk.service
# ss -ntlup |grep 8060
tcp LISTEN 0 4096 *:8060 *:* users:(("prometheus-webh",pid=25625,fd=3))
3.2.4 Configure Alertmanager
Change the route and receivers sections of alertmanager.yml as follows
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  ## This line switches the receiver to DingTalk
  receiver: 'dingtalk'
receivers:
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://192.168.31.201:8060/dingtalk/alertname/send'
        send_resolved: true
After Alertmanager is restarted, the alert messages arrive in DingTalk.
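3.2.5 DingTalk forwarding flow
- Prometheus matches an alert against a rule (the rule file listed under rule_files: in prometheus.yml)
- Once the alert fires, Prometheus forwards it to Alertmanager (the Alertmanager address is defined under alerting: / alertmanagers: in prometheus.yml)
- Alertmanager routes the alert to the receiver named by route: / receiver: 'dingtalk' in alertmanager.yml, and receivers: / webhook_configs: points that receiver at the dingtalk webhook
- The webhook service delivers the message to DingTalk using the --ding.profile option set in its service file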
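The last two hops of this flow can be exercised without waiting for a real rule to fire, by posting a hand-made alert to Alertmanager's v2 HTTP API (POST /api/v2/alerts). The sketch below assumes the Alertmanager address used earlier in this document. The helper names, the ALERTMANAGER_URL variable, and the label values are hypothetical.

```shell
#!/bin/bash
# build_test_alert NAME SEVERITY
# Prints a one-element JSON array in the shape accepted by /api/v2/alerts.
# No endsAt is given, so Alertmanager treats the alert as firing until
# resolve_timeout passes without the alert being sent again.
build_test_alert() {
  printf '[{"labels": {"alertname": "%s", "severity": "%s", "instance": "manual-test"}, "annotations": {"description": "manual test alert"}}]' "$1" "$2"
}

# post_test_alert NAME SEVERITY
# Sends the alert to Alertmanager. ALERTMANAGER_URL is an optional override.
post_test_alert() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$(build_test_alert "$1" "$2")" \
    "${ALERTMANAGER_URL:-http://192.168.31.201:9093}/api/v2/alerts"
}
```

Calling post_test_alert ManualTest critical should make the alert appear in the Alertmanager UI and, after group_wait, in the DingTalk group, provided the robot's security settings accept the message.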
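3.3 WeChat Work alerts
3.3.1 Create a robot
Send a test message and make sure the client receives it.
3.3.2 Obtain the three IDs
Corp ID
AgentId
Secret
3.3.3 Alertmanager configuration
- Change receiver so that Alertmanager forwards alerts to WeChat Work
- Set the WeChat Work corp ID as corp_id, the AgentId as agent_id, and the Secret as api_secret
- Use to_party to restrict which department receives the messages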
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 10m
  receiver: 'wechat'
receivers:
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'ww11cfebc2eb8be3e2'
        to_party: '2' # Target department; the department ID is shown under the "..." menu next to the department
        agent_id: '1000003'
        api_secret: '9j4EAng3zKXabEMVkRQemGtBZQVA2728jHATBHgXD3w'
        send_resolved: true
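The corp_id and api_secret pair can be checked on its own by requesting an access token from the WeChat Work gettoken endpoint. This is a minimal sketch; the helper names wechat_token_url and wechat_check_credentials are hypothetical, and the values passed in should be the ones from the configuration above.

```shell
#!/bin/bash
# wechat_token_url CORP_ID SECRET
# Prints the gettoken URL for the given corp ID and application secret.
wechat_token_url() {
  printf 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s\n' "$1" "$2"
}

# wechat_check_credentials CORP_ID SECRET
# Requests a token; the JSON reply carries an errcode field, and 0 is
# expected when the credentials are accepted.
wechat_check_credentials() {
  curl -sS "$(wechat_token_url "$1" "$2")"
}
```

A reply with a non-zero errcode usually points to a mistyped corp ID or secret rather than a problem in Alertmanager.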
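3.3.4 Add the server IP to the WeChat allowlist
Log in to the WeChat Work admin console, open the robot under application management, find the trusted enterprise IP (企业可信IP) setting near the bottom, and add the server's public IP.
The public IP can be looked up with the following command.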
# curl ifconfig.me
3.3.5 Restart Alertmanager
# systemctl restart alertmanager
The alert message has now arrived in the WeChat Work client.