1. Introduction
Our project uses kube-prometheus release-0.10 to monitor the state of a Kubernetes cluster. The project ships with predefined alerting rules, but it does not deliver alert notifications out of the box, so we need to build a notification mechanism that fits our own environment. This article walks through alert notifications via WeCom (WeChat Work, 企微).
2. Implementation
For years, the common approach to WeCom alerting was to log into the WeCom admin console, create a dedicated application to serve as the monitoring message channel, and push notifications by calling that application's secret. This is cumbersome, and its biggest pain point is that the alert group can only receive alerts; members cannot chat in it, and it is inflexible. To add another alert group, would we really create yet another application? That is not what we want. Today we use a webhook instead. This article covers two ways to call a WeCom webhook.
2.1 Obtain the WeCom Webhook Key
Create a dedicated WeCom group, add a webhook robot to it, and copy the robot's webhook key.
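Before wiring the key into the cluster, you can verify it works with a quick test message. A minimal sketch using only the standard library; the endpoint is WeCom's documented group-robot URL, and `build_payload`/`send_test` are illustrative names, not part of any library:

```python
import json
import urllib.request

# WeCom group-robot endpoint; substitute your robot's key.
WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key={key}"

def build_payload(text: str) -> dict:
    """Build the JSON body the WeCom group robot expects for a plain-text message."""
    return {"msgtype": "text", "text": {"content": text}}

def send_test(key: str, text: str) -> None:
    """POST a test message to the group robot (requires network access)."""
    body = json.dumps(build_payload(text)).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL.format(key=key),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Print the payload shape; call send_test(key, ...) to actually deliver it.
    print(json.dumps(build_payload("webhook connectivity test"), ensure_ascii=False))
```

If the key is valid, the robot posts the message into the group immediately; an invalid key returns an error JSON from the WeCom API.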
2.2 Write the Deployment Resource
Create a Deployment whose args pass the WeCom webhook key to the container as a command-line parameter at Pod startup.
#vim wechat-test.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: prometheus-webhook-qywx
  name: prometheus-webhook-qywx
  namespace: monitoring
spec:
  selector:
    matchLabels:
      run: prometheus-webhook-qywx
  template:
    metadata:
      labels:
        run: prometheus-webhook-qywx
    spec:
      containers:
      - args:
        - --adapter=/app/prometheusalert/wx.js=/adapter/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=[your WeCom robot webhook key]
        image: registry.cn-beijing.aliyuncs.com/devops-prw/webhook-wechat:latest
        name: prometheus-webhook-qywx
        ports:
        - containerPort: 80
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: prometheus-webhook-qywx
  name: prometheus-webhook-qywx
  namespace: monitoring
spec:
  ports:
  - port: 8060
    protocol: TCP
    targetPort: 80
  selector:
    run: prometheus-webhook-qywx
  type: ClusterIP
#kubectl apply -f wechat-test.yaml
2.3 Configure Alert Notifications
With the Service in place, reference it in the alertmanager secret as the webhook URL that Alertmanager will call.
#vim alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    alertmanager: main
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.23.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  wechat.tmpl: |-
    {{ define "wechat.to.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    ========= **Monitoring Alert** =========
    **Cluster:** kube-k8s
    **Alert name:** {{ $alert.Labels.alertname }}
    **Severity:** {{ $alert.Labels.severity }}
    **Status:** {{ .Status }}
    **Instance:** {{ $alert.Labels.instance }} {{ $alert.Labels.device }}
    **Summary:** {{ .Annotations.summary }}
    **Details:** {{ $alert.Annotations.message }}{{ $alert.Annotations.description }}
    **Labels:** {{ range .Labels.SortedPairs }}
    [{{ .Name }}: {{ .Value | markdown | html }} ]
    {{- end }}
    **Fired at:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    ========= = **end** = =========
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    ========= **Alert Resolved** =========
    **Cluster:** kube-k8s
    **Summary:** {{ $alert.Annotations.summary }}
    **Instance:** {{ .Labels.instance }}
    **Alert name:** {{ .Labels.alertname }}
    **Severity:** {{ $alert.Labels.severity }}
    **Status:** {{ .Status }}
    **Details:** {{ $alert.Annotations.message }}{{ $alert.Annotations.description }}
    **Fired at:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at:** {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    ========= = **end** = =========
    {{- end }}
    {{- end }}
    {{- end }}
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    receivers:
    - name: 'webhook'
      webhook_configs:
      - url: 'http://prometheus-webhook-qywx.monitoring.svc.cluster.local:8060/adapter/wx'
        send_resolved: true
        message: '{{ template "wechat.to.message" . }}'
    templates:
    - '/etc/alertmanager/config/wechat.tmpl'
    route:
      group_by:
      - job
      - namespace
      - alertname
      group_interval: 3m
      group_wait: 30s
      receiver: webhook
      # Minimum wait before a successfully delivered notification is sent again;
      # with 5m, an alert first sent at :43 will not repeat before :48.
      repeat_interval: 5m
      routes:
      - match:
          severity: warning
        receiver: webhook
type: Opaque
Alerts do arrive, but their content is always the default payload; no matter how the template is referenced it never takes effect, and there is no error either. The likely cause is that Alertmanager's webhook receiver (`webhook_configs`) always POSTs a fixed JSON payload and does not support a `message`/template field; that option belongs to `wechat_configs`. If anyone has solved template rendering for this adapter, feel free to leave a comment below.
2.4 Alert Notifications via the prometheus-flask Program
In this approach, alerts from Alertmanager are first sent to the prometheus-flask program; prometheus-flask parses and reformats each message, then forwards it to the WeCom group using the webhook key we obtained earlier.
GitHub: https://github.com/hsggj002/prometheus-flask
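The core of what prometheus-flask does can be sketched in plain Python, independent of Flask: take the JSON body that Alertmanager POSTs to a webhook receiver and turn each alert into a WeCom markdown payload. The names here (`make_markdown`, `handle_payload`) are illustrative, not the repo's actual API:

```python
import json

def make_markdown(alert: dict) -> dict:
    """Render one Alertmanager alert as a WeCom markdown message body."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    title = "Alert" if alert.get("status") == "firing" else "Resolved"
    content = (
        f"## {title}: {labels.get('alertname', 'unknown')}\n"
        f"**Severity:** {labels.get('severity', 'none')}\n"
        f"**Instance:** {labels.get('instance', 'none')}\n"
        f"**Details:** {annotations.get('description', annotations.get('message', ''))}"
    )
    return {"msgtype": "markdown", "markdown": {"content": content}}

def handle_payload(payload: dict, send) -> int:
    """Process Alertmanager's webhook body: one message per alert; returns count sent."""
    count = 0
    for alert in payload.get("alerts", []):
        send(make_markdown(alert))
        count += 1
    return count

if __name__ == "__main__":
    sample = {
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "KubePodNotReady", "severity": "warning"},
            "annotations": {"description": "Pod default/nginx is not ready."},
        }]
    }
    handle_payload(sample, lambda msg: print(json.dumps(msg)))
```

The real program wraps this loop in a Flask route and posts each rendered payload to the WeCom webhook URL passed in on the command line.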
Modify app/Alert.py
#vim app/Alert.py
# https://github.com/hsggj002/prometheus-flask/issues/3
# Summary of changes:
# 1. The upstream code supports multiple data centers; our scenario is a single k8s
#    cluster plus some resources outside it, so that part is removed.
# 2. Upstream formats a region parameter into the alert text; here we add the pod name
#    instead, and the original resolved message had no recovery time, which we add.
# 3. Some alerts fired with a Pod name of "None" (see the screenshot below); in the
#    spirit of getting things right, that is handled as well.
# The rewritten Alert.py below can be used as-is.
# -*- coding: UTF-8 -*-
import requests
import json
import datetime


def parse_time(*args):
    """Convert Alertmanager's RFC3339 UTC timestamps to UTC+8 strings."""
    times = []
    for dates in args:
        eta_temp = dates
        if len(eta_temp.split('.')) >= 2:
            if 'Z' in eta_temp.split('.')[1]:
                # Keep at most the last 5 characters of the fraction so %f can parse it
                s_eta = eta_temp.split('.')[0] + '.' + eta_temp.split('.')[1][-5:]
                fd = datetime.datetime.strptime(s_eta, "%Y-%m-%dT%H:%M:%S.%fZ")
            else:
                # Fraction without a trailing 'Z': drop the fraction and re-append 'Z'
                eta_str = 'Z'
                fd = datetime.datetime.strptime(eta_temp.split('.')[0] + eta_str, "%Y-%m-%dT%H:%M:%SZ")
        else:
            fd = datetime.datetime.strptime(eta_temp, "%Y-%m-%dT%H:%M:%SZ")
        eta = (fd + datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S.%f")
        times.append(eta)
    return times


def alert(status, alertnames, levels, times, pod_name, ins, instance, description):
    """Build the WeCom markdown payload for a firing alert."""
    params = json.dumps({
        "msgtype": "markdown",
        "markdown": {
            "content": "## <font color=\"red\">Alert: {0}</font>\n**Name:** <font color=\"warning\">{1}</font>\n**Severity:** {2}\n**Fired at:** {3}\n**Pod:** {4}\n{5}: {6}\n**Details:** <font color=\"comment\">{7}</font>".format(status, alertnames, levels, times[0], pod_name, ins, instance, description)
        }
    })
    return params


def recive(status, alertnames, levels, times, pod_name, ins, instance, description):
    """Build the WeCom markdown payload for a resolved alert."""
    params = json.dumps({
        "msgtype": "markdown",
        "markdown": {
            "content": "## <font color=\"info\">Resolved: {0}</font>\n**Name:** <font color=\"warning\">{1}</font>\n**Severity:** {2}\n**Fired at:** {3}\n**Resolved at:** {4}\n**Pod:** {5}\n{6}: {7}\n**Details:** <font color=\"comment\">{8}</font>".format(status, alertnames, levels, times[0], times[1], pod_name, ins, instance, description)
        }
    })
    return params


def webhook_url(params, url_key):
    headers = {"Content-type": "application/json"}
    url = "{}".format(url_key)
    # The original passed headers as the third positional argument, which
    # requests.post() interprets as json=; pass it by keyword instead.
    requests.post(url, data=params, headers=headers)


def send_alert(json_re, url_key):
    print(json_re)
    for i in json_re['alerts']:
        if i['status'] == 'firing':
            if "instance" in i['labels'] and "pod" in i['labels']:
                if "description" in i['annotations']:
                    webhook_url(alert(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt']), i['labels']['pod'], 'Instance', i['labels']['instance'], i['annotations']['description']), url_key)
                elif "message" in i['annotations']:
                    webhook_url(alert(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt']), i['labels']['pod'], 'Instance', i['labels']['instance'], i['annotations']['message']), url_key)
                else:
                    # Neither description nor message annotation: fall back to a default detail
                    webhook_url(alert(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt']), i['labels']['pod'], 'Instance', i['labels']['instance'], 'Service is wrong'), url_key)
            elif "namespace" in i['labels']:
                # No instance label: report by namespace; default the pod name to 'None'
                webhook_url(alert(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt']), i['labels'].get('pod', 'None'), 'Namespace', i['labels']['namespace'], i['annotations']['description']), url_key)
            elif "Watchdog" in i['labels']['alertname']:
                webhook_url(alert(i['status'], i['labels']['alertname'], '0', '0', '0', 'Instance', 'self-test', '0'), url_key)
        elif i['status'] == 'resolved':
            if "instance" in i['labels'] and "pod" in i['labels']:
                if "description" in i['annotations']:
                    webhook_url(recive(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt'], i['endsAt']), i['labels']['pod'], 'Instance', i['labels']['instance'], i['annotations']['description']), url_key)
                elif "message" in i['annotations']:
                    webhook_url(recive(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt'], i['endsAt']), i['labels']['pod'], 'Instance', i['labels']['instance'], i['annotations']['message']), url_key)
                else:
                    webhook_url(recive(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt'], i['endsAt']), i['labels']['pod'], 'Instance', i['labels']['instance'], 'Service is wrong'), url_key)
            elif "namespace" in i['labels']:
                webhook_url(recive(i['status'], i['labels']['alertname'], i['labels']['severity'], parse_time(i['startsAt'], i['endsAt']), i['labels'].get('pod', 'None'), 'Namespace', i['labels']['namespace'], i['annotations']['description']), url_key)
            elif "Watchdog" in i['labels']['alertname']:
                webhook_url(alert(i['status'], i['labels']['alertname'], '0', '0', '0', 'Instance', 'self-test', '0'), url_key)
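The trickiest part of Alert.py is parse_time: Alertmanager reports startsAt/endsAt as RFC3339 UTC strings with a variable-precision fraction, and the code shifts them to UTC+8. The same conversion in standalone form (the fraction handling here is a simplification of the string surgery above; `to_cst` is an illustrative name):

```python
import datetime

def to_cst(rfc3339: str) -> str:
    """Convert an Alertmanager RFC3339 UTC timestamp to a UTC+8 string."""
    # Trim the trailing 'Z' and cut the sub-second fraction to 6 digits,
    # the most that strptime's %f directive can parse.
    ts = rfc3339.rstrip('Z')
    if '.' in ts:
        head, frac = ts.split('.', 1)
        ts = head + '.' + frac[:6]
        fmt = "%Y-%m-%dT%H:%M:%S.%f"
    else:
        fmt = "%Y-%m-%dT%H:%M:%S"
    fd = datetime.datetime.strptime(ts, fmt)
    return (fd + datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S")

print(to_cst("2021-01-11T02:54:57.123456789Z"))  # 2021-01-11 10:54:57
```

Both the 9-digit-fraction and the fraction-less forms that Alertmanager emits parse to the same local time.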
Build the image
Note that the python:3.10.2 base image comes from Docker Hub, which is now blocked in mainland China, so pulling it directly may fail.
# Modify the Dockerfile
#vim prometheus-flask/Dockerfile
FROM python:3.10.2
COPY ./app /app
COPY ./requirements.txt /app/requirements.txt
WORKDIR /app
# Use a domestic PyPI mirror; the packages in requirements.txt download slowly otherwise
RUN pip install -r /app/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
ENTRYPOINT ["python", "/app/main.py"]
# Modify requirements.txt
#vim prometheus-flask/requirements.txt
flask_json == 0.3.4
flask == 2.0.1
requests == 2.19.1
gevent == 21.12.0
Werkzeug == 2.0.1
#docker build -t registry.cn-beijing.aliyuncs.com/devops-op/prometheus-flask:v3 .
Write the Deployment resources
#vim pyhook.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertinfo
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertinfo
      release: stable
  template:
    metadata:
      labels:
        app: alertinfo
        release: stable
    spec:
      containers:
      - name: alertinfo-flask
        image: registry.cn-beijing.aliyuncs.com/devops-op/prometheus-flask:v3
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 80
        args:
        - "-k https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=[your WeCom robot webhook key]"
        - "-p 80"
---
apiVersion: v1
kind: Service
metadata:
  name: alertinfo-svc
  namespace: monitoring
spec:
  selector:
    app: alertinfo
    release: stable
  type: ClusterIP
  ports:
  - name: http
    targetPort: 80
    port: 80
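To smoke-test the service from inside the cluster without waiting for a real alert, you can POST a hand-built Alertmanager-style body to it. A sketch using the Service DNS name defined above; it assumes the app listens on the root path, so adjust `URL` to whatever route prometheus-flask actually registers:

```python
import json
import urllib.request

# In-cluster DNS name of the Service defined above (path is an assumption).
URL = "http://alertinfo-svc.monitoring.svc.cluster.local:80/"

def sample_payload() -> dict:
    """A minimal Alertmanager webhook body exercising the firing branch of Alert.py."""
    return {
        "alerts": [{
            "status": "firing",
            "labels": {
                "alertname": "KubePodNotReady",
                "severity": "warning",
                "instance": "10.0.0.1:9100",
                "pod": "nginx-6b4f7c9d8-abcde",
            },
            "annotations": {"description": "Pod default/nginx has been in a non-ready state."},
            "startsAt": "2021-01-11T02:54:57Z",
        }]
    }

def fire(url: str = URL) -> None:
    """POST the sample body (requires in-cluster network access)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(sample_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Print the body; run fire() from a pod in the cluster to deliver it.
    print(json.dumps(sample_payload(), indent=2))
```

If the pipeline is wired up correctly, the WeCom group should receive a formatted KubePodNotReady message within seconds.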
3. Verification
Verify that the alert notification mechanism works.
# Create a Deployment that cannot run normally (its image tag does not exist)
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-01-11T02:54:57Z"
  generation: 1
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1                # desired number of pod replicas
  revisionHistoryLimit: 10   # how many old ReplicaSets to keep for rollbacks
  selector:                  # how the controller finds the pods it manages
    matchLabels:
      app: nginx
  strategy:                  # how new pods replace old ones during an update
    rollingUpdate:
      maxSurge: 25%          # at most 25% extra pods above the desired count during a rollout
      maxUnavailable: 25%    # at most 25% of pods may be unavailable during a rollout
    type: RollingUpdate      # default strategy; the alternative is Recreate (terminate all old pods, then roll out)
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: registry.cn-beijing.aliyuncs.com/devops-op/nginx:latest222  # deliberately nonexistent tag
        imagePullPolicy: IfNotPresent   # skip the pull if the image already exists locally
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst           # resolve names through the cluster DNS first
      restartPolicy: Always
      schedulerName: default-scheduler  # use the default scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
This triggers a Pod-not-ready alert notification; afterwards, fix the Deployment manually to receive the recovery notification.
Simulating a TargetDown alert and its recovery