1. Preface

Our business project uses kube-prometheus release-0.10 to monitor the state of the Kubernetes cluster. The project ships with a set of default alerting rules, but it does not deliver alert notifications out of the box, so we need to build a notification mechanism that fits our own environment. This post walks through sending alert notifications to WeCom (企业微信, WeChat Work).

2. Implementation

When it comes to WeCom alerting, the common approach over the past few years was to log in to the WeCom admin console, create a dedicated application under App Management to act as the monitoring message channel, and push notifications by calling that application's secret. This is not only cumbersome; the biggest pain point is that such an alert channel cannot be used for normal conversation, it can only receive alerts, and it is inflexible. If you want to split alerts into groups, do you really create yet another application for every alert group? That is not what we want: it does receive alerts, but it creates a lot of friction. Nowadays we use group-robot webhooks instead. This post covers two ways of calling a WeCom webhook.

2.1 Obtain the WeCom webhook key

Create a dedicated WeCom group chat, add a group robot (webhook bot) to it, and copy the robot's webhook key.

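Before wiring anything into the cluster, it is worth confirming that the key itself works. The short Python sketch below (the file name and placeholder key are mine, not part of the original setup) posts a markdown test message directly to the group robot:

#vim test_webhook_key.py
# Quick sanity check for the WeCom group robot key; a working key answers {"errcode":0,"errmsg":"ok"}
import requests

WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<your-robot-key>"  # placeholder

payload = {
    "msgtype": "markdown",
    "markdown": {"content": "**Webhook connectivity test** from the kube-prometheus setup"},
}

resp = requests.post(WEBHOOK_URL, json=payload, timeout=5)
print(resp.status_code, resp.text)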

2.2 Write the Deployment resource

Create a Deployment resource whose args pass the WeCom webhook key; args is how command-line parameters are handed to the process running inside the Pod.

#vim wechat-test.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: prometheus-webhook-qywx
  name: prometheus-webhook-qywx
  namespace: monitoring
spec:
  selector:
    matchLabels:
      run: prometheus-webhook-qywx
  template:
    metadata:
      labels:
        run: prometheus-webhook-qywx
    spec:
      containers:
      - args:
        - --adapter=/app/prometheusalert/wx.js=/adapter/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=[your WeCom robot webhook key]
        image: registry.cn-beijing.aliyuncs.com/devops-prw/webhook-wechat:latest  
        name: prometheus-webhook-qywx
        ports:
        - containerPort: 80
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: prometheus-webhook-qywx
  name: prometheus-webhook-qywx
  namespace: monitoring
spec:
  ports:
  - port: 8060
    protocol: TCP
    targetPort: 80
  selector:
    run: prometheus-webhook-qywx
  type: ClusterIP
#kubectl apply -f wechat-test.yaml
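If you want to check the adapter end to end before touching Alertmanager, a sketch like the one below can be used, assuming the adapter accepts the standard Alertmanager webhook payload on /adapter/wx and that you first port-forward the Service; the file name and sample label values are made up:

#vim test_adapter.py
# Post a hand-made, Alertmanager-style webhook payload to the adapter Service.
# Assumes: kubectl -n monitoring port-forward svc/prometheus-webhook-qywx 8060:8060
import requests

payload = {
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "AdapterSmokeTest", "severity": "warning", "instance": "test-node"},
            "annotations": {"description": "manual adapter test, please ignore"},
            "startsAt": "2024-01-01T03:00:00.000Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }
    ],
}

resp = requests.post("http://127.0.0.1:8060/adapter/wx", json=payload, timeout=5)
print(resp.status_code, resp.text)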


2.3 Configure alert notifications

With the Service in place, we put its cluster-internal URL into the alertmanager-main Secret so that Alertmanager calls it as the webhook receiver.

#vim alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    alertmanager: main
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.23.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  wechat.tmpl: |-
    {{ define "wechat.to.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}

    =========  **Monitoring Alert** =========

    **Cluster:**     kube-k8s
    **Alert name:**    {{ $alert.Labels.alertname }}
    **Severity:**    {{ $alert.Labels.severity }}
    **Status:**    {{ .Status }}
    **Host:**    {{ $alert.Labels.instance }} {{ $alert.Labels.device }}
    **Summary:**    {{ .Annotations.summary }}
    **Details:**    {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}
    **Labels:**    {{ range .Labels.SortedPairs  }}
    [{{ .Name }}: {{ .Value | markdown | html }} ]
    {{- end }}


    **Started at:**    {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}{{/* 28800e9 ns = +8h, UTC to CST */}}
    ========= **end** =========
    {{- end }}
    {{- end }}

    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}

    ========= **Alert Resolved** =========
    **Cluster:**     kube-k8s
    **Summary:**    {{ $alert.Annotations.summary }}
    **Host:**    {{ .Labels.instance }}
    **Alert name:**    {{ .Labels.alertname }}
    **Severity:**    {{ $alert.Labels.severity }}
    **Status:**    {{ .Status }}
    **Details:**    {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}
    **Started at:**    {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at:**    {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}

    ========= **end** =========
    {{- end }}
    {{- end }}
    {{- end }}
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    receivers:
      - name: 'webhook'
        webhook_configs:
          - url: 'http://prometheus-webhook-qywx.monitoring.svc.cluster.local:8060/adapter/wx'
            send_resolved: true
            message: '{{ template "wechat.default.message" . }}'
    templates:
    - '/etc/alertmanager/config/wechat.tmpl'
    route:
      group_by:
        - job
        - namespace
        - alertname
      group_interval: 3m
      group_wait: 30s
      receiver: webhook
      repeat_interval: 5m # minimum wait before re-sending a notification for an alert that is still firing; with 5m, if the first notification goes out at :43, the next identical one is sent no earlier than :48
      routes:
        - match:
            severity: warning
          receiver: webhook
type: Opaque

Alerts do arrive with this setup, but the message body is always Alertmanager's default webhook payload; no matter how I referenced the template it made no difference, and there were no obvious errors either. I spent a long time on this without getting the template applied. Two likely culprits worth checking: the template above defines "wechat.to.message" while the receiver references "wechat.default.message", and unlike wechat_configs, Alertmanager's webhook_configs has no message option at all, so the webhook body format is fixed and any formatting has to happen on the adapter side. If anyone has solved this cleanly, please leave a comment below.
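One way to debug this is to look at the configuration Alertmanager is actually running, rather than the Secret you think you applied. The sketch below (file name is mine) assumes the standard kube-prometheus Service name and port and a local port-forward:

#vim check_alertmanager_config.py
# Dump the loaded Alertmanager configuration, to see whether wechat.tmpl and the receiver made it in.
# Assumes: kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093
import requests

status = requests.get("http://127.0.0.1:9093/api/v2/status", timeout=5).json()
# "config"."original" holds the configuration Alertmanager actually loaded, as plain text
print(status["config"]["original"])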


2.4 Alert notification via the prometheus-flask program

With this approach Alertmanager does not talk to the WeCom group directly: the alert is first sent to the prometheus-flask program, which parses and reformats the message and only then forwards it to the WeCom group, using the webhook key we obtained earlier. A rough sketch of this flow follows the repository link below.

Upstream GitHub: https://github.com/hsggj002/prometheus-flask
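Before looking at Alert.py, here is a rough mental model of the receiving side. This is not the project's actual main.py, just a simplified sketch of the same flow (route path, variable names and the placeholder key are mine): Alertmanager POSTs its webhook JSON, the Flask app reformats each alert, and relays it to the group robot.

#vim flow_sketch.py
import json
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<your-robot-key>"  # placeholder

@app.route("/", methods=["POST"])
def receive_alert():
    data = request.get_json(force=True)  # Alertmanager webhook payload
    for item in data.get("alerts", []):
        text = "## Alert: {}\nstatus: {}\ndetail: {}".format(
            item["labels"].get("alertname", "unknown"),
            item.get("status", "unknown"),
            item["annotations"].get("description", item["annotations"].get("message", "")),
        )
        body = {"msgtype": "markdown", "markdown": {"content": text}}
        requests.post(WEBHOOK_URL, data=json.dumps(body),
                      headers={"Content-type": "application/json"}, timeout=5)
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)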

Modify app/Alert.py

#vim app/Alert.py
# https://github.com/hsggj002/prometheus-flask/issues/3
# Summary of the changes made below:
# 1. The upstream code handles multiple data centers (a region label); our scenario is a single k8s cluster
#    plus a few external targets, so the region handling is removed.
# 2. Upstream puts the region into the alert text; here the Pod name is added instead, and the original
#    recovery message did not include a recovery time, so that is added as well.
# 3. Some alerts were sent with the Pod name shown as None; the code now falls back gracefully in that case.
# With these changes Alert.py can be used directly.
# -*- coding: UTF-8 -*-
from flask import jsonify
import requests
import json
import datetime
import sys

def parse_time(*args):
    # Convert Alertmanager's RFC3339 UTC timestamps into UTC+8, human-readable strings
    times = []
    for dates in args:
        eta_temp = dates
        if len(eta_temp.split('.')) >= 2:
            if 'Z' in eta_temp.split('.')[1]:
                s_eta = eta_temp.split('.')[0] + '.' + eta_temp.split('.')[1][-5:]
                fd = datetime.datetime.strptime(s_eta, "%Y-%m-%dT%H:%M:%S.%fZ")
            else:
                # no 'Z' directly after the fraction: drop the fraction and re-append 'Z'
                fd = datetime.datetime.strptime(eta_temp.split('.')[0] + 'Z', "%Y-%m-%dT%H:%M:%SZ")
        else:
            fd = datetime.datetime.strptime(eta_temp, "%Y-%m-%dT%H:%M:%SZ")
        eta = (fd + datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S.%f")
        times.append(eta)
    return times

def alert(status,alertnames,levels,times,pod_name,ins,instance,description):
    params = json.dumps({
        "msgtype": "markdown",
        "markdown":
            {
                "content": "## <font color=\"red\">告警通知: {0}</font>\n**告警名称:** <font color=\"warning\">{1}</font>\n**告警级别:** {2}\n**告警时间:** {3}\n**Pod名称**: {4}\n{5}: {6}\n**告警详情:** <font color=\"comment\">{7}</font>".format(status,alertnames,levels,times[0],pod_name,ins,instance,description)
            }
        })

    return params

def recive(status,alertnames,levels,times,pod_name,ins,instance,description):
    params = json.dumps({
        "msgtype": "markdown",
        "markdown":
            {
                "content": "## <font color=\"info\">恢复通知: {0}</font>\n**告警名称:** <font color=\"warning\">{1}</font>\n**告警级别:** {2}\n**告警时间:** {3}\n**恢复时间:** {4}\n**Pod名称:** {5}\n{6}: {7}\n**告警详情:** <font color=\"comment\">{8}</font>".format(status,alertnames,levels,times[0],times[1],pod_name,ins,instance,description)
            }
        })

    return params

def webhook_url(params,url_key):
    headers = {"Content-type": "application/json"}
    # ***** Important: url_key must be the full WeCom robot webhook URL *****
    url = "{}".format(url_key)
    r = requests.post(url, data=params, headers=headers)

def send_alert(json_re,url_key):
    print(json_re)
    for i in json_re['alerts']:
        if i['status'] == 'firing':
            if "instance" in i['labels'] and "pod" in i['labels']:
                if "description" in i['annotations']:
                    webhook_url(alert(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt']),i['labels']['pod'],'Instance',i['labels']['instance'],i['annotations']['description']),url_key)

                elif "message" in i['annotations']:
                    webhook_url(alert(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt']),i['labels']['pod'],'Instance',i['labels']['instance'],i['annotations']['message']),url_key)
                else:
                    webhook_url(alert(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt']),i['labels']['pod'],'Instance',i['labels']['instance'],'Service is wrong'),url_key)

            elif "namespace" in i['labels']:
                # i['labels'].get('pod', 'None') falls back to 'None' when the alert carries no pod label
                webhook_url(alert(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt']),i['labels'].get('pod', 'None'),'Namespace',i['labels']['namespace'],i['annotations']['description']),url_key)

            elif "Watchdog" in i['labels']['alertname']:
                webhook_url(alert(i['status'],i['labels']['alertname'],'0','0','0','Instance','self-test','0'),url_key)

        elif i['status'] == 'resolved':  # recovery notifications mirror the firing branch, but use recive() with both startsAt and endsAt
            if "instance" in i['labels'] and "pod" in i['labels']:
                if "description" in i['annotations']:
                    webhook_url(recive(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt'],i['endsAt']),i['labels']['pod'],'Instance',i['labels']['instance'],i['annotations']['description']),url_key)

                elif "message" in i['annotations']:
                    webhook_url(recive(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt'],i['endsAt']),i['labels']['pod'],'Instance',i['labels']['instance'],i['annotations']['message']),url_key)

                else:
                    webhook_url(recive(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt'],i['endsAt']),i['labels']['pod'],'Instance',i['labels']['instance'],'Service is wrong'),url_key)

            elif "namespace" in i['labels']:
                # i['labels'].get('pod', 'None') falls back to 'None' when the alert carries no pod label
                webhook_url(recive(i['status'],i['labels']['alertname'],i['labels']['severity'],parse_time(i['startsAt'],i['endsAt']),i['labels'].get('pod', 'None'),'Namespace',i['labels']['namespace'],i['annotations']['description']),url_key)

            elif "Watchdog" in i['labels']['alertname']:
                webhook_url(alert(i['status'],i['labels']['alertname'],'0','0','0','Instance','self-test','0'),url_key)
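Before building an image it can help to exercise the modified function by hand. The sketch below (file name, sample values and the placeholder key are mine) calls send_alert() with a hand-written firing alert; run it from the app/ directory so that Alert.py can be imported:

#vim test_alert_local.py
from Alert import send_alert  # Alert.py as modified above

URL_KEY = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<your-robot-key>"  # placeholder

sample = {
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodNotReady",
                "severity": "warning",
                "instance": "10.0.0.1:10250",
                "pod": "nginx-5f4b7c9d6-abcde",
            },
            "annotations": {"description": "Pod default/nginx has been in a non-ready state for 15 minutes."},
            "startsAt": "2024-01-01T03:00:00.000Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }
    ]
}

send_alert(sample, URL_KEY)  # a formatted alert card should appear in the WeCom group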

Build the image

Note that the python:3.10.2 base image comes from Docker Hub, which is currently blocked for direct access from mainland China, so you may need a registry mirror to pull it.

# Modify the Dockerfile
#vim prometheus-flask/Dockerfile
FROM python:3.10.2
COPY ./app /app
COPY ./requirements.txt /app/requirements.txt
WORKDIR /app
# Use a domestic PyPI mirror, since the packages in requirements.txt download very slowly otherwise
RUN pip install -r /app/requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
ENTRYPOINT ["python", "/app/main.py"]
 
# Modify requirements.txt
#vim prometheus-flask/requirements.txt
flask_json == 0.3.4
flask == 2.0.1
requests == 2.19.1
gevent == 21.12.0    
Werkzeug == 2.0.1
#docker build -t registry.cn-beijing.aliyuncs.com/devops-op/prometheus-flask:v3 .

Write the Deployment resource

#vim pyhook.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertinfo
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertinfo
      release: stable
  template:
    metadata:
      labels:
        app: alertinfo 
        release: stable
    spec:
      containers:
      - name: alertinfo-flask
        image: registry.cn-beijing.aliyuncs.com/devops-op/prometheus-flask:v3
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 80
        args:
        - "-k https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=企微机器人webhook Key值"
        - "-p 80"
---
apiVersion: v1
kind: Service
metadata:
  name: alertinfo-svc
  namespace: monitoring
spec:
  selector:
    app: alertinfo
    release: stable
  type: ClusterIP
  ports:
  - name: http 
    targetPort: 80
    port: 80


3. Verification

Verify that the alert notification pipeline works as expected.

#Create a Deployment that cannot run properly (bad image tag), to trigger an alert
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-01-11T02:54:57Z"
  generation: 1
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1  #desired number of Pod replicas
  revisionHistoryLimit: 10     #how many old ReplicaSets to keep for rollbacks; to record the rollout history, pass "--record" when creating the Deployment
  selector:   #labels the Deployment uses to find the Pods it manages
    matchLabels:
      app: nginx
  strategy:   #update strategy: how new Pods replace old ones
    rollingUpdate:      #rolling-update parameters
      maxSurge: 25%  #during a rolling update, at most 25% more Pods than desired may exist
      maxUnavailable: 25%   #at most 25% of Pods may be unavailable; the old ReplicaSet is scaled down to 75% first, keeping 75% of Pods available during the update
    type: RollingUpdate  #update strategy type; default is RollingUpdate, the alternative is Recreate (terminate all old Pods first, then roll out the new version)
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: registry.cn-beijing.aliyuncs.com/devops-op/nginx:latest222 
        imagePullPolicy: IfNotPresent   #image pull policy; IfNotPresent means do not re-pull if the image already exists locally
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst  #DNS policy: resolve through the cluster DNS first
      restartPolicy: Always  #restart policy
      schedulerName: default-scheduler   #scheduler to use; here the default scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Because the image tag does not exist, the Pod never becomes ready and a Pod-not-ready alert notification is triggered; once you fix the image manually, the recovery notification follows.
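As an alternative to breaking a Deployment, you can also push a synthetic alert straight into Alertmanager through its v2 API and watch it travel through the whole pipeline. A sketch under the same port-forward assumption as before; the label values are made up:

#vim send_test_alert.py
# Inject a synthetic alert into Alertmanager to exercise routing, the webhook receiver and the WeCom group.
# Assumes: kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093
import requests

alerts = [
    {
        "labels": {
            "alertname": "ManualPipelineTest",
            "severity": "warning",
            "namespace": "default",
            "pod": "fake-pod-123",
            "instance": "10.0.0.1:10250",
        },
        "annotations": {"description": "synthetic alert used to test WeCom notifications, please ignore"},
    }
]

resp = requests.post("http://127.0.0.1:9093/api/v2/alerts", json=alerts, timeout=5)
print(resp.status_code, resp.text)  # 200 means Alertmanager accepted the alert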


You can likewise simulate a TargetDown alert and confirm that both the firing and the recovery notifications arrive.
