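Components

| Component | Description | Link |
| --- | --- | --- |
| Prometheus | Popular standalone open-source monitoring and alerting project | https://prometheus.io/ |
| Alertmanager | Manages the alerts sent by Prometheus | https://prometheus.io/docs/alerting/latest/alertmanager/ |
| pushgateway | Intermediary that receives pushed metrics so they can be collected | https://github.com/prometheus/pushgateway |
| PrometheusAlert | Open-source ops alert center and message forwarding system | https://github.com/feiyu563/PrometheusAlert |
| Feishu | Office collaboration product | https://open.feishu.cn/document/ukTMukTMukTM/uUjNz4SN2MjL1YzM |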

Component flow diagram

Prometheus + Alertmanager + PrometheusAlert + Feishu: containerized deployment in practice

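Deploying the containerized test environment

This setup exists only to test and verify functionality quickly, so each service is started with a single docker run. Use it with caution in production!

Deploying the Docker service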

# Download the Docker CE yum repo file
curl https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo -o /etc/yum.repos.d/docker.repo
# Rebuild the yum cache
yum clean all
yum makecache
# List all available versions
yum list docker-ce --showduplicates | sort -r
# Install a specific Docker version
yum install -y docker-ce-20.10.5
# Start the Docker service
systemctl start docker
# Check the Docker service status
systemctl status docker
# Enable Docker to start on boot
systemctl enable docker

Deploying the Prometheus service

# Create directories for the configuration file, alert rules, and data
mkdir -p /data/test/prometheus/{etc,data,rules}
# Prometheus configuration file
# cat /data/test/prometheus/etc/prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
# Container startup script
# cat start_prometheus.sh
docker run -d --user root -p 9090:9090 --name prometheus \
    -v /data/test/prometheus/etc/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /data/test/prometheus/rules:/etc/prometheus/rules \
    -v /data/test/prometheus/data:/data/prometheus \
    prom/prometheus \
    --config.file="/etc/prometheus/prometheus.yml" \
    --storage.tsdb.path="/data/prometheus" \
    --web.listen-address="0.0.0.0:9090"
# Check the Prometheus container status
docker ps -a | grep prometheus
# View the Prometheus container logs
docker logs prometheus
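
Beyond reading the logs, the running instance can be probed directly. The following is a minimal sketch, assuming Prometheus is reachable from the Docker host at localhost:9090 and that the prom/prometheus image in use includes promtool. PROM_URL and the function names are introduced here for illustration only.

```shell
# PROM_URL is an assumption; adjust host and port to your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build the URL of a Prometheus management endpoint ("healthy" or "ready").
prom_endpoint() {
    printf '%s/-/%s\n' "$PROM_URL" "$1"
}

# Probe both endpoints; curl -f turns an HTTP error into a non-zero exit code.
check_prometheus() {
    curl -fsS "$(prom_endpoint healthy)" && curl -fsS "$(prom_endpoint ready)"
}

# Validate the mounted configuration file inside the running container.
check_config() {
    docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
}
```

Running `check_prometheus && check_config` after startup gives a quick pass/fail signal before any scrape targets are added.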

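Deploying the Node_Exporter service

To make alert testing easier later on, node_exporter is also deployed as a container.

# Create the working directory
mkdir -p /data/test/node_exporter
# Container startup script
# cat start_node.sh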
docker run -d --name node_exporter -p 9100:9100 \
    --restart=always \
    -v "/proc:/host/proc:ro" \
    -v "/sys:/host/sys:ro" \
    -v "/:/rootfs:ro" \
    prom/node-exporter \
    --path.procfs=/host/proc \
    --path.rootfs=/rootfs \
    --path.sysfs=/host/sys \
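    --collector.filesystem.ignored-mount-points='^/(sys|proc|dev|host|etc)($|/)'
# Note: newer node_exporter releases rename this flag to
# --collector.filesystem.mount-points-exclude and deprecate the old name.
# Check the node_exporter container status
docker ps -a | grep node_exporter
# View the node_exporter container logs
docker logs node_exporter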

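On the Prometheus server, add the following job under scrape_configs in /data/test/prometheus/etc/prometheus.yml:

  - job_name: 'node-service'
    static_configs:
      - targets: ['your-ip:9100'] # replace with the actual IP address

Restart the Prometheus container:

docker restart prometheus

To verify that data is being collected, open http://your-ip:9090/ in a browser.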
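
The same check can be scripted against the Prometheus HTTP API rather than done in the browser. This is a sketch under assumptions: PROM_URL and both function names are made up for illustration, and the value extraction uses grep and sed only so that no JSON tool is needed (jq would be sturdier where available).

```shell
# PROM_URL is an assumption; replace your-ip with the real address.
PROM_URL="${PROM_URL:-http://your-ip:9090}"

# Run an instant PromQL query through the HTTP API and print the raw JSON.
prom_query() {
    curl -fsS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Read an instant-query response on stdin and print each sample value,
# one per line ("1" means the last scrape succeeded, "0" means it failed).
up_values() {
    grep -o '"value":\[[^]]*]' | sed 's/.*,"\([^"]*\)"]$/\1/'
}
```

For example, `prom_query 'up{job="node-service"}' | up_values` should print 1 once the node_exporter target is being scraped.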


Deploying the Alertmanager service

# Create directories for the configuration file and data
mkdir -p /data/test/alertmanager/{etc,data}

Create the configuration file:

# cat /data/test/alertmanager/etc/alertmanager.yml

global:
  resolve_timeout: 2m
  smtp_from: '319981932@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '319981932@qq.com'
  # Note: this must be the QQ Mail authorization code, not the login password; the code can be found in the account settings
  smtp_auth_password: 'abcdefghijklmmop'
  smtp_require_tls: false

route:
  #group_by: ['alert_node']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'default'

receivers:
- name: 'default'
  wechat_configs:
  - corp_id: 'your-id'
    to_user: '1234567'
    agent_id: your-id # note: must be an integer
    api_secret: 'your-secret'
    send_resolved: true
# Container startup script
# cat start_alert.sh
docker run -d --user root -p 9093:9093 --name alertmanager \
    -v /data/test/alertmanager/etc/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
    -v /data/test/alertmanager/data:/alertmanager/data \
    prom/alertmanager:latest \
    --config.file="/etc/alertmanager/alertmanager.yml" \
    --web.listen-address="0.0.0.0:9093"
# Check the Alertmanager container status
docker ps -a | grep alertmanager
# View the Alertmanager container logs
docker logs alertmanager
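
A syntax error in alertmanager.yml only shows up in the logs after the container has failed to start, so it can help to validate the file explicitly. A minimal sketch, assuming the prom/alertmanager image in use includes amtool. The DOCKER variable and the function name are hypothetical, and setting DOCKER=echo gives a dry run that only prints the command.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
DOCKER="${DOCKER:-docker}"

# Validate the mounted Alertmanager configuration inside the running container.
check_am_config() {
    $DOCKER exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
}
```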

Look up the Alertmanager container's IP address; Prometheus needs it to connect to Alertmanager:

# docker exec -it alertmanager /bin/sh -c "ip a"

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
    link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Update the Prometheus configuration file to point at Alertmanager:

# cat /data/test/prometheus/etc/prometheus.yml

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.17.0.4:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/*rules.yml"
  # - "second_rules.yml"
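
A container IP on Docker's default bridge can change when the container is recreated, which would silently break the target above. One alternative, sketched here under assumptions, is to attach both containers to a user-defined bridge network, where Docker resolves container names through its built-in DNS, so the target can be written as alertmanager:9093. The network name monitor, the DOCKER variable, and the function names are all hypothetical.

```shell
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"
# NET_NAME is a hypothetical network name chosen for this sketch.
NET_NAME="${NET_NAME:-monitor}"

# Print a container's IP address(es) without opening a shell inside it.
container_ip() {
    $DOCKER inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' "$1"
}

# Create the network if it is missing, then attach both containers to it.
join_monitor_network() {
    $DOCKER network inspect "$NET_NAME" >/dev/null 2>&1 || $DOCKER network create "$NET_NAME"
    $DOCKER network connect "$NET_NAME" alertmanager
    $DOCKER network connect "$NET_NAME" prometheus
}
```

After `join_monitor_network`, the targets entry could read `- alertmanager:9093` instead of a fixed IP. `docker network connect` reports an error if a container is already attached, so the function is meant to be run once.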

Configure the alert rule:

# cat /data/test/prometheus/rules/alert-node-rules.yml

groups:
  - name: alert-node
    rules:
    - alert: NodeDown
      # Note: the job name here must match the job_name set in the Prometheus configuration file
      expr: up{job="node-service"} == 0
      for: 1m
      labels:
        severity: critical
        instance: "{{ $labels.instance }}"
      annotations:
        summary: "instance: {{ $labels.instance }} is down"
        description: "Instance: {{ $labels.instance }} has been down for 1 minute"
        value: "{{ $value }}"
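
A malformed rule file is easier to catch before Prometheus loads it. The sketch below assumes the rules directory is mounted at /etc/prometheus/rules as in the startup script above and that promtool is available in the image. RULE_FILE, DOCKER, and the function name are introduced for illustration.

```shell
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"
# Path of the rule file as seen from inside the container.
RULE_FILE="${RULE_FILE:-/etc/prometheus/rules/alert-node-rules.yml}"

# Check rule syntax and expressions with promtool inside the container.
check_rules() {
    $DOCKER exec prometheus promtool check rules "$RULE_FILE"
}
```

Chaining it as `check_rules && docker restart prometheus` means the restart only happens when the rules parse cleanly.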

Restart the Prometheus container:

docker restart prometheus

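To verify that the WeChat alert arrives, stop node_exporter on the monitored server:

docker stop node_exporter

Then refresh the Prometheus page and open the Alerts menu. The NodeDown rule is first in the PENDING state; after waiting one minute and refreshing again, it has changed to FIRING. At this point, check WeChat for the alert message.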


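This confirms that the alert notification was received. Now bring the service back up; because send_resolved is enabled, a recovery notification then arrives: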

docker start node_exporter
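
Stopping a real exporter is not the only way to exercise the notification path. A hand-made alert can be posted straight to Alertmanager's v2 API. This is a sketch under assumptions: AM_URL, the function names, and the alert name and summary text are illustrative and not part of the original setup.

```shell
# AM_URL is an assumption; replace your-ip with the Alertmanager host.
AM_URL="${AM_URL:-http://your-ip:9093}"

# Build a one-alert JSON payload. $1 = alert name, $2 = severity label.
alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}

# POST the payload; Alertmanager then routes it to the configured receiver.
send_test_alert() {
    alert_payload "$1" "$2" |
        curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- "$AM_URL/api/v2/alerts"
}
```

For example, `send_test_alert ManualTest critical` should produce a notification through the default receiver once group_wait has elapsed.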

Deploying pushgateway

pushgateway is already deployed on the company's Kubernetes cluster, so it is used directly here. The deployment YAML is as follows:

# cat pushgateway.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-mon
  name:  pushgateway
  labels:
    app:  pushgateway
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9091"
spec:
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app:  pushgateway
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "25%"
      maxUnavailable: "25%"
  template:
    metadata:
      name:  pushgateway
      labels:
        app:  pushgateway
    spec:
      containers:
        - name:  pushgateway
          image: prom/pushgateway
          imagePullPolicy: IfNotPresent
          args:
          - "--web.telemetry-path=/metrics"
          - "--persistence.interval=5m"
          livenessProbe:
            initialDelaySeconds: 600
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
            httpGet:
              path: /
              port: 9091
          ports:
            - name: "http"
              containerPort: 9091
          resources:
            limits:
              memory: "1000Mi"
              cpu: 1
            requests:
              memory: "1000Mi"
              cpu: 1

---

apiVersion: v1
kind: Service
metadata:
  name: pushgateway
  namespace: kube-mon
  labels:
    app: pushgateway
spec:
  selector:
    app: pushgateway
  type: NodePort
  ports:
    - name: pushgateway
      port: 9091
      targetPort: http
      nodePort: 30037
# Apply the pushgateway manifest; it uses the kube-mon namespace, so create that manually if it does not exist
kubectl apply -f pushgateway.yaml
# Check the pushgateway status
kubectl get pod -n kube-mon
kubectl get svc -n kube-mon

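On the Prometheus server, add the following job under scrape_configs in the configuration file:

  - job_name: 'pushgateway'
    scrape_interval: 10s
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ["your-ip:30037"] # replace with the actual IP address

Restart the Prometheus container:

docker restart prometheus

To verify the pushgateway status, open Prometheus in a browser and go to Status --> Targets; the pushgateway job should be listed there.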
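
Seeing the target listed only proves that Prometheus can scrape pushgateway. To confirm the whole path, a sample can be pushed by hand and then queried in Prometheus. A minimal sketch, assuming the NodePort 30037 from the Service above; PGW_URL, the function names, and the metric and job names are hypothetical.

```shell
# PGW_URL is an assumption; replace your-ip with a reachable node address.
PGW_URL="${PGW_URL:-http://your-ip:30037}"

# Render one gauge sample in the Prometheus text exposition format.
# $1 = metric name, $2 = value.
metric_line() {
    printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# Push the sample under the grouping key job=<$3>.
push_metric() {
    metric_line "$1" "$2" | curl -fsS --data-binary @- "$PGW_URL/metrics/job/$3"
}
```

For example, `push_metric demo_metric 3.14 demo_job` pushes one sample, and demo_metric should become queryable in Prometheus after the next scrape. The Pushgateway documentation also recommends setting honor_labels: true on the scrape job so that pushed job and instance labels are preserved.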


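Deploying PrometheusAlert

For this part, see the article "Deploying PrometheusAlert in Kubernetes with MySQL as the backend store". Since the goal here is only testing, the default sqlite3 backend can be used in place of the MySQL backend described in that article.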


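Summary

At this point the basic container environment is deployed. The follow-up will test and verify:

(1) Label-based routing and distribution with Alertmanager combined with PrometheusAlert templates;

(2) Custom metric monitoring with pushgateway, using custom PrometheusAlert templates;

(3) A comparison of custom PrometheusAlert templates across different notification and alert scenarios.

References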