5. Prometheus alerting component: Alertmanager
(main)

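Prometheus itself does not deliver alert notifications; it relies on a separate component, Alertmanager, for that. Alertmanager receives the alerts sent by Prometheus, applies a series of processing steps to them, and then sends them to the designated recipients.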

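How Prometheus triggers an alert:

Prometheus -> threshold crossed -> `for` duration exceeded -> Alertmanager -> grouping | inhibition | silencing -> notification channel -> email | DingTalk | WeChat, etc.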
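
The "threshold crossed, then duration exceeded" part of this flow can be sketched in Java as a small state machine. This is only an illustration under stated assumptions, not Prometheus's actual code: the class name AlertForSketch, its State enum, and the evaluate method are hypothetical, and the 0.85 threshold and 1-minute duration are borrowed from the CPU rule later in this article.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of how a `for` duration delays an alert before it is sent. */
final class AlertForSketch {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final Duration holdFor;
    private Instant activeSince; // null while the condition does not hold

    AlertForSketch(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Called once per evaluation interval with the current value of the expression. */
    State evaluate(double value, Instant now) {
        if (value <= threshold) {
            activeSince = null; // condition no longer holds: reset the timer
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now; // first evaluation above the threshold
        }
        Duration active = Duration.between(activeSince, now);
        // Only once the condition has held for the whole duration is the alert sent on.
        return active.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        AlertForSketch cpu = new AlertForSketch(0.85, Duration.ofMinutes(1));
        Instant t0 = Instant.parse("2020-02-29T11:45:00Z");
        System.out.println(cpu.evaluate(0.90, t0));                 // PENDING
        System.out.println(cpu.evaluate(0.95, t0.plusSeconds(30))); // PENDING
        System.out.println(cpu.evaluate(0.99, t0.plusSeconds(60))); // FIRING
        System.out.println(cpu.evaluate(0.40, t0.plusSeconds(75))); // INACTIVE
    }
}
```

A single dip below the threshold resets the timer, which is why a short `for` value filters out brief spikes.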


5.1. Building a custom monitoring and alerting system with Prometheus + Alertmanager + webhook

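The following is mainly based on:

The article "Machine load and business monitoring with prometheus+grafana+mtail+node_exporter" introduced a Prometheus-based system, built with mtail and node_exporter, that monitors machine load and business metrics without instrumenting application code. This article adds custom alerting on top of it.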

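Alerting with Prometheus + Alertmanager has two parts:
Prometheus holds the alert rules and sends the resulting alerts to Alertmanager.
Alertmanager manages those alerts, including silencing, inhibition, grouping, and sending notifications.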

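Alertmanager can send notifications in several ways. It has built-in integrations such as email, Slack, and WeChat Work, and it also provides a webhook mechanism for adding further notification channels; many examples online use it to integrate third-party software such as DingTalk. This article covers email alerts and a custom notification webhook built with Java.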

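This article is divided into four parts:
Configuring alert rules in Prometheus
Configuring and deploying Alertmanager
Connecting Prometheus to Alertmanager
Configuring the alert notification methods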

5.1.1. Configuring alert rules in Prometheus

prometheus.yml configuration properties

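scrape_interval

How often samples are scraped. Defaults to once per minute.

evaluation_interval

How often alert rules are evaluated. Defaults to once per minute.

rule_files

The files that contain the alert rules.

scrape_configs

The scrape job configuration; multiple jobs can be defined under it.

job_name

The job name; must be unique.

static_configs

Static target configuration for a job. file_sd_configs is generally used instead because it supports hot reloading.

file_sd_configs

Dynamic, file-based target configuration for a job; with it, changes to the target files are picked up without restarting Prometheus.

files

The list of service discovery file paths used by file_sd_configs. Supports .json, .yml, and .yaml; the last path segment may contain the wildcard *.

refresh_interval

How often the files listed in file_sd_configs are re-read. Defaults to 5 minutes.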

Here we use the rule_files property to specify the alert rule files (configured in prometheus.yml as follows):

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["172.17.0.2:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Multiple rule files can be listed, and the wildcard * is supported
rule_files:
  - "rules/host_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["172.17.0.2:9090"]

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.17.0.2:8080']


  - job_name: 'push-metrics'
    static_configs:
      - targets: ['172.17.0.2:9091']
        labels:
          instance: pushgateway

Define the alert rules for Prometheus in rules/host_rules.yml:

groups:
# Name of the alert group
- name: hostStatsAlert
  # Rules in this group
  rules:
  # Alert name; should be unique
  - alert: hostCpuUsageAlert
    # PromQL expression
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    # The alert fires only after the expression has held for longer than the `for` duration
    for: 1m
    labels:
      # Severity level
      severity: page
    annotations:
      # Title of the alert that is sent
      summary: "Instance {{ $labels.instance }} CPU usage is too high"
      # Body of the alert that is sent
      description: "Instance {{ $labels.instance }} CPU usage exceeds 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "Instance {{ $labels.instance }} memory usage exceeds 85% (current value: {{ $value }})"

After configuring the rules, open http://localhost:19090/alerts to view the configured alerts.

5.1.2. Downloading, installing, and starting Alertmanager

tar -zxvf alertmanager-0.22.2.linux-amd64.tar.gz -C /root/installed/

cd /root/installed/alertmanager
nohup ./alertmanager --config.file=alertmanager.yml > alertmanager.file 2>&1 &

URL when accessed on the server itself:

http://localhost:9093/

URL when accessed from the local machine:

http://localhost:19093/#/alerts

5.1.3. Creating the Alertmanager configuration file

The extracted Alertmanager archive contains a default alertmanager.yml configuration file with the following contents:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

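Alertmanager's configuration has two main parts: the route (route) and the receivers (receivers). Every alert enters the routing tree at the top-level route in the configuration and is sent to the matching receiver according to the routing rules.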
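
To make the routing-tree idea concrete, here is a minimal Java sketch. It is an illustration only, not Alertmanager's implementation: RouteSketch and its methods are hypothetical names, matchers are reduced to exact label equality, and features such as `continue`, regex matchers, and grouping are left out. The assumption is that an alert walks down the tree, the first matching child route wins, and the current node's receiver is used when no child matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of a routing tree node. */
final class RouteSketch {
    final String receiver;
    final Map<String, String> match = new HashMap<>(); // empty: matches every alert
    final List<RouteSketch> children = new ArrayList<>();

    RouteSketch(String receiver) {
        this.receiver = receiver;
    }

    RouteSketch matching(String label, String value) {
        match.put(label, value);
        return this;
    }

    RouteSketch child(RouteSketch route) {
        children.add(route);
        return this;
    }

    boolean matches(Map<String, String> labels) {
        for (Map.Entry<String, String> e : match.entrySet()) {
            if (!e.getValue().equals(labels.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** First matching child wins; otherwise this node's receiver handles the alert. */
    String resolve(Map<String, String> labels) {
        for (RouteSketch c : children) {
            if (c.matches(labels)) {
                return c.resolve(labels);
            }
        }
        return receiver;
    }

    public static void main(String[] args) {
        // Top-level route falls back to 'web.hook'; a child route sends severity=page to 'mail'.
        RouteSketch root = new RouteSketch("web.hook")
                .child(new RouteSketch("mail").matching("severity", "page"));

        Map<String, String> cpuAlert = new HashMap<>();
        cpuAlert.put("alertname", "hostCpuUsageAlert");
        cpuAlert.put("severity", "page");
        System.out.println(root.resolve(cpuAlert)); // mail

        Map<String, String> other = new HashMap<>();
        other.put("alertname", "somethingElse");
        other.put("severity", "warning");
        System.out.println(root.resolve(other)); // web.hook
    }
}
```

In the default alertmanager.yml above there are no child routes, so every alert ends up at the top-level receiver 'web.hook'.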

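5.1.4. Connecting Prometheus to Alertmanager

Simply configure the Alertmanager address under the alerting key in prometheus.yml, as follows (this was already done above; the snippet below is only a reference for deployment):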

alerting:
  alertmanagers:                # Alertmanager configuration
    - static_configs:
        - targets:
          - 172.17.0.2:9093      # IP and port of the Alertmanager server

rule_files:
  - "rules/*.yml"

5.1.5. Configuring the alert notification methods

5.1.5.1. Alertmanager email alert demo

The following is the configuration in alertmanager.yml:

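global:
  # Timeout
  resolve_timeout: 5m
  # The SMTP address must include the port
  smtp_smarthost: 'smtp.126.com:25'
  smtp_from: 'xxx@126.com'
  # Sender mailbox account
  smtp_auth_username: 'xxx@126.com'
  # Authorization code for the account (not the login password). Aliyun personal mailboxes
  # currently do not seem to offer one; for 126 mailboxes it can be found under "Settings"
  smtp_auth_password: '1qaz2wsx'
  smtp_require_tls: false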

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 4h
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: 'xxx@aliyun.com'

Once this is configured, any notification is delivered as an email.

5.1.5.2. Alertmanager webhook (Java) alert demo

For this, change alertmanager.yml to:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'webhook'
  routes:
    - receiver: webhook
      group_wait: 10s

receivers:
- name: 'webhook'
  webhook_configs:
  # The url below is the endpoint exposed by the custom Spring Boot project
  - url: 'http://172.17.0.2:8060/demo'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

With the webhook method, Alertmanager sends an HTTP POST request to the configured webhook URL. The request body is a JSON string, as follows (pretty-printed here):

{
    "receiver":"webhook",
    "status":"resolved",
    "alerts":[
        {
            "status":"resolved",
            "labels":{
                "alertname":"hostCpuUsageAlert",
                "instance":"192.168.199.24:9100",
                "severity":"page"
            },
            "annotations":{
                "description":"192.168.199.24:9100 CPU usage exceeds 85% (current value: 0.9973333333333395)",
                "summary":"Machine 192.168.199.24:9100 CPU usage is too high"
            },
            "startsAt":"2020-02-29T19:45:21.799548092+08:00",
            "endsAt":"2020-02-29T19:49:21.799548092+08:00",
            "generatorURL":"http://localhost.localdomain:9090/graph?g0.expr=sum+by%28instance%29+%28avg+without%28cpu%29+%28irate%28node_cpu_seconds_total%7Bmode%21%3D%22idle%22%7D%5B5m%5D%29%29%29+%3E+0.85&g0.tab=1",
            "fingerprint":"368e9616d542ab48"
        }
    ],
    "groupLabels":{
        "alertname":"hostCpuUsageAlert"
    },
    "commonLabels":{
        "alertname":"hostCpuUsageAlert",
        "instance":"192.168.199.24:9100",
        "severity":"page"
    },
    "commonAnnotations":{
        "description":"192.168.199.24:9100 CPU usage exceeds 85% (current value: 0.9973333333333395)",
        "summary":"Machine 192.168.199.24:9100 CPU usage is too high"
    },
    "externalURL":"http://localhost.localdomain:9093",
    "version":"4",
    "groupKey":"{}:{alertname=\"hostCpuUsageAlert\"}"
}
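
In this payload, commonLabels and commonAnnotations contain only the pairs shared by every alert in the group. The Java sketch below illustrates that idea as a plain map intersection; it is not Alertmanager's code, and the class CommonLabelsSketch, its common method, and the second instance address are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: derive the labels shared by all alerts in one notification. */
final class CommonLabelsSketch {

    static Map<String, String> common(List<Map<String, String>> alertLabels) {
        if (alertLabels.isEmpty()) {
            return new TreeMap<>();
        }
        // Start from the first alert's labels (sorted for stable output) ...
        Map<String, String> common = new TreeMap<>(alertLabels.get(0));
        // ... and drop every pair that some other alert does not share.
        for (Map<String, String> labels : alertLabels.subList(1, alertLabels.size())) {
            common.entrySet().removeIf(e -> !e.getValue().equals(labels.get(e.getKey())));
        }
        return common;
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("alertname", "hostCpuUsageAlert");
        first.put("instance", "192.168.199.24:9100");
        first.put("severity", "page");

        // Second alert in the same group, from an assumed second instance.
        Map<String, String> second = new HashMap<>(first);
        second.put("instance", "192.168.199.25:9100");

        System.out.println(common(Arrays.asList(first, second)));
        // {alertname=hostCpuUsageAlert, severity=page}
    }
}
```

With a single alert in the group, as in the payload above, the common labels are simply that alert's own labels, which is why "instance" appears under commonLabels there.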

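Next, build an HTTP request handler in Java (any other language works too, as long as it can handle HTTP requests) to process the alert notifications. The following sample code shows how to receive the data produced by alerts from the host_rules.yml rules: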

package com.demo.demo1.controller;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


@Slf4j
@Controller
@RequestMapping("/")
public class AlertController {

    // Handles the POST that Alertmanager sends to the webhook URL configured in alertmanager.yml.
    @RequestMapping(value = "/demo", produces = "application/json;charset=UTF-8")
    @ResponseBody
    public String pstn(@RequestBody String json) {
        log.debug("alert notify params: {}", json);
        // Default to a failure response; it is overwritten once a notification is sent.
        Map<String, Object> result = new HashMap<>();
        result.put("msg", "alert failed");
        result.put("code", 0);

        if (StringUtils.isBlank(json)) {
            return JSON.toJSONString(result);
        }
        JSONObject jo = JSON.parseObject(json);

        // commonAnnotations carries the summary and description defined in the alert rule.
        JSONObject commonAnnotations = jo.getJSONObject("commonAnnotations");
        // "firing" or "resolved"; read here but not used further in this demo.
        String status = jo.getString("status");
        if (commonAnnotations == null) {
            return JSON.toJSONString(result);
        }

        String subject = commonAnnotations.getString("summary");
        String content = commonAnnotations.getString("description");
        // Recipients of the email and SMS notifications.
        List<String> emailusers = new ArrayList<>();
        emailusers.add("xxx@aliyun.com");

        List<String> users = new ArrayList<>();
        users.add("158*****5043");

        // Util is a helper class from the author's project (not shown here) that sends email and SMS.
        try {
            boolean success = Util.email(subject, content, emailusers);
            if (success) {
                result.put("msg", "alert succeeded");
                result.put("code", 1);
            }
        } catch (Exception e) {
            log.error("=alert email notify error. json={}", json, e);
        }
        try {
            boolean success = Util.sms(subject, content, users);
            if (success) {
                result.put("msg", "alert succeeded");
                result.put("code", 1);
            }
        } catch (Exception e) {
            log.error("=alert sms notify error. json={}", json, e);
        }

        return JSON.toJSONString(result);
    }

}
5.1.5.3. A complete, simple Spring Boot project example
5.1.5.3.1. Project structure

[Figure: screenshot of the project structure]

5.1.5.3.2.pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <parent>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-parent</artifactId>
      <version>2.3.5.RELEASE</version>
      <relativePath/> <!-- lookup parent from repository -->
   </parent>
   <groupId>com.example</groupId>
   <artifactId>demo</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <name>demo</name>
   <description>Demo project for Spring Boot</description>
   <properties>
      <java.version>1.8</java.version>
   </properties>
   <dependencies>
      <dependency>
         <groupId>org.springframework.boot</groupId>
         <artifactId>spring-boot-starter-web</artifactId>
      </dependency>

      <dependency>
         <groupId>com.h2database</groupId>
         <artifactId>h2</artifactId>
         <scope>runtime</scope>
      </dependency>
      <dependency>
         <groupId>org.projectlombok</groupId>
         <artifactId>lombok</artifactId>
         <optional>true</optional>
      </dependency>
      <dependency>
         <groupId>org.springframework.boot</groupId>
         <artifactId>spring-boot-starter-test</artifactId>
         <scope>test</scope>
      </dependency>
      <!-- JSON Configuration -->
      <dependency>
         <groupId>com.alibaba</groupId>
         <artifactId>fastjson</artifactId>
         <version>1.2.6</version>
      </dependency>

      <!--<dependency>
         <groupId>org.apache.commons</groupId>
         <artifactId>commons-lang3</artifactId>
         <version>3.11</version>
      </dependency>-->

   </dependencies>

   <build>
      <plugins>
         <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
               <excludes>
                  <exclude>
                     <groupId>org.projectlombok</groupId>
                     <artifactId>lombok</artifactId>
                  </exclude>
               </excludes>
            </configuration>
         </plugin>
      </plugins>
   </build>
   <repositories>
      <repository>
         <id>spring-milestones</id>
         <name>Spring Milestones</name>
         <url>https://repo.spring.io/milestone</url>
         <snapshots>
            <enabled>false</enabled>
         </snapshots>
      </repository>
      <repository>
         <id>spring-snapshots</id>
         <name>Spring Snapshots</name>
         <url>https://repo.spring.io/snapshot</url>
         <releases>
            <enabled>false</enabled>
         </releases>
      </repository>
   </repositories>
   <pluginRepositories>
      <pluginRepository>
         <id>spring-milestones</id>
         <name>Spring Milestones</name>
         <url>https://repo.spring.io/milestone</url>
         <snapshots>
            <enabled>false</enabled>
         </snapshots>
      </pluginRepository>
      <pluginRepository>
         <id>spring-snapshots</id>
         <name>Spring Snapshots</name>
         <url>https://repo.spring.io/snapshot</url>
         <releases>
            <enabled>false</enabled>
         </releases>
      </pluginRepository>
   </pluginRepositories>

</project>
5.1.5.3.3.AlertController
package com.example.demo;

import com.alibaba.fastjson.JSON;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import java.util.HashMap;
import java.util.Map;

@Controller
@RequestMapping("/")
public class AlertController {

    private final static Logger logger = LoggerFactory.getLogger(AlertController.class);

    @RequestMapping(value = "/demo", produces = "application/json;charset=UTF-8")
    @ResponseBody
    public String pstn(@RequestBody String json) {
        // Write the payload straight to the log file (at ERROR level, because the
        // file appender in logback.xml only accepts ERROR entries)
        logger.error(json);

        Map<String, Object> result = new HashMap<>();
        result.put("msg", "alert failed");
        result.put("code", 0);

//        if(StringUtils.isBlank(json)){
//            return JSON.toJSONString(result);
//        }
//        JSONObject jo = JSON.parseObject(json);
//
//        JSONObject commonAnnotations = jo.getJSONObject("commonAnnotations");
//        String status = jo.getString("status");
//        if (commonAnnotations == null) {
//            return JSON.toJSONString(result);
//        }
//
//        String subject = commonAnnotations.getString("summary");
//        String content = commonAnnotations.getString("description");
//
//        result.put("subject",subject);
//        result.put("content",content);

        return JSON.toJSONString(result);
    }

}
5.1.5.3.4.DemoApplication
package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {

   public static void main(String[] args) {
      SpringApplication.run(DemoApplication.class, args);
   }

}
5.1.5.3.5.application.properties
server.port=8060
server.tomcat.uri-encoding=utf-8
5.1.5.3.6.logback.xml
<?xml version="1.0" encoding="UTF-8"?>

<configuration>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>
                %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
            </pattern>
        </encoder>
    </appender>

    <!-- File logger, rolled by date -->
    <appender name="fileInfoApp" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <!-- Path and name of the log file currently being written -->
        <!-- <file>${LOG_PATH}/warn/log_warn.log</file> -->
        <!-- Rolling policy of the logger: by date and by size -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- Path of the archived log files. For example, if today is 2013-12-21, the file currently being written is the one given by the file element; it can point to a different path than the archives so that the current and archived logs live in different directories.
            The 2013-12-21 archive is named by fileNamePattern: %d{yyyy-MM-dd} sets the date format and %i the index -->
            <fileNamePattern>log/log-error-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <!-- Keep only the last 30 days of logs so that they cannot fill the disk -->
            <maxHistory>30</maxHistory>
            <!-- Upper limit on the total size of the log files; with 1GB, for example, old logs are deleted once this value is reached -->
            <totalSizeCap>1GB</totalSizeCap>
            <!-- Besides rolling by date, a log file may not exceed 2 MB; when it does, files are named with an index starting at 0, e.g. log-error-2013-12-21.0.log -->
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>2MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
        <!-- Append to the log file -->
        <append>true</append>
        <!-- Format of the log file -->
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>===%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger Line:%-3L - %msg%n</pattern>
            <charset>utf-8</charset>
        </encoder>
        <!-- This log file records only ERROR-level entries -->
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <!-- Keep only error logs -->
            <!-- level:debug,info,warn,error -->
            <level>ERROR</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- The root element must come after the appenders -->
    <root level="INFO">
        <appender-ref ref="STDOUT" />
        <appender-ref ref="fileInfoApp" />
    </root>
</configuration>
5.1.5.3.7. Packaging, running, and viewing the log

Build the package in the IDEA terminal:

mvn clean install

Copy the built package demo-0.0.1-SNAPSHOT.jar to /root/workspace, next to a start.sh script with the following contents:

[root@node1 workspace]# cat start.sh
cd /root/workspace

nohup java -jar demo-0.0.1-SNAPSHOT.jar  > demo.log 2>&1 &

Check the log to see the details (omitted here). The alert JSON has the same format as the webhook payload shown in section 5.1.5.2.

5.1.6. Other example alert rule configurations for Prometheus

The following is taken from:

Alert for a node going down (node_down.yml):

groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

Reference configuration for a node memory usage alert (memory_over.yml):

groups:
- name: example
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      user: caizh
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 80% (current value is:{{ $value }})"

Of course, node_exporter must be set up beforehand in order to monitor node memory.
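
As a worked example of the arithmetic in the NodeMemoryUsage expression, the Java sketch below computes the same percentage from four byte counts. The class MemoryUsageSketch, the method usagePercent, and the sample figures are made up for illustration; real values come from the node_exporter metrics named in the rule.

```java
/** Hypothetical worked example of the memory usage formula used in memory_over.yml. */
final class MemoryUsageSketch {

    /** (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100 */
    static double usagePercent(long total, long free, long buffers, long cached) {
        return (total - (free + buffers + cached)) * 100.0 / total;
    }

    public static void main(String[] args) {
        long gib = 1L << 30; // bytes in one GiB

        // 8 GiB total, 1 GiB free, 0.5 GiB buffers, 1.5 GiB cached: 5 GiB in use
        double normal = usagePercent(8 * gib, gib, gib / 2, 3 * gib / 2);
        System.out.println(normal);      // 62.5
        System.out.println(normal > 80); // false: the rule's condition does not hold

        // 8 GiB total, 0.5 GiB free, 0.25 GiB buffers, 0.5 GiB cached: 6.75 GiB in use
        double high = usagePercent(8 * gib, gib / 2, gib / 4, gib / 2);
        System.out.println(high);        // 84.375
        System.out.println(high > 80);   // true: the alert fires if this lasts for 1m
    }
}
```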

Edit the Prometheus configuration file prometheus.yml to enable alerting and add the alert rule files:

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "node_down.yml"
  - "memory_over.yml"