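Contents

  • Architecture
  • Monitoring
  • Microservice Monitoring
  • Alerting
  • System Middleware Monitoring
  • Deployment and Rule Scripts
  • Grafana installation:
  • Prometheus installation:
  • Deleting a metric
  • Prometheus alerting rule configuration
  • Alertmanager alerting
  • node-exporter installation (server monitoring)
  • cAdvisor installation (Docker monitoring)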


Architecture

What architectural design does a general-purpose monitoring and alerting platform need?

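Monitoring

Choosing a monitoring system

Monitoring components

Anomaly notification

Design and practice of a monitoring system for financial scenarios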

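Microservice Monitoring

Getting started with Prometheus + Grafana as a Spring Boot monitoring platform

Monitoring in practice: best practices for monitoring Spring Boot applications

How to monitor and alert on a service cluster with Prometheus and Grafana?

Spring Boot health checks: how do they work with containers?

Spring Boot + Prometheus + Grafana: visual monitoring at a glance!

Spring Boot instrumentation-based monitoring

How to do business monitoring and behavior analysis for microservices? Try log instrumentation

A visual monitoring system, used with Spring Boot

Security and authentication configuration for Spring Boot Actuator monitoring endpoints

Since adopting Prometheus, I sleep soundly!

Monitoring powerhouse: troubleshooting Prometheus

Monitoring: Spring Boot + JVM + Druid + Prometheus + Grafana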

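Alerting

Tutorial: alert push notifications with Prometheus + Grafana + Alertmanager

芋道 Prometheus + Grafana + Alertmanager

PrometheusAlert: an open-source message forwarding system for operations alert centers

Prometheus alerting plugin Alertmanager, and writing a custom webhook example

Prometheus alerting rules

Prometheus alerting rule configuration

Prometheus alerting configuration: WeCom (企业微信) alerts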

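System Middleware Monitoring

Building a visual MySQL monitoring platform on Prometheus

Building a black-box monitoring system on Prometheus

Deployment and Rule Scripts

Installing Grafana with Docker

Grafana installation: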

docker stop grafana
docker rm grafana
docker run -d \
      --name=grafana \
      -p 3000:3000 \
      -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
      -e "GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource,raintank-worldping-app,grafana-piechart-panel" \
      -v /data0/grafana/data:/var/lib/grafana \
      -v /etc/localtime:/etc/localtime:ro \
      grafana/grafana

If the following error appears, run chmod 777 /data0/grafana/data to fix it:

GF_PATHS_DATA=‘/var/lib/grafana’ is not writable.
You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migration-from-a-previous-version-of-the-docker-container-to-5-1-or-later
mkdir: cannot create directory ‘/var/lib/grafana/plugins’: Permission denied
GF_PATHS_DATA=‘/var/lib/grafana’ is not writable.
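
As a narrower alternative to chmod 777, the data directory can be handed over to the user the container runs as. The sketch below assumes the official grafana/grafana image, which runs as UID 472 since version 5.1. The helper name prepare_grafana_dir is hypothetical, and the default path is the one mounted in the docker run command above. It needs to be run as root.

```shell
# Hypothetical helper: create the Grafana data directory and give it to UID 472,
# the user the official grafana/grafana image runs as (assumption: image >= 5.1).
prepare_grafana_dir() {
  dir=${1:-/data0/grafana/data}
  mkdir -p "$dir" && chown -R 472 "$dir"
}

# Usage (as root, before starting the container):
#   prepare_grafana_dir /data0/grafana/data
```

Changing only the owner keeps the directory closed to other local users, which chmod 777 does not.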

Prometheus installation:

docker pull prom/prometheus
docker run -d --name prometheus -p 9090:9090 prom/prometheus

cd /data0
mkdir prometheus
cd prometheus
mkdir data
chmod 777 -R data

docker cp prometheus:/etc/prometheus/prometheus.yml .

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
      labels:
        instance: prometheus

docker rm -f prometheus

docker run --name prometheus --network=host -d --restart=always \
      -v /data0/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
      -v /data0/prometheus/rules:/etc/prometheus/rules \
      -v /data0/prometheus/:/prometheus \
      -v /etc/localtime:/etc/localtime:ro \
      prom/prometheus \
      --web.enable-admin-api \
      --config.file=/etc/prometheus/prometheus.yml

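--web.enable-admin-api  # enables the admin HTTP API, which includes operations such as deleting time series
--storage.tsdb.retention=30d # retention period, 15 days by default (deprecated in newer releases in favor of --storage.tsdb.retention.time)
--storage.tsdb.path="/data/prometheus" # data storage path
--web.enable-lifecycle  # enables hot reload: a POST to localhost:9090/-/reload takes effect immediately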
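
When Prometheus is started with --web.enable-lifecycle, a configuration change can be applied without restarting the container. The following is a minimal sketch, assuming the container is named prometheus as above and that promtool is on the PATH inside the prom/prometheus image. The helper name reload_prometheus is hypothetical. It checks the file first, so syntax errors are reported before the reload is requested.

```shell
# Hypothetical helper: validate prometheus.yml inside the container, then hot-reload.
# Assumes the server was started with --web.enable-lifecycle.
reload_prometheus() {
  container=${1:-prometheus}
  base=${2:-http://localhost:9090}
  docker exec "$container" promtool check config /etc/prometheus/prometheus.yml \
    && curl -fsS -X POST "${base}/-/reload"
}

# Usage:
#   reload_prometheus                          # defaults shown above
#   reload_prometheus prometheus http://localhost:9090
```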

If the following error is reported:
Get http://10.41.24.3:9090/actuator/prometheus: context deadline exceeded
replace -p 9090:9090 in the start command with --net=host.

Deleting a metric

Once the admin API is enabled, the following syntax deletes all time series that match a label:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={kubernetes_name="redis"}'
The command above deletes the time series that carry the label kubernetes_name="redis".

To delete the data of a particular job or instance, use the following commands:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="kubernetes-service-endpoints"}'

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.244.2.158:9090"}'

To delete all data from Prometheus, use the following command:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'

Note that these API calls do not delete the data immediately: the data remains on disk and is cleaned up later.
To control when old data is removed, set the --storage.tsdb.retention flag (by default, Prometheus keeps data for 15 days).

Clean up the database (removes the deleted data from disk):
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
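
The delete_series endpoint also accepts optional start and end parameters (RFC 3339 or Unix timestamps) that limit the deletion to a time window. The sketch below builds such a URL. The helper name delete_series_url is hypothetical, and the selector and timestamps are illustrative values only.

```shell
# Hypothetical helper: build a delete_series URL for the TSDB admin API.
# Arguments: base URL, series selector, optional start, optional end.
delete_series_url() {
  base=$1
  selector=$2
  url="${base}/api/v1/admin/tsdb/delete_series?match[]=${selector}"
  if [ -n "${3:-}" ]; then url="${url}&start=$3"; fi
  if [ -n "${4:-}" ]; then url="${url}&end=$4"; fi
  printf '%s\n' "$url"
}

# Print the URL for deleting one day of data for an illustrative job named "demo".
delete_series_url 'http://localhost:9090' '{job="demo"}' \
  '2024-01-01T00:00:00Z' '2024-01-02T00:00:00Z'
```

The printed URL can be passed to curl in the same way as the commands above, for example curl -X POST -g "$(delete_series_url ...)".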

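Prometheus alerting rule configuration

Prometheus rule configuration scripts

Custom Prometheus monitoring: tracking real-time API call activity

Monitoring POST requests with Prometheus: API endpoint monitoring

Summary of Prometheus alerting rule configuration

Configuring Prometheus to monitor IP and port connectivity, GET and POST endpoint connectivity, and status codes

Common Prometheus alerting rule templates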

cd /data0/prometheus/rules

vi common_rule.yml

groups:
- name: common_rule
  rules:
  - alert: "Service down"
    expr: up == 0
    for: 60s
    labels:
      severity: critical
    annotations:
      summary: "Service is down!"
      description: "The service has been down for more than 60 seconds. Fix it now!"
  - alert: "Service down for 5 minutes"
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Service is down"
      description: "The service has been down for more than 5 minutes. Fix it now!"
  - alert: "Server heap usage too high"
    expr: round(sum(jvm_memory_used_bytes{area="heap"})*100/sum (jvm_memory_max_bytes{area="heap"}),0.01) >90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Heap usage too high"
      description:  "Current heap usage: {{ $value }}%"
  - alert: "Server CPU usage too high"
    expr: round(system_cpu_usage{ },0.0001)*100>50
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "CPU usage too high"
      description:  "Current CPU usage: {{ $value }}%"
  - alert: "API request timeout"
    expr: round(irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m]),0.01)>15
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.uri}}: request timed out"
      description:  "Request response time: {{ $value }}s"
  - alert: "API request exceptions"
    expr: increase(http_server_requests_seconds_count{exception!="None",exception!="BusinessException"}[1m]) > 5
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "API request exception alert within 1 minute"
      description: "Endpoint {{$labels.uri}} raised {{ $labels.exception }} exceptions; threshold: 5, current value: {{ $value }}"
  - alert: "Tomcat busy threads too high"
    expr: tomcat_threads_busy_threads{ } / tomcat_threads_config_max_threads{ } > 0.8
    for: 15s
    labels:
      severity: warning
    annotations:
      summary: "Ratio of busy Tomcat threads to the configured maximum"
      description: "The ratio of busy Tomcat threads to total threads exceeded the configured threshold of 80%; current value: {{ $value}}"

Alerting rules for custom business metrics emitted from application code
vi special_rule.yml

groups:
- name: special_rule
  rules:
  - alert: "Abnormal growth in return codes"
    expr: ceil(increase(return_total{code!="0"}[5m])) > 10
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.code}}-{{$labels.msg}}: increasing rapidly"
      description:  "{{$labels.code}}-{{$labels.msg}}: increase over the last 5 minutes: {{ $value }}"
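
Before an expression goes into a rule file, it can be useful to evaluate it once against the running server through the HTTP query API (/api/v1/query). The following is a minimal sketch, assuming Prometheus listens on localhost:9090 as in the rest of this page. The helper name try_expr is hypothetical.

```shell
# Hypothetical helper: evaluate a PromQL expression once via the HTTP query API.
# Arguments: the expression, and optionally the Prometheus base URL.
try_expr() {
  query_expr=$1
  base=${2:-http://localhost:9090}
  # -G sends the URL-encoded data as a query string on a GET request.
  curl -fsS -G "${base}/api/v1/query" --data-urlencode "query=${query_expr}"
}

# Usage:
#   try_expr 'up == 0'
#   try_expr 'ceil(increase(return_total{code!="0"}[5m])) > 10'
```

An empty result list in the JSON response means the condition does not currently hold for any series.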

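Alertmanager alerting

Setting up Prometheus monitoring alerts with a custom email template

Designing an alert noise-reduction system based on Alertmanager

Alertmanager installation (alerting)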

docker run -d -p 9093:9093 --name alertmanager  prom/alertmanager

docker cp alertmanager:/etc/alertmanager/alertmanager.yml /data0/alertmanager/alertmanager.yml

docker rm -f alertmanager

docker run -d -p 9093:9093 --name alertmanager -v /data0/alertmanager:/etc/alertmanager -v /etc/localtime:/etc/localtime:ro  prom/alertmanager

# Reload the configuration through the Alertmanager management API
curl -X POST http://127.0.0.1:9093/-/reload

# alertmanager.yml configuration
cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'mail.staff.s.com.cn:25'
  smtp_from: 'mail_alert@staff.s.com'
  smtp_auth_username: 'mail_alert'
  smtp_auth_password: '*******'
  smtp_require_tls: false
templates:
  - '/etc/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'bop'
receivers:
- name: 'bop'
  email_configs:
    - to: 'lixin37@staff.sina.com,tengyu5@staff.sina.com,luopeng1@staff.sina.com,zhaoyang10@staff.sina.com'
      html: '{{ template "email.to.html" . }}' # email body template
  webhook_configs:
    - url: 'http://10.182.29.9:8081/prometheusalert?type=wx&tpl=prometheus-wx'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
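
The configuration can be checked before it is reloaded. The sketch below assumes the container is named alertmanager as above and that amtool is on the PATH inside the prom/alertmanager image. The helper name check_alertmanager is hypothetical. It first validates the file, then asks amtool which receiver an alert labelled severity=critical would be routed to.

```shell
# Hypothetical helper: validate alertmanager.yml, then test routing for a sample label set.
check_alertmanager() {
  container=${1:-alertmanager}
  cfg=/etc/alertmanager/alertmanager.yml
  docker exec "$container" amtool check-config "$cfg" \
    && docker exec "$container" amtool config routes test \
         --config.file="$cfg" severity=critical
}

# Usage:
#   check_alertmanager
```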


# Other reference configuration
global:
  smtp_smarthost: 'smtp.163.com:25'  # 163 mail server
  smtp_from: 'XXX@163.com'        # sender address
  smtp_auth_username: 'XXX@163.com'  # sender username, i.e. your mailbox address
  smtp_auth_password: 'XXX'        # sender mailbox password

route:
  group_by: ['alertname']
  repeat_interval: 1h
  receiver: live-monitoring

receivers:
- name: 'live-monitoring'
  email_configs:
  - to: 'czh1226@qq.com'        # recipient address
 

# Edit the Prometheus configuration file prometheus.yml to enable alerting and add the alert rule files
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  #- "node_down.yml"
  #- "memory_over.yml"
  - rules/*.yml
 
# Contents of node_down.yml
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

# Contents of memory_over.yml
groups:
- name: example
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      user: caizh
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 80% (current value is:{{ $value }})"

# Restart Prometheus
docker run -d -p 9090:9090 --name=prometheus \
-v /data0/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /data0/prometheus/node_down.yml:/etc/prometheus/node_down.yml \
-v /data0/prometheus/memory_over.yml:/etc/prometheus/memory_over.yml \
prom/prometheus

Clicking "View in AlertManager" in an alert email opens a URL built from the Alertmanager host name, which can fail in various ways (for example, the page does not open).
Fix:
./alertmanager --web.external-url='http://192.168.2.4:9093/' ### set web.external-url explicitly here

Alertmanager alert template configuration:
cd /data0/alertmanager/
vi email.tmpl

{{ define "email.to.html" }}
{{ range .Alerts }}
  <table width="500px" style="border:1px solid grey; border-collapse:collapse; padding:2px;font-size:13px">
    <caption style="border:1px solid grey; border-collapse:collapse; padding:2px;font-weight:bold; font-size:15px">
      <a href="http://mon.bop.weibo.com/d/Gh9j_fD7z/wei-fu-wu-jian-kong?orgId=1&refresh=30s&var-application={{ .Labels.application}}&var-instance={{ .Labels.instance}}&from=now-1h&to=now">Alert {{ .Labels.job}}</a>
    </caption>
    <tr style="background-color:rgb(241,241,241)"> <td>Alerting job</td> <td>{{ .Labels.job}}</td></tr>
    <tr style="background-color:rgb(255,255,255)"> <td>Severity</td> <td>{{ .Labels.severity }}</td></tr>
    <tr style="background-color:rgb(241,241,241)"> <td>Alert type</td> <td>{{ .Labels.alertname }}</td></tr>
    <tr style="background-color:rgb(255,255,255)"> <td>Affected host</td> <td>{{ .Labels.instance }}</td></tr>
    <tr style="background-color:rgb(241,241,241)"> <td>Summary</td> <td><font style="color:red;">{{ .Annotations.summary }}</font></td></tr>
    <tr style="background-color:rgb(255,255,255)"> <td>Details</td> <td>{{ .Annotations.description}}</td></tr>
    <tr style="background-color:rgb(241,241,241)"> <td>Triggered at</td> <td>{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}</td></tr>
  </table>
  <br/>
  <br/>
{{ end }}
{{ end }}
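
The constant 28800e9 passed to .StartsAt.Add in the template is a duration in nanoseconds, the unit of Go's time.Duration. The small arithmetic sketch below (the helper name ns_to_hours is hypothetical) shows that it amounts to 8 hours, so the template shifts the UTC start time to UTC+8.

```shell
# Hypothetical helper: convert a nanosecond offset into whole hours.
ns_to_hours() {
  echo $(( $1 / 1000000000 / 3600 ))
}

# 28800e9 ns written out in full; prints 8.
ns_to_hours 28800000000000
```

For another time zone, the constant would be the offset in seconds followed by e9, for example 3600e9 for UTC+1.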

Alertmanager notification templates

Prometheus Alertmanager alert templates

node-exporter installation (server monitoring)

docker run -d \
        --name node-exporter \
        -v "/proc:/host/proc" \
        -v "/sys:/host/sys" \
        -v "/:/rootfs" \
        --net="host" \
        prom/node-exporter \
        --path.procfs=/host/proc \
        --path.sysfs=/host/sys \
        --path.rootfs=/rootfs

# Update the Prometheus configuration (add under scrape_configs)
  - job_name: 'node1'
    static_configs:
    - targets: ['192.168.28.130:9100']
      labels:
        env: test
        name: node1
        instance: 192.168.28.130
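
Before waiting for the first scrape, the exporter can be queried directly to confirm that it serves metrics. The following is a minimal sketch, assuming the exporter is reachable at the target address used in the job above. The helper name count_node_metrics is hypothetical.

```shell
# Hypothetical helper: count the node_* sample lines served by a node-exporter target.
count_node_metrics() {
  target=${1:-192.168.28.130:9100}
  # Comment lines (# HELP, # TYPE) start with '#', so they are not counted.
  curl -fsS "http://${target}/metrics" | grep -c '^node_'
}

# Usage:
#   count_node_metrics 192.168.28.130:9100
```

A count of zero, or a curl error, suggests that the exporter is not running or that the port is blocked.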

cAdvisor installation (Docker monitoring)

docker run \
	-v /var/run:/var/run:rw \
	-v /sys:/sys:ro \
	-v /var/lib/docker:/var/lib/docker:ro \
	-p 8080:8080 \
	-d --name cadvisor google/cadvisor
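
cAdvisor serves Prometheus metrics under /metrics on its web port (8080 in the command above), so Prometheus needs a scrape job for it, just as for node-exporter. The sketch below prints such a job in the same layout as the node1 entry. The helper name scrape_job is hypothetical, and the host address is an assumption reused from the node1 example.

```shell
# Hypothetical helper: print a scrape job entry for the scrape_configs section of prometheus.yml.
# Arguments: job name, target as host:port.
scrape_job() {
  cat <<EOF
  - job_name: '$1'
    static_configs:
    - targets: ['$2']
EOF
}

# Assumed host address; replace it with the machine that runs cAdvisor.
scrape_job cadvisor 192.168.28.130:8080
```

The printed entry goes under scrape_configs in /data0/prometheus/prometheus.yml, after which Prometheus has to be reloaded or restarted.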