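Contents
- Architecture
- Monitoring
- Microservice monitoring
- Alerting
- System middleware monitoring
- Deployment and rule scripts
- Grafana installation
- Prometheus installation
- Deleting a metric
- Prometheus alerting rule configuration
- Alertmanager alerting
- node-exporter installation (server monitoring)
- cAdvisor installation (Docker monitoring)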
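Architecture
Building a general-purpose monitoring and alerting platform: which architectural designs are needed
Monitoring
Microservice monitoring
Spring Boot monitoring platform: getting started with Prometheus + Grafana
How to monitor and alert on a service cluster with Prometheus and Grafana
Spring Boot + Prometheus + Grafana: visual monitoring at a glance
A visual monitoring system used together with Spring Boot
Security and authentication configuration for the Spring Boot Actuator monitoring endpoints
Monitoring: Spring Boot + JVM + Druid + Prometheus + Grafana
Alerting
Tutorial: alert push notifications with Prometheus + Grafana + Alertmanager
Yudao (芋道): Prometheus + Grafana + Alertmanager
PrometheusAlert: an open-source message forwarding system for an operations alert center
Prometheus alerting with Alertmanager: writing a custom webhook example
Prometheus alerting configuration: WeCom (企业微信) alerts
System middleware monitoring
Deployment and rule scripts
Installing Grafana with Docker
Grafana installation: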
docker stop grafana
docker rm grafana
docker run -d \
--name=grafana \
-p 3000:3000 \
-e "GF_SECURITY_ADMIN_PASSWORD=admin" \
-e "GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource,raintank-worldping-app,grafana-piechart-panel" \
-v /data0/grafana/data:/var/lib/grafana \
-v /etc/localtime:/etc/localtime:ro \
grafana/grafana
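If the following error appears, fix it with chmod 777 /data0/grafana/data (a narrower alternative is chown -R 472 /data0/grafana/data, since the official image runs Grafana as UID 472):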
GF_PATHS_DATA=‘/var/lib/grafana’ is not writable.
You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migration-from-a-previous-version-of-the-docker-container-to-5-1-or-later
mkdir: cannot create directory ‘/var/lib/grafana/plugins’: Permission denied
GF_PATHS_DATA=‘/var/lib/grafana’ is not writable.
Prometheus installation:
docker pull prom/prometheus
docker run -d --name prometheus -p 9090:9090 prom/prometheus
cd /data0
mkdir prometheus
cd prometheus
mkdir data
chmod 777 -R data
docker cp prometheus:/etc/prometheus/prometheus.yml .
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
docker rm -f prometheus
docker run --name prometheus --network=host -d --restart=always \
  -v /data0/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /data0/prometheus/rules:/etc/prometheus/rules \
  -v /data0/prometheus/:/prometheus \
  -v /etc/localtime:/etc/localtime:ro \
  prom/prometheus \
  --web.enable-admin-api \
  --config.file=/etc/prometheus/prometheus.yml
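--web.enable-admin-api # enables the admin HTTP API, which includes operations such as deleting time series
--storage.tsdb.retention.time=30d # retention period, 15 days by default (the older --storage.tsdb.retention flag is deprecated)
--storage.tsdb.path="/data/prometheus" # data storage path
--web.enable-lifecycle # enables hot reload: an HTTP POST to localhost:9090/-/reload applies the new configuration immediately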
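The reload endpoint answers only POST (or PUT) requests, so opening the URL in a browser does not reload anything. A minimal Python sketch of the call, assuming Prometheus listens on localhost:9090 and was started with --web.enable-lifecycle; the helper names `build_reload_request` and `reload_prometheus` are illustrative and not part of any Prometheus client library:

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the POST request for the lifecycle reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urllib raises HTTPError for non-2xx answers, which is what happens
    when the lifecycle API is not enabled.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling `reload_prometheus()` after editing prometheus.yml or a rule file should return 200 when the new configuration was accepted.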
If the following error is reported:
Get http://10.41.24.3:9090/actuator/prometheus: context deadline exceeded
then replace -p 9090:9090 with --net=host in the start command.
Deleting a metric
Once the admin API is enabled, the following syntax deletes all time series that match a given label:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={kubernetes_name="redis"}'
The command above deletes every time series that carries the label kubernetes_name="redis".
To delete the series of a particular job or instance, use:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="kubernetes-service-endpoints"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.244.2.158:9090"}'
To delete all data from Prometheus:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
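Note that these API calls do not delete the data immediately. The data still exists on disk and is removed during a later compaction.
To control when old data is removed, set --storage.tsdb.retention.time (by default Prometheus keeps data for 15 days).
To remove the deleted data from disk right away, clean the tombstones: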
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
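The curl calls above rely on -g to stop curl from interpreting the brackets and braces. When the request is built from a script, percent-encoding the selector avoids that issue altogether. A small sketch, assuming the admin API is enabled on localhost:9090; `delete_series_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode


def delete_series_url(selectors, base_url="http://localhost:9090", start=None, end=None):
    """Return the admin API URL that deletes the series matching the selectors.

    selectors: iterable of series selectors such as '{job="demo"}'.
    start/end: optional RFC 3339 or Unix timestamps that limit the deleted range.
    """
    selectors = list(selectors)
    if not selectors:
        raise ValueError("at least one match[] selector is required")
    params = [("match[]", s) for s in selectors]
    if start is not None:
        params.append(("start", str(start)))
    if end is not None:
        params.append(("end", str(end)))
    return base_url.rstrip("/") + "/api/v1/admin/tsdb/delete_series?" + urlencode(params)


print(delete_series_url(['{kubernetes_name="redis"}']))
```

The resulting URL still has to be sent with POST, for example with `curl -X POST "<url>"`.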
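Prometheus alerting rule configuration
Custom Prometheus monitoring: tracking real-time calls to an interface
Monitoring POST requests with Prometheus: API monitoring
Configuring Prometheus to monitor IP and port connectivity, GET and POST connectivity, and status codes
Common Prometheus alerting rule templates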
cd /data0/prometheus/rules
vi common_rule.yml
groups:
  - name: common_rule
    rules:
      - alert: "ServiceDown"
        expr: up == 0
        for: 60s
        labels:
          severity: critical
        annotations:
          summary: "Service is down!"
          description: "The service has been down for more than 60 seconds. Fix it now!"
      - alert: "ServiceDown5m"
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "The service has been down for more than 5 minutes. Fix it now!"
      - alert: "HeapUsageTooHigh"
        expr: round(sum(jvm_memory_used_bytes{area="heap"})*100/sum (jvm_memory_max_bytes{area="heap"}),0.01) >90
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Heap usage is too high"
          description: "Current heap usage: {{ $value }}%"
      - alert: "CpuUsageTooHigh"
        expr: round(system_cpu_usage{ },0.0001)*100>50
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is too high"
          description: "Current CPU usage: {{ $value }}%"
      - alert: "RequestTimeout"
        expr: round(irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m]),0.01)>15
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.uri}}: request timed out"
          description: "Response time: {{ $value }}s"
      - alert: "RequestException"
        expr: increase(http_server_requests_seconds_count{exception!="None",exception!="BusinessException"}[1m]) > 5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Request exceptions within 1 minute"
          description: "Endpoint {{$labels.uri}} raised {{ $labels.exception }}; threshold: 5, current value: {{ $value }}"
      - alert: "TomcatBusyThreadsTooHigh"
        expr: tomcat_threads_busy_threads{ } / tomcat_threads_config_max_threads{ } > 0.8
        for: 15s
        labels:
          severity: warning
        annotations:
          summary: "Ratio of busy Tomcat threads to max threads"
          description: "The ratio of busy Tomcat threads to max threads exceeded the 80% threshold; current value: {{ $value }}"
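Every rule in this file has a `for` clause, which delays the notification: the expression has to stay true on each evaluation for the whole duration before the alert moves from pending to firing. The following Python sketch simulates that lifecycle for a single alert. It is a simplified model written for these notes (it ignores options such as keep_firing_for and state restoration after a restart), and `alert_states` is a hypothetical name:

```python
def alert_states(evaluations, hold_seconds):
    """Simulate the inactive -> pending -> firing lifecycle of one alert.

    evaluations: chronological list of (timestamp_seconds, condition_is_true)
    pairs, one per rule evaluation.
    hold_seconds: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None  # timestamp of the first evaluation in the current true streak
    for ts, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= hold_seconds else "pending")
    return states


# `up == 0` evaluated every 15 s against a rule with `for: 60s`.
samples = [(0, False), (15, True), (30, True), (45, True), (60, True), (75, True), (90, False)]
print(alert_states(samples, hold_seconds=60))
```

A short `for` such as 15s or 30s therefore only filters out failures that last a single evaluation or two; with a 15 s evaluation interval the alert in the example needs five consecutive failing evaluations before it fires.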
Custom business alerting rules for Prometheus, driven from application code
vi special_rule.yml
groups:
  - name: special_rule
    rules:
      - alert: "ReturnCodeIncreaseAbnormal"
        expr: ceil(increase(return_total{code!="0"}[5m])) > 10
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.code}}-{{$labels.msg}}: increasing rapidly"
          description: "{{$labels.code}}-{{$labels.msg}}: increase over 5 minutes: {{ $value }}"
Alertmanager alerting
Designing an alert noise-reduction system based on Alertmanager
Alertmanager installation (alerting)
docker run -d -p 9093:9093 --name alertmanager prom/alertmanager
mkdir -p /data0/alertmanager # docker cp does not create a missing target directory
docker cp alertmanager:/etc/alertmanager/alertmanager.yml /data0/alertmanager/alertmanager.yml
docker rm -f alertmanager
docker run -d -p 9093:9093 --name alertmanager -v /data0/alertmanager:/etc/alertmanager -v /etc/localtime:/etc/localtime:ro prom/alertmanager
# Reload the configuration through the Alertmanager management API
curl -X POST http://127.0.0.1:9093/-/reload
# alertmanager.yml configuration
cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'mail.staff.s.com.cn:25'
  smtp_from: 'mail_alert@staff.s.com'
  smtp_auth_username: 'mail_alert'
  smtp_auth_password: '*******'
  smtp_require_tls: false
templates:
  - '/etc/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'bop'
receivers:
  - name: 'bop'
    email_configs:
      - to: 'lixin37@staff.sina.com,tengyu5@staff.sina.com,luopeng1@staff.sina.com,zhaoyang10@staff.sina.com'
        html: '{{ template "email.to.html" . }}' # content template for the email body
    webhook_configs:
      - url: 'http://10.182.29.9:8081/prometheusalert?type=wx&tpl=prometheus-wx'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
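The inhibit rule mutes a warning alert while a critical alert with the same alertname, dev and instance labels is firing. The Python sketch below models that decision for a single rule. It is a simplification for illustration, not Alertmanager's implementation, and `is_inhibited` is a hypothetical name. Labels missing on both sides count as equal, which is how Alertmanager treats empty values:

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` is muted by one of the firing source alerts.

    Alerts are dicts of label name -> value. A missing label is treated
    like an empty value when comparing.
    """
    def matches(labels, matchers):
        return all(labels.get(name, "") == value for name, value in matchers.items())

    if not matches(target, target_match):
        return False
    return any(
        matches(source, source_match)
        and all(source.get(name, "") == target.get(name, "") for name in equal)
        for source in firing_alerts
        if source is not target
    )


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "ServiceDown", "instance": "10.0.0.1:8080", "severity": "critical"}
warning = {"alertname": "ServiceDown", "instance": "10.0.0.1:8080", "severity": "warning"}
print(is_inhibited(warning, [critical], **rule))
```

Two consequences follow from the matching logic: alerts with different alertname values never inhibit each other under this rule, and a target whose severity label is not exactly `warning` is never muted.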
# Other reference configuration
global:
  smtp_smarthost: 'smtp.163.com:25' # 163 mail server
  smtp_from: 'XXX@163.com' # sender address
  smtp_auth_username: 'XXX@163.com' # sender user name, i.e. your own mailbox
  smtp_auth_password: 'XXX' # sender password
route:
  group_by: ['alertname']
  repeat_interval: 1h
  receiver: live-monitoring
receivers:
  - name: 'live-monitoring'
    email_configs:
      - to: 'czh1226@qq.com' # recipient address
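To check that a receiver actually delivers mail without waiting for a real rule to fire, a synthetic alert can be posted to Alertmanager's v2 API. A minimal sketch, assuming Alertmanager is reachable on localhost:9093; the alert name SyntheticTest and the helper `build_test_alert_request` are made up for this example:

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert_request(alertmanager_url="http://localhost:9093", labels=None):
    """Build a POST request that submits one synthetic alert to /api/v2/alerts."""
    alert = {
        "labels": {"alertname": "SyntheticTest", "severity": "warning", **(labels or {})},
        "annotations": {"summary": "Synthetic alert for testing receivers"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    body = json.dumps([alert]).encode("utf-8")  # the endpoint expects a JSON array
    return urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_alert_request(labels={"instance": "demo:8080"})
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(request)` submits it. The notification then arrives after group_wait, and the alert resolves on its own once resolve_timeout has passed without it being posted again.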
# Edit the Prometheus configuration file prometheus.yml to enable alerting and add the alerting rule files
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  #- "node_down.yml"
  #- "memory_over.yml"
  - rules/*.yml
# Contents of node_down.yml
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          user: caizh
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
# Contents of memory_over.yml
groups:
  - name: example
    rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          user: caizh
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 80% (current value is:{{ $value }})"
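The memory expression treats buffers and page cache as available memory, so "used" means total minus free, buffers and cached. A worked example of the same arithmetic in Python, with made-up sample values standing in for the four node-exporter metrics:

```python
def memory_usage_percent(total, free, buffers, cached):
    """Used memory as a percentage of total, mirroring the rule expression."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - (free + buffers + cached)) / total * 100


GIB = 1024 ** 3
usage = memory_usage_percent(total=16 * GIB, free=1 * GIB, buffers=1 * GIB, cached=2 * GIB)
print(f"{usage:.1f}% used, alert condition met: {usage > 80}")
```

With these sample values 12 of 16 GiB count as used, which is 75% and stays below the 80% threshold.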
# Restart Prometheus (remove the existing container first so the name can be reused)
docker rm -f prometheus
docker run -d -p 9090:9090 --name=prometheus \
  -v /data0/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /data0/prometheus/node_down.yml:/etc/prometheus/node_down.yml \
  -v /data0/prometheus/memory_over.yml:/etc/prometheus/memory_over.yml \
  prom/prometheus
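Clicking "View in AlertManager" in an alert email opens a URL built from the Alertmanager host name, which can fail in various ways (for example, the page cannot be opened).
Fix:
./alertmanager --web.external-url='http://192.168.2.4:9093/' # set web.external-url explicitly
Alertmanager alert template configuration: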
cd /data0/alertmanager/
vi email.tmpl
{{ define "email.to.html" }}
{{ range .Alerts }}
<table width="500px" style="border:1px solid grey; border-collapse:collapse; padding:2px;font-size:13px">
<caption style="border:1px solid grey; border-collapse:collapse; padding:2px;font-weight:bold; font-size:15px">
<a href="http://mon.bop.weibo.com/d/Gh9j_fD7z/wei-fu-wu-jian-kong?orgId=1&refresh=30s&var-application={{ .Labels.application}}&var-instance={{ .Labels.instance}}&from=now-1h&to=now">Alert {{ .Labels.job}}</a>
</caption>
<tr style="background-color:rgb(241,241,241)"> <td>Alerting job</td> <td>{{ .Labels.job}}</td></tr>
<tr style="background-color:rgb(255,255,255)"> <td>Severity</td> <td>{{ .Labels.severity }}</td></tr>
<tr style="background-color:rgb(241,241,241)"> <td>Alert name</td> <td>{{ .Labels.alertname }}</td></tr>
<tr style="background-color:rgb(255,255,255)"> <td>Affected host</td> <td>{{ .Labels.instance }}</td></tr>
<tr style="background-color:rgb(241,241,241)"> <td>Summary</td> <td><font style="color:red;">{{ .Annotations.summary }}</font></td></tr>
<tr style="background-color:rgb(255,255,255)"> <td>Details</td> <td>{{ .Annotations.description}}</td></tr>
<tr style="background-color:rgb(241,241,241)"> <td>Triggered at</td> <td>{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}</td></tr>
</table>
<br/>
<br/>
{{ end }}
{{ end }}
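The timestamp row of the template adds 28800e9 to StartsAt before formatting. Go durations are counted in nanoseconds, so this shifts the UTC timestamp by 28800 seconds, or 8 hours, to display it in UTC+8. The layout string "2006-01-02 15:04:05" is Go's reference-time notation for year-month-day hour:minute:second. The same conversion sketched in Python, with `format_starts_at` as an illustrative name:

```python
from datetime import datetime, timedelta, timezone


def format_starts_at(starts_at_utc: datetime, offset_ns: float = 28800e9) -> str:
    """Shift a UTC timestamp by offset_ns nanoseconds and format it.

    28800e9 ns = 28800 s = 8 h, i.e. UTC to UTC+8.
    """
    shifted = starts_at_utc + timedelta(seconds=offset_ns / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


print(format_starts_at(datetime(2024, 1, 1, 20, 30, 0, tzinfo=timezone.utc)))
```

An alert that started at 20:30 UTC on 1 January is therefore shown as 04:30 on 2 January. The offset is fixed, so the template needs a different constant for any other time zone.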
Alertmanager notification templates
Prometheus Alertmanager alert templates
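node-exporter installation (server monitoring)
docker run -d \
  --name node-exporter \
  -v "/proc:/host/proc" \
  -v "/sys:/host/sys" \
  -v "/:/rootfs" \
  --net="host" \
  prom/node-exporter \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys \
  --path.rootfs=/rootfs
# -p 9100:9100 is omitted because published ports are ignored with --net="host"; the exporter listens on host port 9100 directly.
# The --path.* flags point the exporter at the mounted host directories; without them the mounts are not used.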
# Edit the Prometheus configuration (add under scrape_configs)
  - job_name: 'node1'
    static_configs:
      - targets: ['192.168.28.130:9100']
        labels:
          env: test
          name: node1
          instance: 192.168.28.130
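After the job is added and the configuration reloaded, the new target should be reported as healthy. Besides the Targets page in the web UI, the same information is available from GET /api/v1/targets. The sketch below filters such a response for targets that are not up. The sample payload is trimmed and its values are invented, and `unhealthy_targets` is a hypothetical helper:

```python
def unhealthy_targets(targets_response: dict) -> list:
    """Return (job, instance, lastError) for every active target that is not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job", ""), t["labels"].get("instance", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


# Trimmed example of a /api/v1/targets response (values are made up).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node1", "instance": "192.168.28.130"}, "health": "up", "lastError": ""},
            {"labels": {"job": "app", "instance": "10.41.24.3:9090"}, "health": "down",
             "lastError": "context deadline exceeded"},
        ]
    },
}
print(unhealthy_targets(sample))
```

A lastError of "context deadline exceeded" means the scrape timed out, which in a Docker setup often points at a network mode or port mapping problem rather than at the exporter itself.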
cAdvisor installation (Docker monitoring)
docker run \
-v /var/run:/var/run:rw \
-v /sys:/sys:ro \
-v /var/lib/docker:/var/lib/docker:ro \
-p 8080:8080 \
-d --name cadvisor google/cadvisor