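Meet the guardians behind the business: real business scenarios that show what operations engineering can do!

Many blogs by experienced engineers cover technology in isolation, without building it around a business scenario, so operations engineers who hit a problem often cannot find a solution that fits. I therefore want to share real business scenarios, so that we can discuss business problems together, improve our skills quickly, and advance our careers faster.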

-----------------------Main text begins-----------------------

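Once prometheus+grafana is set up, host monitoring data can be visualized in grafana. When a metric crosses its configured threshold, alertmanager+prometheus-webhook-dingtalk can push the alert to a DingTalk group, so that operations staff notice it and handle it.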

Prometheus rule configuration:

groups:

- name: consul-datacenter-node-alert
  rules:
  - alert: "Exporter not running or host down"
    expr: up{job =~ "node-nginx"} == 0
    for: 30s
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "Job: {{ $labels.job }}, Instance: {{ $labels.instance }}, Role: {{ $labels.role }}: the exporter is not running or the host is down, current value: {{ $value }}"



  - alert: "CPU usage too high"
    expr: (100 - (avg by(job,instance,role) (irate(node_cpu_seconds_total{job =~ "node-nginx",mode="idle"}[5m])) * 100)) > 80
    for: 2m
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "{{ $labels.instance }} ({{ $labels.role }}) CPU usage is above 80%, current value: {{ $value }}"


  - alert: "Memory usage too high"
    expr: ((node_memory_MemTotal_bytes{job =~ "node-nginx"} - node_memory_MemAvailable_bytes{job =~ "node-nginx"}) / node_memory_MemTotal_bytes{job =~ "node-nginx"}) * 100 > 73
    for: 2m
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "{{ $labels.instance }} ({{ $labels.role }}) memory usage is above 73%, current value: {{ $value }}"

  - alert: "Disk usage too high (>=80%)"
    expr: 100 - ((node_filesystem_avail_bytes{fstype=~"ext4|xfs|nfs|nfs4", job =~ "node-nginx"} * 100) / node_filesystem_size_bytes{fstype=~"ext4|xfs|nfs|nfs4", job =~ "node-nginx"}) >= 80
    for: 2m
    labels:
      env: production
      app: datacenter-linux
    annotations:
      description: "{{ $labels.instance }} ({{ $labels.role }}) usage of partition {{ $labels.mountpoint }} is at or above 80%, current value: {{ $value }}"
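
The threshold expressions in these rules share one pattern: turn raw readings into a percentage, then compare it with a limit. As a rough illustration only, the sketch below redoes the disk-usage arithmetic in shell with awk. The function names `usage_pct` and `at_or_above` and the sample numbers are made up for this example and are not part of Prometheus.

```shell
#!/bin/bash
# usage_pct AVAILABLE TOTAL
# Mirrors "100 - (avail * 100) / size" from the disk rule and prints the
# used percentage with one decimal place.
usage_pct() {
  awk -v avail="$1" -v total="$2" \
    'BEGIN { printf "%.1f\n", 100 - (avail * 100) / total }'
}

# at_or_above VALUE THRESHOLD
# Exit status 0 when VALUE >= THRESHOLD, i.e. when the rule condition holds.
at_or_above() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v >= t) }'
}

# Hypothetical sample: 20 GB available on a 100 GB filesystem.
used=$(usage_pct 20 100)
if at_or_above "$used" 80; then
  echo "disk usage ${used}% meets the >= 80 condition"
fi
```

In Prometheus the condition additionally has to hold for the whole `for:` duration before the alert fires; the sketch ignores that part.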


alertmanager:

1. Download alertmanager:

https://github.com/prometheus/alertmanager/tags

2. Create a restart script for alertmanager:

vim restart.sh
#!/bin/bash
# Find the running alertmanager daemon by its command line (pgrep never matches itself)
pidnum=$(pgrep -f 'alertmanager --config.file')
# Kill only when a PID was found; "kill -9" with no argument is an error
[ -n "${pidnum}" ] && kill -9 ${pidnum}
nohup /opt/jiankong/alertmanager/alertmanager --config.file=/opt/jiankong/alertmanager/dingtalk.yml --storage.path=/opt/jiankong/alertmanager/alertmanager_data --web.external-url=http://192.168.210.75:9093 &
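
The restart script above stops alertmanager with `kill -9`, which gives it no chance to shut down cleanly. One possible alternative, sketched here under the assumption that the start command records the PID in a file, is to send SIGTERM first and escalate to SIGKILL only if the process is still alive after a grace period. The function name `stop_from_pidfile` and any PID file path are hypothetical and not part of the original setup.

```shell
#!/bin/bash
# stop_from_pidfile PIDFILE [TIMEOUT_SECONDS]
# Sends SIGTERM to the PID stored in PIDFILE, waits up to TIMEOUT_SECONDS
# (default 10) for it to exit, then falls back to SIGKILL.
stop_from_pidfile() {
  local pidfile="$1" timeout="${2:-10}" pid waited=0
  [ -f "$pidfile" ] || return 0            # no PID file: nothing to stop
  pid=$(cat "$pidfile")
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    kill -TERM "$pid"
    # Poll once per second until the process is gone or the timeout expires.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
      sleep 1
      waited=$((waited + 1))
    done
    # Still alive after the grace period: force it.
    if kill -0 "$pid" 2>/dev/null; then
      kill -KILL "$pid"
    fi
  fi
  rm -f "$pidfile"
}
```

To use it, the start line would need to be followed by something like `echo $! > alertmanager.pid` (a hypothetical file name), and the restart script would call `stop_from_pidfile alertmanager.pid` before starting the new process.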

3. Check the alertmanager configuration:

./amtool check-config dingtalk.yml
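
Since a broken configuration would leave alertmanager unable to start after a restart, it can help to run the check and the restart as one step. The following is a minimal sketch of that idea, not part of the original setup: the function name `check_then_restart` and the variables `AMTOOL` and `RESTART_SCRIPT` are assumptions introduced here, with defaults that point at the files used in this article.

```shell
#!/bin/bash
# Paths are overridable through the environment; the defaults assume the
# script is run from the alertmanager directory.
AMTOOL="${AMTOOL:-./amtool}"
RESTART_SCRIPT="${RESTART_SCRIPT:-./restart.sh}"

# check_then_restart CONFIG_FILE
# Runs "amtool check-config" on the file and restarts alertmanager only
# when the check succeeds; otherwise it reports the failure and returns 1.
check_then_restart() {
  local config="$1"
  if "$AMTOOL" check-config "$config"; then
    "$RESTART_SCRIPT"
  else
    echo "config check failed, not restarting: $config" >&2
    return 1
  fi
}
```

A typical call would be `check_then_restart dingtalk.yml`.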

4. alertmanager configuration:

[root@localhost alertmanager]# cat dingtalk.yml
global:
  resolve_timeout: 10m

#templates:
#- './config/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 5m
  #repeat_interval: 5s
  receiver: 'default'
  routes:

  # Route to the alert test group
  - match:
      app: "datacenter-linux"
    receiver: 'webhook-dingtalk-alter-test'
  # Route to the second alert test group
  - match:
      app: "blackbox_icmp"
    receiver: 'blackbox-dingtalk-alter-test'
inhibit_rules:
- source_match:

receivers:

- name: 'webhook-dingtalk-alter-test'
  webhook_configs:
  - send_resolved: true
    # Temporary: alert test group
    url: http://192.168.210.75:8060/dingtalk/webhook1/send

- name: 'blackbox-dingtalk-alter-test'
  webhook_configs:
  - send_resolved: true
    # Temporary: second alert group
    url: http://192.168.210.75:8060/dingtalk/webhook2/send
- name: 'default'
  webhook_configs:

  # Temporary: alert test group
  - url: http://192.168.210.75:8060/dingtalk/webhook1/send
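
To check that the `app` label really selects the intended receiver, one option is to post a hand-made alert to Alertmanager's v2 API and watch which DingTalk group receives it. The sketch below only builds the JSON payload. The function name `build_test_alert`, the alert name `RoutingTest`, and the instance `test-host:9100` are placeholders invented for this example.

```shell
#!/bin/bash
# build_test_alert APP [INSTANCE]
# Prints a one-element JSON array in the shape accepted by
# POST /api/v2/alerts, carrying the given "app" label so that the
# route's match on app can be exercised.
build_test_alert() {
  local app="$1" instance="${2:-test-host:9100}"
  printf '[{"labels":{"alertname":"RoutingTest","app":"%s","instance":"%s"},"annotations":{"description":"manual routing test"}}]\n' \
    "$app" "$instance"
}

# Example of sending it (uses the alertmanager address from this article):
#   build_test_alert datacenter-linux |
#     curl -s -XPOST -H 'Content-Type: application/json' -d @- \
#       http://192.168.210.75:9093/api/v2/alerts
```

With the routes above, an alert labeled `app="datacenter-linux"` should arrive via `webhook1`, and one labeled `app="blackbox_icmp"` via `webhook2`; anything else falls through to the `default` receiver.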

prometheus-webhook-dingtalk

1. Download prometheus-webhook-dingtalk:

https://github.com/timonwong/prometheus-webhook-dingtalk/releases/tag

2. prometheus-webhook-dingtalk configuration:

[root@localhost webhook-dingtalk-2.0.0]# cat config-message.yml

#timeout: 5s
#no_builtin_template: true
templates:
  - /opt/jiankong/webhook-dingtalk-2.0.0/contrib/templates/alertmanager-dingtalk-message.tmpl
  - /opt/jiankong/webhook-dingtalk-2.0.0/contrib/templates/blackbox-icmp-message.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=28b5xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    secret: SEC2bc3fb99xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxecbbd1273d1bb6
    mention:
      mobiles: ['173xxxxxx56']
    message:
      # Select the template by name here (kept outside the block scalar so it is not sent as message text)
      text: |
        {{ template "linux.message" . }}
        @173xxxxxx56
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=dad50cxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    secret: SEC4da8xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxe64ffbdbca044cfe1d678
    mention:
      mobiles: ['173xxxxxx56']
    message:
      # Select the template by name here (kept outside the block scalar so it is not sent as message text)
      text: |
        {{ template "email.to.message" . }}
        @173xxxxxx56

3. prometheus-webhook-dingtalk template configuration:

[root@localhost webhook-dingtalk-2.0.0]# cat contrib/templates/alertmanager-dingtalk-message.tmpl
{{ define "linux.message" }}

{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}

== = **Linux VM alert** = ==

**Alert source:** Alertmanager
**Alert type:** {{ $alert.Labels.alertname }}
**Affected host:** {{ $alert.Labels.instance }}
**Alert details:** <font color=#ff0000> {{ $alert.Annotations.message }}{{ $alert.Annotations.description}}</font>
**Start time:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Alert status:** <font color=#ff0000> {{ .Status }}</font>

== = **end** = ==
{{- end }}
{{- end }}


{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}

==== = **Linux VM alert resolved** = ====
**Alert source:** Alertmanager
**Alert type:** {{ $alert.Labels.alertname }}
**Affected host:** {{ $alert.Labels.instance }}
**Alert details:** <font color=#00ff00>{{ $alert.Annotations.message }}{{ $alert.Annotations.description}} </font>
**Start time:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Resolved time:** {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
**Alert status:** <font color=#00ff00> {{ .Status }} </font>

======= = **end** = =======
{{- end }}
{{- end }}
{{- end }}
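
The template shifts each timestamp with `.Add 28800e9`. `Add` takes a duration in nanoseconds, and 28800e9 ns is 28,800 seconds, i.e. 8 hours, so a timestamp that is typically reported in UTC is displayed as UTC+8. The same shift can be reproduced in shell as a quick sanity check. This is only a sketch that assumes GNU `date` (for `-d @EPOCH`); the function name `to_utc8` and the variable names are made up here.

```shell
#!/bin/bash
NS_PER_SECOND=1000000000
OFFSET_NS=28800000000000                       # 28800e9, the value used in the template
OFFSET_SECONDS=$((OFFSET_NS / NS_PER_SECOND))  # 28800 seconds = 8 hours

# to_utc8 EPOCH_SECONDS
# Prints the given UTC epoch time shifted by 8 hours, in the same
# "YYYY-MM-DD HH:MM:SS" layout the template produces.
to_utc8() {
  local epoch="$1"
  date -u -d "@$((epoch + OFFSET_SECONDS))" '+%Y-%m-%d %H:%M:%S'
}
```

For example, `to_utc8 0` prints `1970-01-01 08:00:00`, which is midnight UTC shown as 08:00 in UTC+8.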

4. prometheus-webhook-dingtalk restart script:

[root@localhost webhook-dingtalk-2.0.0]# cat restart.sh
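#!/bin/bash
# Find the running prometheus-webhook-dingtalk daemon by its command line (pgrep never matches itself)
pidnum=$(pgrep -f 'prometheus-webhook-dingtalk --web.listen-address')
# Kill only when a PID was found; "kill -9" with no argument is an error
[ -n "${pidnum}" ] && kill -9 ${pidnum}
nohup /opt/jiankong/webhook-dingtalk-2.0.0/prometheus-webhook-dingtalk --web.listen-address=:8060 --web.enable-ui --config.file=/opt/jiankong/webhook-dingtalk-2.0.0/config-message.yml &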

----------------------------End of article-------------------------
