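Purpose

We run on Alibaba Cloud's managed Kubernetes service, so we need a monitoring and alerting system that watches the state of the whole system and notifies us when a pod's resource usage is too high or a node is running short of resources. Prometheus is well suited to monitoring Kubernetes clusters, so we use Alibaba Cloud's ack-prometheus-operator to monitor resources both inside and outside the cluster.

Components

Deploy Prometheus + Alertmanager + Grafana inside Kubernetes, with alerts delivered through WeChat Work (企业微信).

Create a WeChat Work application

See https://blog.51cto.com/2939281/2654424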


Deploy ack-prometheus-operator
1. Install

In the Alibaba Cloud console, open Container Service - Kubernetes, go to the App Catalog under Marketplace, open ack-prometheus-operator, and create it in an existing cluster.


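2. Modify the parameters

Go to the Releases page, where the newly created ack-prometheus-operator is listed. Click Update and modify the parameters.

Modify the alertmanager parameters (note: to_tag is a tag created through the WeChat Work API; it works like a group, and one tag can be associated with multiple members).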

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 1m
      group_interval: 1m
      repeat_interval: 2m
      receiver: "wechat"     # default receiver: WeChat Work
      routes:
        - match:
            alertname: Watchdog
          receiver: "wechat"
    receivers:
      - name: "wechat"    # WeChat Work receiver definition
        wechat_configs:
          - corp_id: "123456789"  # WeChat Work corp ID
            to_party: "12334"     # department ID
            to_tag: "1"           # tag created through the WeChat Work API
            agent_id: "1111"      # application (agent) ID
            api_secret: "11111"   # application secret
            send_resolved: true
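
Before publishing, it can help to confirm that corp_id and api_secret are accepted by WeChat Work. The following is a minimal Python sketch, not part of the original setup: it builds the gettoken request against https://qyapi.weixin.qq.com/cgi-bin/, which is Alertmanager's default WeChat API base. The helper names build_gettoken_url and fetch_token are illustrative, and the credentials are the placeholder values from the config above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default WeChat Work API base used by Alertmanager unless it is overridden.
WECHAT_API_BASE = "https://qyapi.weixin.qq.com/cgi-bin/"


def build_gettoken_url(corp_id, api_secret, base=WECHAT_API_BASE):
    """Return the token request URL for a corp ID and application secret."""
    query = urlencode({"corpid": corp_id, "corpsecret": api_secret})
    return f"{base}gettoken?{query}"


def fetch_token(corp_id, api_secret):
    """Call the API (needs network access); raises if the credentials are rejected."""
    with urlopen(build_gettoken_url(corp_id, api_secret), timeout=10) as resp:
        body = json.load(resp)
    if body.get("errcode", 0) != 0:
        raise RuntimeError(f"gettoken failed: {body.get('errmsg')}")
    return body["access_token"]


# Placeholder values from the config above; use real ones before calling fetch_token.
print(build_gettoken_url("123456789", "11111"))
```

If fetch_token raises with real credentials, Alertmanager will most likely fail to send as well, so it is worth fixing the corp ID or secret first.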



templates:      # template path
  - '*.tmpl'
templateFiles:             # alert templates
  wechat.tmpl: |-          # define the wechat.tmpl alert template
    {{ define "wechat.default.message" }}
    {{ range .Alerts }}
    ========start==========
    Alert source: prometheus_alert
    Severity: {{ .Labels.severity }}
    Alert name: {{ .Labels.alertname }}
    Affected host: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    Triggered at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    ========end==========
    {{ end }}
    {{ end }}
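
The 28800e9 passed to .StartsAt.Add is a duration in nanoseconds, that is, 8 hours, which shifts a UTC timestamp to UTC+8. The short Python check below reproduces that arithmetic. The sample timestamp is arbitrary, and the sketch assumes the alert's start time is in UTC.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the literal used in the template, in nanoseconds

offset = timedelta(seconds=OFFSET_NS / 1e9)


def format_local(starts_at_utc):
    """Mimic (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05"."""
    return (starts_at_utc + offset).strftime("%Y-%m-%d %H:%M:%S")


sample = datetime(2021, 3, 10, 2, 30, 0, tzinfo=timezone.utc)  # arbitrary example time
print(offset)                # 8:00:00
print(format_local(sample))  # 2021-03-10 10:30:00
```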



# Alertmanager storage configuration

storage:
  volumeClaimTemplate:
    spec:
      storageClassName: alicloud-disk-ssd    
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi
  selector: {}

   


Modify the grafana parameters:

grafana:
  enabled: true
  defaultDashboardsEnabled: true
  adminPassword: 123456   # Grafana admin password



Modify the prometheus parameters:

prometheus:
  storageSpec:    # add storage for Prometheus
    volumeClaimTemplate:
      spec:
        storageClassName: alicloud-disk-ssd
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
  selector: {}


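3. Configure ingress access

Add an ingress for grafana.xx.com with its backend pointing to ack-prometheus-operator-grafana.

Add an ingress for alertmanager.xx.com with its backend pointing to ack-prometheus-operator-alertmanager.

Add an ingress for ps.xx.com with its backend pointing to ack-prometheus-operator-prometheus.

4. Add scrape jobs

Under Secrets in the console, find ack-prometheus-operator-prometheus-scrape-confg and edit its contents.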

additional-scrape-configs.yaml


- job_name: mysql     # MySQL scrape job
  static_configs:
  - labels:
      instance: mysql
    targets: ['192.168.1.10:9104']
- job_name: redis    # Redis scrape job
  static_configs:
  - labels:
      instance: redis
    targets:
    - 192.168.1.11:9121
- job_name: web_status    # HTTP response time and status
  metrics_path: /probe
  params:
    module:
    - http_2xx
  relabel_configs:
  - source_labels:
    - __address__
    target_label: __param_target
  - replacement: 192.168.1.15:9115    # server where blackbox_exporter is installed
    target_label: __address__
  static_configs:
  - labels:
      group: web
      instance: web_status
    targets:
    - https://192.168.1.109
- honor_timestamps: true
  job_name: 192.168.1.200        # server (node) status scrape job
  metrics_path: /metrics
  scheme: http
  scrape_interval: 30s
  scrape_timeout: 10s
  static_configs:
  - targets:
    - 192.168.1.200:9100
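
The relabel_configs in the web_status job are the least obvious part of this file. They move the listed target into the target query parameter and then point the scrape address at the blackbox exporter. The Python sketch below imitates those two relabel steps to show the URL that ends up being requested. It is an illustration of the idea, not Prometheus code, and probe_url is a made-up helper name.

```python
from urllib.parse import urlencode


def probe_url(target, exporter="192.168.1.15:9115", module="http_2xx"):
    """Imitate the two relabel steps of the web_status job and return the scrape URL."""
    labels = {"__address__": target}  # initial label set for a static target

    # Step 1: source_labels [__address__] -> target_label __param_target
    labels["__param_target"] = labels["__address__"]

    # Step 2: replacement <exporter> -> target_label __address__
    labels["__address__"] = exporter

    # params.module from the job plus the __param_target label become query parameters.
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


print(probe_url("https://192.168.1.109"))
```

Requesting the printed URL by hand (for example with curl from a machine that can reach the exporter) is a quick way to debug a probe before adding it to the scrape config.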



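5. Configure alert rules

Rules can be set at creation time through the additionalPrometheusRules parameter, or configured afterwards through a configmap. ack-prometheus-operator installs many rules by default.

You can also add custom rules, for example rules for MySQL.

Run kubectl -n monitoring get prometheusrules to list the existing rules, then export one of them as a template and edit its contents:

kubectl -n monitoring get prometheusrules ack-prometheus-operator-alertmanager.rules -o yaml > ack-prometheus-operator-mysql.rules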

cat ack-prometheus-operator-mysql.rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
 generation: 1
 labels:
   app: ack-prometheus-operator
   chart: ack-prometheus-operator-5.7.0
   heritage: Tiller
   release: ack-prometheus-operator
 name: ack-prometheus-operator-mysql.rules
 namespace: monitoring
spec:
 groups:
 - name: MySQLStatsAlert
   rules:
   - alert: MySQL is down
     expr: mysql_up == 0
     for: 5m
     labels:
       severity: critical
     annotations:
       summary: "Instance {{ $labels.instance }} MySQL is down"
       description: "MySQL database is down. This requires immediate action!"
   - alert: open files high
     expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} open files high"
       description: "Open files is high. Please consider increasing open_files_limit."
   - alert: Read buffer size is bigger than max. allowed packet size
     expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet  
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
       description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet).This can break your replication."
   - alert: Sort buffer possibly misconfigured
     expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
       description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
   - alert: Thread stack size is too small
     expr: mysql_global_variables_thread_stack <196608
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Thread stack size is too small"
       description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical is 256k for thread_stack_size."
   - alert: Used more than 80% of max connections limited  
     expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
       description: "Used more than 80% of max connections limited"
   - alert: InnoDB Force Recovery is enabled
     expr: mysql_global_variables_innodb_force_recovery != 0  
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
       description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
   - alert: InnoDB Log File size is too small
     expr: mysql_global_variables_innodb_log_file_size < 16777216  
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
       description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
   - alert: InnoDB Flush Log at Transaction Commit
     expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
       description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
   - alert: Table definition cache too small
     expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Table definition cache too small"
       description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
   - alert: Table open cache too small
     expr: mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Table open cache too small"
       description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
   - alert: Thread stack size is possibly too small
     expr: mysql_global_variables_thread_stack < 262144
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
       description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical is 256k for thread_stack_size."
   - alert: InnoDB Buffer Pool Instances is too small
     expr: mysql_global_variables_innodb_buffer_pool_instances == 1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
       description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
   - alert: InnoDB Plugin is enabled
     expr: mysql_global_variables_ignore_builtin_innodb == 1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
       description: "InnoDB Plugin is enabled"
   - alert: Binary Log is disabled
     expr: mysql_global_variables_log_bin != 1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Binary Log is disabled"
       description: "Binary Log is disabled. This prohibits you to do Point in Time Recovery (PiTR)."
   - alert: Binlog Cache size too small
     expr: mysql_global_variables_binlog_cache_size < 1048576
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
       description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
   - alert: Binlog Statement Cache size too small
     expr: mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
       description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
   - alert: Binlog Transaction Cache size too small
     expr: mysql_global_variables_binlog_cache_size  <1048576
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
       description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
   - alert: Sync Binlog is enabled
     expr: mysql_global_variables_sync_binlog == 1
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
       description: "Sync Binlog is enabled. This leads to higher data security but on the cost of write performance."


Finally, apply this YAML file: kubectl -n monitoring apply -f ack-prometheus-operator-mysql.rules
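
Several of the MySQL rules above compare against raw byte counts, which are hard to read at a glance. The small Python helper below (human is an illustrative name) converts the literals used in those expressions into binary units, so they can be checked against the sizes mentioned in the descriptions.

```python
def human(n):
    """Format a byte count using binary units (KiB, MiB, GiB)."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return f"{n:g} {unit}"
        n /= 1024


# Literals taken from the rule expressions above.
thresholds = {
    "thread_stack (too small)": 196608,
    "thread_stack (possibly too small)": 262144,
    "innodb_log_file_size": 16777216,
    "binlog_cache_size": 1048576,
    "innodb_sort_buffer_size lower bound": 256 * 1024,
    "read_buffer_size upper bound": 4 * 1024 * 1024,
}

for name, value in thresholds.items():
    print(f"{name}: {value} = {human(value)}")
```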

6. Configure Grafana

Log in and import the dashboards you need. Go to https://grafana.com/grafana/dashboards, search for the dashboard you want (for example mysql), copy its ID, and import that ID in Grafana.

7. Pitfalls

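After I added the WeChat Work alert template, it never took effect. The alertmanager container log kept reporting the error "wechat.tmpl" not defined, meaning the template could not be found. I added the templates parameter to the alertmanager parameters and republished, but the same "wechat.tmpl" not defined error appeared, so the configuration had still not taken effect.

Later, looking at the alertmanager YAML, I found that the configuration file is mapped from a Secret key into a volume at /etc/alertmanager/config. The corresponding Secret key had no templates configuration, so I added the templates configuration to the Secret. A few minutes after submitting the change, the configuration took effect.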
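
To check for this without opening the console, one option is to dump the Secret as JSON with kubectl -n monitoring get secret <secret-name> -o json (the Secret name is left as a placeholder here) and decode its base64 values. The Python sketch below does the decoding and looks for the templates section. The key name alertmanager.yaml and the sample contents are assumptions for illustration, and inspect_alertmanager_secret is a made-up helper.

```python
import base64
import json


def inspect_alertmanager_secret(secret_json):
    """Decode a Secret (JSON form) and report whether the template settings are present."""
    secret = json.loads(secret_json)
    data = {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }
    config = data.get("alertmanager.yaml", "")  # assumed key name
    has_templates = any(
        line.strip().startswith("templates:") for line in config.splitlines()
    )
    return {"templates_in_config": has_templates, "wechat_tmpl_key": "wechat.tmpl" in data}


def _b64(text):
    """Encode text the way Secret data values are stored."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical Secret contents, for illustration only.
broken = json.dumps({"data": {"alertmanager.yaml": _b64("route:\n  receiver: wechat\n")}})
fixed = json.dumps({"data": {
    "alertmanager.yaml": _b64("route:\n  receiver: wechat\ntemplates:\n- '*.tmpl'\n"),
    "wechat.tmpl": _b64('{{ define "wechat.default.message" }}...{{ end }}'),
}})

print(inspect_alertmanager_secret(broken))
print(inspect_alertmanager_secret(fixed))
```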
