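Because we run on Alibaba Cloud's managed Kubernetes service, we need a monitoring and alerting system that tracks the state of the whole system. When a pod's resource usage runs too high or a node runs short of resources, we want an alert to tell us. Prometheus is well suited to monitoring Kubernetes clusters, so we use Alibaba Cloud's ack-prometheus-operator to deploy a stack that monitors resources both inside and outside the cluster.
Components: Prometheus + Alertmanager + Grafana deployed inside Kubernetes, with alerts delivered through WeChat Work (企业微信).
To create the WeChat Work application, see https://blog.51cto.com/2939281/2654424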
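1. Install
In the Alibaba Cloud console, open Container Service - Kubernetes, go to the App Catalog under Marketplace, select ack-prometheus-operator, and create it in an existing cluster.
2. Modify parameters
On the Releases page you can see the ack-prometheus-operator release that was just created. Click Update and edit its parameters.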
Modify the Alertmanager parameters. Note: to_tag is a tag created through the WeChat Work API; it works like a group, and one tag can be associated with multiple members. In the template, 28800e9 is 8 hours expressed in nanoseconds, so the trigger time is shown in UTC+8.

    alertmanager:
      config:
        global:
          resolve_timeout: 5m
        route:
          group_by: ['job']
          group_wait: 1m
          group_interval: 1m
          repeat_interval: 2m
          receiver: "wechat"          # default receiver changed to WeChat Work
          routes:
          - match:
              alertname: Watchdog
            receiver: "wechat"
        receivers:
        - name: "wechat"              # WeChat Work receiver
          wechat_configs:
          - corp_id: "123456789"      # WeChat Work corp ID
            to_party: "12334"         # department ID
            to_tag: "1"               # tag created through the WeChat Work API
            agent_id: "1111"          # application ID
            api_secret: "11111"       # application secret
            send_resolved: true
        templates:                    # template path
        - '*.tmpl'
      templateFiles:                  # alert templates
        wechat.tmpl: |-               # alert template wechat.tmpl
          {{ define "wechat.default.message" }}
          {{ range .Alerts }}
          ========start==========
          Alert source: prometheus_alert
          Severity: {{ .Labels.severity }}
          Alert name: {{ .Labels.alertname }}
          Instance: {{ .Labels.instance }}
          Summary: {{ .Annotations.summary }}
          Details: {{ .Annotations.description }}
          Triggered at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
          ========end==========
          {{ end }}
          {{ end }}
      # Alertmanager storage configuration
      # (in the upstream prometheus-operator chart this key sits under alertmanagerSpec)
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: alicloud-disk-ssd
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 20Gi
            selector: {}
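One way to check the wechat receiver without waiting for a real incident is to post a synthetic alert to Alertmanager. The sketch below assumes the Alertmanager service has been port-forwarded to 127.0.0.1:9093 (for example with kubectl -n monitoring port-forward svc/ack-prometheus-operator-alertmanager 9093) and that the deployed version serves the v2 API. The alert name WechatRouteTest, the function name, and all label values are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumed local address of Alertmanager (via kubectl port-forward); adjust as needed.
am_url="http://127.0.0.1:9093"

# Synthetic alert. The labels and annotations are the ones wechat.tmpl prints;
# every value here is illustrative.
payload='[{
  "labels": {
    "alertname": "WechatRouteTest",
    "severity": "warning",
    "instance": "test-host",
    "job": "manual-test"
  },
  "annotations": {
    "summary": "Test alert for the wechat receiver",
    "description": "Sent by hand to check the notification template"
  }
}]'

# POST the alert to the Alertmanager v2 API.
send_test_alert() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$payload" "${am_url}/api/v2/alerts"
}

# Print the request that send_test_alert would make, without sending it.
printf 'POST %s/api/v2/alerts\n%s\n' "$am_url" "$payload"
```

Calling send_test_alert sends the alert. With group_wait: 1m as configured above, the message should reach WeChat Work about a minute later.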
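Modify the Grafana parameters:

    grafana:
      enabled: true
      defaultDashboardsEnabled: true
      adminPassword: "123456"         # Grafana admin password (quoted so YAML reads it as a string)

Modify the Prometheus parameters:

    prometheus:
      # Prometheus storage
      # (in the upstream prometheus-operator chart this key sits under prometheusSpec)
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: alicloud-disk-ssd
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi
            selector: {}

3. Configure Ingress access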
Add an Ingress for grafana.xx.com with its backend pointing to ack-prometheus-operator-grafana.
Add an Ingress for alertmanager.xx.com with its backend pointing to ack-prometheus-operator-alertmanager.
Add an Ingress for ps.xx.com with its backend pointing to ack-prometheus-operator-prometheus.
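If you prefer the command line to the console, the three Ingress entries can be sketched with kubectl create ingress. This assumes kubectl 1.19 or newer against a cluster that serves networking.k8s.io/v1, and the monitoring namespace. The service ports 80 (Grafana), 9093 (Alertmanager), and 9090 (Prometheus) are the components' usual defaults and are not stated in this article, so check them with kubectl -n monitoring get svc first. The Ingress names and function names are made up.

```shell
#!/usr/bin/env bash
# Build a --rule value of the form host/path=service:port.
# A path ending in * is created as a Prefix match.
ingress_rule() {
  local host="$1" service="$2" port="$3"
  printf '%s/*=%s:%s' "$host" "$service" "$port"
}

# Create one Ingress per component. The ports are assumptions; verify with
#   kubectl -n monitoring get svc
create_ingresses() {
  kubectl -n monitoring create ingress grafana \
    --rule="$(ingress_rule grafana.xx.com ack-prometheus-operator-grafana 80)"
  kubectl -n monitoring create ingress alertmanager \
    --rule="$(ingress_rule alertmanager.xx.com ack-prometheus-operator-alertmanager 9093)"
  kubectl -n monitoring create ingress prometheus \
    --rule="$(ingress_rule ps.xx.com ack-prometheus-operator-prometheus 9090)"
}

# Show one rule string without touching the cluster.
ingress_rule grafana.xx.com ack-prometheus-operator-grafana 80
echo
```

Run create_ingresses only after confirming the service names and ports in your cluster.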
4. Add monitoring jobs
Under Secrets (保密字典), find ack-prometheus-operator-prometheus-scrape-confg and edit its contents.
additional-scrape-configs.yaml

    - job_name: mysql                 # MySQL monitoring job
      static_configs:
      - labels:
          instance: mysql
        targets: ['192.168.1.10:9104']
    - job_name: redis                 # Redis monitoring job
      static_configs:
      - labels:
          instance: redis
        targets:
        - 192.168.1.11:9121
    - job_name: web_status            # HTTP response time and status
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: 192.168.1.15:9115   # server where the blackbox exporter is installed
        target_label: __address__
      static_configs:
      - labels:
          group: web
          instance: web_status
        targets:
        - https://192.168.1.109
    - honor_timestamps: true
      job_name: 192.168.1.200         # server status monitoring job
      metrics_path: /metrics
      scheme: http
      scrape_interval: 30s
      scrape_timeout: 10s
      static_configs:
      - targets:
        - 192.168.1.200:9100
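The relabel_configs in the web_status job make Prometheus scrape the blackbox exporter rather than the site itself: the original target becomes the target query parameter, and __address__ is replaced with the exporter's address. As a sketch, the same request can be reproduced by hand to check the exporter before waiting for Prometheus. It assumes the exporter at 192.168.1.15:9115 is reachable from where you run it and that curl is installed. The function name is made up.

```shell
#!/usr/bin/env bash
exporter="192.168.1.15:9115"     # blackbox exporter (the relabeled __address__)
module="http_2xx"                # params.module in the job
target="https://192.168.1.109"   # original static_configs target (__param_target)

probe_url="http://${exporter}/probe"

# Issue the same GET that Prometheus sends; -G turns the data into a query
# string and --data-urlencode escapes the target URL.
probe_once() {
  curl -sS -G "$probe_url" \
    --data-urlencode "module=${module}" \
    --data-urlencode "target=${target}"
}

# Show the request (unescaped, for readability) without sending it.
echo "GET ${probe_url}?module=${module}&target=${target}"
```

Calling probe_once prints the exporter's metrics for that target; a probe_success value of 1 indicates the check passed.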
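Rules can be set at creation time through the additionalPrometheusRules parameter, or added later as PrometheusRule resources, which the operator renders into ConfigMaps for Prometheus. ack-prometheus-operator installs many rules by default.
You can also add custom rules. Example: adding MySQL rules.
Run kubectl -n monitoring get prometheusrules to list the rules that currently exist.
To create a template, run kubectl -n monitoring get prometheusrules ack-prometheus-operator-alertmanager.rules -o yaml > ack-prometheus-operator-mysql.rules, then edit its contents.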
cat ack-prometheus-operator-mysql.rules
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      generation: 1
      labels:
        app: ack-prometheus-operator
        chart: ack-prometheus-operator-5.7.0
        heritage: Tiller
        release: ack-prometheus-operator
      name: ack-prometheus-operator-mysql.rules
      namespace: monitoring
    spec:
      groups:
      - name: MySQLStatsAlert
        rules:
        - alert: MySQL is down
          expr: mysql_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} MySQL is down"
            description: "MySQL database is down. This requires immediate action!"
        - alert: open files high
          expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} open files high"
            description: "Open files is high. Please consider increasing open_files_limit."
        - alert: Read buffer size is bigger than max. allowed packet size
          expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
            description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
        - alert: Sort buffer possibly missconfigured
          expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Sort buffer possibly missconfigured"
            description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
        - alert: Thread stack size is too small
          expr: mysql_global_variables_thread_stack < 196608
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Thread stack size is too small"
            description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical value is 256k for thread_stack_size."
        - alert: Used more than 80% of max connections limited
          expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
            description: "Used more than 80% of max connections limited"
        - alert: InnoDB Force Recovery is enabled
          expr: mysql_global_variables_innodb_force_recovery != 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
            description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
        - alert: InnoDB Log File size is too small
          expr: mysql_global_variables_innodb_log_file_size < 16777216
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
            description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
        - alert: InnoDB Flush Log at Transaction Commit
          expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
            description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
        - alert: Table definition cache too small
          expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Table definition cache too small"
            description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
        - alert: Table open cache too small
          expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Table open cache too small"
            description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
        - alert: Thread stack size is possibly too small
          expr: mysql_global_variables_thread_stack < 262144
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
            description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical value is 256k for thread_stack_size."
        - alert: InnoDB Buffer Pool Instances is too small
          expr: mysql_global_variables_innodb_buffer_pool_instances == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
            description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
        - alert: InnoDB Plugin is enabled
          expr: mysql_global_variables_ignore_builtin_innodb == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
            description: "InnoDB Plugin is enabled"
        - alert: Binary Log is disabled
          expr: mysql_global_variables_log_bin != 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Binary Log is disabled"
            description: "Binary Log is disabled. This prevents Point in Time Recovery (PiTR)."
        - alert: Binlog Cache size too small
          expr: mysql_global_variables_binlog_cache_size < 1048576
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
            description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
        - alert: Binlog Statement Cache size too small
          expr: mysql_global_variables_binlog_stmt_cache_size < 1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
            description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
        - alert: Binlog Transaction Cache size too small
          expr: mysql_global_variables_binlog_cache_size < 1048576
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
            description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
        - alert: Sync Binlog is enabled
          expr: mysql_global_variables_sync_binlog == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
            description: "Sync Binlog is enabled. This leads to higher data security but at the cost of write performance."
Finally, apply the YAML file: kubectl -n monitoring apply -f ack-prometheus-operator-mysql.rules
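A couple of quick checks can catch a rule file that was mangled while editing. The sketch below counts the alert entries in the local file and, as a separate step that needs cluster access, asks the API server for the object by the name used in this article. The function names are made up.

```shell
#!/usr/bin/env bash
rule_file="ack-prometheus-operator-mysql.rules"   # local file edited above
rule_name="ack-prometheus-operator-mysql.rules"   # metadata.name inside that file

# Count the "- alert:" entries in a rule file.
count_alerts() {
  grep -cE '^[[:space:]]*-[[:space:]]*alert:' "$1"
}

# Confirm the PrometheusRule object exists in the cluster (needs kubectl access).
check_rule_applied() {
  kubectl -n monitoring get prometheusrules "$rule_name"
}

# Only inspect the local file if it is present in the current directory.
if [ -f "$rule_file" ]; then
  echo "alerts defined: $(count_alerts "$rule_file")"
fi
```

After applying, check_rule_applied should list the new object alongside the default rules.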
5. Grafana configuration
Log in and import the dashboards you need. Go to https://grafana.com/grafana/dashboards, search for the dashboard you want (for example, mysql), copy its ID, and import that ID in Grafana.
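6. Pitfalls
After we added the WeChat Work alert template, it never took effect. The Alertmanager container log kept reporting the error "wechat.tmpl" not defined, meaning the template could not be found. We added the templates parameter to the Alertmanager parameters and republished, but the same "wechat.tmpl" not defined error appeared, so the configuration had still not taken effect.
Looking at the Alertmanager YAML, we then found that the configuration file is mounted from a Secret key into a volume at /etc/alertmanager/config. The corresponding Secret key had no templates entry, so we added the templates configuration to the Secret. A few minutes after submitting the change, the configuration took effect.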
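To see what Alertmanager actually loads, the Secret can be decoded and checked for a top-level templates key. This is a sketch: the key name alertmanager.yaml is the one prometheus-operator conventionally uses, and the Secret name must be supplied by you (find it with kubectl -n monitoring get secret). Neither is stated in this article, and the function names are made up.

```shell
#!/usr/bin/env bash
# Print the decoded Alertmanager config stored in the given Secret.
# Assumes the config is stored under the key alertmanager.yaml.
show_alertmanager_config() {
  local secret_name="$1"
  kubectl -n monitoring get secret "$secret_name" \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode
}

# Read a config on stdin; succeed only if it has a top-level "templates:" key.
has_templates_key() {
  grep -q '^templates:'
}

# Demonstrate the check on an inline sample, no cluster needed.
sample_with="$(printf 'global:\n  resolve_timeout: 5m\ntemplates:\n- "*.tmpl"\n')"
if printf '%s\n' "$sample_with" | has_templates_key; then
  echo "templates key present"
fi
```

Against a real cluster, the two functions can be chained as show_alertmanager_config SECRET_NAME | has_templates_key, where SECRET_NAME is the name you looked up.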