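Background
RabbitMQ failures have been interrupting our services, so we need to monitor RabbitMQ. This post records how that monitoring was set up.
Finding the documentation
The Prometheus documentation explains that exporters turn the state of third-party services into Prometheus metrics. Scrolling down that page leads to the RabbitMQ Exporter. According to that project's README, the official RabbitMQ plugin can be used instead.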
Enable the rabbitmq_prometheus plugin
Skip this step if rabbitmq_prometheus is already enabled.
rabbitmq-plugins enable rabbitmq_prometheus
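The "skip if already enabled" check can be made explicit by looking for the plugin in the list of enabled plugins first. The following is a minimal sketch, not a required step: `plugin_enabled` is a hypothetical helper introduced here, and it assumes that `rabbitmq-plugins list -e -m` prints enabled plugin names one per line.

```shell
# Hypothetical helper: succeeds if the plugin name given as $1 appears
# as a whole line in the plugin list read from stdin.
plugin_enabled() {
  grep -qxF -- "$1"
}

# Example use on a RabbitMQ node (not run here):
#   rabbitmq-plugins list -e -m | plugin_enabled rabbitmq_prometheus \
#     || rabbitmq-plugins enable rabbitmq_prometheus
```

Enabling a plugin that is already enabled is harmless, so the check mainly saves a step in scripted setups.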
View the RabbitMQ Prometheus metrics
The metrics page on the official RabbitMQ site describes each metric in detail. The plugin serves the metrics at:
http://localhost:15692/metrics
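To spot-check a single metric without reading the whole page, the output can be filtered on the command line. This is only a sketch: `metric_value` is a hypothetical helper, and it assumes the metric appears as an un-labelled `name value` line in the endpoint's output.

```shell
# Hypothetical helper: print the value of the un-labelled metric named $1
# from Prometheus text-format output read on stdin.
# Comment lines (# TYPE, # HELP) are skipped because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example use against the endpoint above (not run here):
#   curl -s http://localhost:15692/metrics | metric_value rabbitmq_queues
```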
Configure the Prometheus ServiceMonitor
Deploy the RabbitMQ ServiceMonitor
kubectl apply -f rabbitmq-svcmonitor.yaml
The file being applied is:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rabbitmq-exporter
  namespace: monitoring
  labels:
    k8s-app: rabbitmq-exporter
spec:
  endpoints:
  - port: metrics # name of the port on the monitored Service
    interval: 15s # how often metrics are scraped
  selector:
    matchLabels:
      app.porometheus: loshu-lite-rabbitmq-ha # selects the Service by label
  jobLabel: k8s-app # Service label whose value is used as the job name
  namespaceSelector:
    matchNames: # namespaces the targets live in
    - my-simulation
$ kubectl get servicemonitor -n monitoring rabbitmq-exporter
NAME                AGE
rabbitmq-exporter   4m41s
Check Prometheus Status > Targets
The RabbitMQ target shows up on this page, which means Prometheus is now scraping RabbitMQ successfully.
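Configure the Grafana dashboard
The Grafana dashboard is open source and comes from the rabbitmq-server GitHub repository. Import the JSON file or the dashboard ID into Grafana.
- Select the data source and folder
View the dashboard
Many panels return no data, possibly because of a version mismatch. You can build a custom dashboard from the RabbitMQ metrics to suit your own setup.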
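Confirming the metrics to monitor
After the problem was reproduced, we observed the actual behavior to identify the relevant metrics, then configured alerts based on them.
Reproducing the problem
No self-healing fix has been found for this problem so far. The only remedy is to delete the virtual host / promptly after an alert fires and then recreate it. Handling the problem manually as soon as the alert arrives limits its impact.
Error log:
The console shows the following error: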
Virtual host / experienced an error on node rabbit@loshu-log-rabbitmq-ha-0.loshu-log-rabbitmq-ha-discovery.kube-public.svc.cluster.local and may be inaccessible
Locating the metrics
The error message above points to the following related metrics.
Total number of connections opened
# TYPE rabbitmq_connections_opened_total counter
# HELP rabbitmq_connections_opened_total Total number of connections opened
rabbitmq_connections_opened_total 0
Total number of channels opened
# TYPE rabbitmq_channels_opened_total counter
# HELP rabbitmq_channels_opened_total Total number of channels opened
rabbitmq_channels_opened_total 0
Total number of channels closed
# TYPE rabbitmq_channels_closed_total counter
# HELP rabbitmq_channels_closed_total Total number of channels closed
rabbitmq_channels_closed_total 0
Total number of queues declared
# TYPE rabbitmq_queues_declared_total counter
# HELP rabbitmq_queues_declared_total Total number of queues declared
rabbitmq_queues_declared_total 0
Open TCP sockets
# TYPE rabbitmq_process_open_tcp_sockets gauge
# HELP rabbitmq_process_open_tcp_sockets Open TCP sockets
rabbitmq_process_open_tcp_sockets 0
Consumers currently connected
# TYPE rabbitmq_consumers gauge
# HELP rabbitmq_consumers Consumers currently connected
rabbitmq_consumers 0
Connections currently open
# TYPE rabbitmq_connections gauge
# HELP rabbitmq_connections Connections currently open
rabbitmq_connections 0
Channels currently open
# TYPE rabbitmq_channels gauge
# HELP rabbitmq_channels Channels currently open
rabbitmq_channels 0
Queues available
# TYPE rabbitmq_queues gauge
# HELP rabbitmq_queues Queues available
rabbitmq_queues 0
Choosing the metrics
Based on the metrics above, an alert can fire when the consumer, channel, and queue counts are 0.
Consumers currently connected
# TYPE rabbitmq_consumers gauge
# HELP rabbitmq_consumers Consumers currently connected
rabbitmq_consumers 0
Channels currently open
# TYPE rabbitmq_channels gauge
# HELP rabbitmq_channels Channels currently open
rabbitmq_channels 0
Queues available
# TYPE rabbitmq_queues gauge
# HELP rabbitmq_queues Queues available
rabbitmq_queues 0
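Before wiring this condition into alerting, the same check can be tried by hand against the metrics endpoint. The sketch below is illustrative only: `all_three_zero` is a hypothetical helper, and it assumes the three gauges appear as un-labelled `name value` lines, as in the samples above.

```shell
# Hypothetical check: read Prometheus text-format metrics on stdin and
# succeed (exit 0) only if rabbitmq_consumers, rabbitmq_channels and
# rabbitmq_queues are all present and all equal to 0.
all_three_zero() {
  awk '
    $1 == "rabbitmq_consumers" || $1 == "rabbitmq_channels" || $1 == "rabbitmq_queues" {
      seen++
      if ($2 + 0 != 0) nonzero++
    }
    END {
      if (seen == 3 && nonzero == 0) exit 0
      exit 1
    }
  '
}

# Example use (not run here):
#   curl -s http://localhost:15692/metrics | all_three_zero \
#     && echo "consumers, channels and queues are all 0"
```

Requiring all three metrics to be present keeps an empty or failed scrape from being mistaken for the fault condition.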
Configuring the alerts
groups:
- name: RabbitMQAlerts
  rules:
  - alert: RabbitmqConsumersUnavailable
    annotations:
      message: The consumer of the rabbitmq connection is 0 for {{ $labels.namespace }}/{{ $labels.service }}.
    expr: |
      rabbitmq_consumers == 0
    for: 1m
    labels:
      severity: warning
  - alert: RabbitmqChannelsUnavailable
    annotations:
      message: The currently open Channels of rabbitmq is 0 for {{ $labels.namespace }}/{{ $labels.service }}.
    expr: |
      rabbitmq_channels == 0
    for: 1m
    labels:
      severity: warning
  - alert: RabbitmqQueuesUnavailable
    annotations:
      message: The available Queues of rabbitmq is 0 for {{ $labels.namespace }}/{{ $labels.service }}.
    expr: |
      rabbitmq_queues == 0
    for: 1m
    labels:
      severity: warning
  - alert: RabbitmqVirtualHostUnavailable
    annotations:
      message: Rabbitmq Virtual Host Unavailable for {{ $labels.namespace }}/{{ $labels.service }}.
    expr: |
      sum(rabbitmq_consumers + rabbitmq_channels + rabbitmq_queues) by (namespace, service) == 0
    for: 1m
    labels:
      severity: critical