1. Official resources

https://github.com/apache/rocketmq-exporter

Official template

Grafana Dashboard ID: 10477, name: RocketMQ Exporter Overview. For details of the dashboard please see RocketMQ Exporter Overview.

2. Common metrics

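| Type | Metric | Description |
| --- | --- | --- |
| Broker | rocketmq_broker_tps | Messages produced per second on a single broker |
| Broker | rocketmq_broker_qps | QPS of a single broker (requests handled per second) |
| Producer | rocketmq_producer_tps | Messages produced per second for a single topic (production TPS) |
| Producer | rocketmq_producer_message_size | Total size of the messages produced per second for a single topic |
| Producer | rocketmq_producer_offset | Production offset of a single topic |
| Consumer Groups | rocketmq_consumer_tps | Messages consumed per second by a single consumer group (consumption TPS) |
| Consumer Groups | rocketmq_consumer_message_size | Total size of the messages consumed per second by a single consumer group |
| Consumer Groups | rocketmq_consumer_offset | Consumption offset of a single consumer group |
| Consumer Groups | rocketmq_group_get_latency_by_storetime | Consumption delay time of a single consumer group |
| Consumer Groups | rocketmq_group_get_latency | Consumer latency on a topic for a single queue |
| Consumer Groups | rocketmq_message_accumulation | Number of messages a single consumer group is lagging behind |
| Consumer | rocketmq_client_consume_fail_msg_count | Messages a consumer failed to consume within one hour |
| Consumer | rocketmq_client_consume_fail_msg_tps | Messages a consumer fails to consume per second |
| Consumer | rocketmq_client_consume_ok_msg_tps | Messages a consumer consumes successfully per second |
| Consumer | rocketmq_client_consume_rt | Average time to consume each message |
| Consumer | rocketmq_client_consumer_pull_rt | Average time to pull each message |
| Consumer | rocketmq_client_consumer_pull_tps | Messages pulled by the client per second |
| Container | container_cpu_usage_seconds_total | Container CPU usage (a cumulative count of CPU seconds; take its rate to get utilization) |
| Container | container_memory_usage_bytes | Memory currently in use |
| Container | container_fs_usage_bytes | Container disk space usage |
| Container | container_fs_writes_bytes_total | Disk write speed (a cumulative count of bytes written; take its rate) |
| Container | container_fs_reads_bytes_total | Disk read speed (a cumulative count of bytes read; take its rate) |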
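The container metrics whose names end in `_total` are cumulative counters, so a usage or speed figure has to be derived from two samples. A minimal sketch of that derivation follows; the function name and the sample values are hypothetical, and in practice Prometheus' own `rate()` function does this work for you.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of a cumulative counter."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, which usually means it was reset
        # (for example after a container restart): count from zero.
        delta = curr_value
    return delta / elapsed


# Hypothetical samples of container_cpu_usage_seconds_total, 60 s apart.
cpu_cores_used = counter_rate(1200.0, 0.0, 1230.0, 60.0)
print(cpu_cores_used)  # 0.5, i.e. half a CPU core on average
```

The same function applies to `container_fs_writes_bytes_total` and `container_fs_reads_bytes_total`, where the result is bytes per second.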

Broker runtime (rocketmq_brokeruntime):

  • rocketmq_brokeruntime_commitlog_disk_ratio
  • rocketmq_brokeruntime_consumequeue_disk_ratio
  • rocketmq_brokeruntime_commitlogdir_capacity_free
  • rocketmq_brokeruntime_commitlogdir_capacity_total
  • rocketmq_brokeruntime_commitlog_maxoffset
  • rocketmq_brokeruntime_commitlog_minoffset
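As a rough illustration of how the two capacity metrics can be combined, the sketch below computes the used fraction of the commit log directory. It assumes that `rocketmq_brokeruntime_commitlogdir_capacity_free` and `rocketmq_brokeruntime_commitlogdir_capacity_total` are reported in the same unit; the function name and the sample numbers are hypothetical.

```python
def commitlog_dir_used_ratio(capacity_free, capacity_total):
    """Fraction of the commit log directory in use (0.0 to 1.0).

    Assumes both values come from the same broker and share one unit.
    """
    if capacity_total <= 0:
        raise ValueError("capacity_total must be positive")
    return 1.0 - capacity_free / capacity_total


# Hypothetical values: 25 units free out of 100.
print(commitlog_dir_used_ratio(25.0, 100.0))  # 0.75
```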

The metrics below ship with Alibaba Cloud's official configuration and are specific to Alibaba Cloud.


| Metric | Category | Description |
| --- | --- | --- |
| rocketmq_broker_tps | Broker | Number of messages the broker produces per second |
| rocketmq_broker_qps | Broker | Number of messages the broker consumes per second |
| rocketmq_producer_tps | Producer | Number of messages produced per second per topic |
| rocketmq_producer_message_size | Producer | Size of the messages produced per second by a topic (in bytes) |
| rocketmq_producer_offset | Producer | Production progress (offset) of a topic |
| rocketmq_consumer_tps | Consumer Groups | Number of messages consumed per second by a consumer group |
| rocketmq_consumer_message_size | Consumer Groups | Size of the messages consumed per second by a consumer group (in bytes) |
| rocketmq_consumer_offset | Consumer Groups | Consumption progress (offset) of a consumer group |
| rocketmq_group_get_latency | Consumer Groups | Consumer latency on a topic for one queue |
| rocketmq_group_get_latency_by_storetime | Consumer Groups | Consumption delay time of a consumer group |
| rocketmq_message_accumulation | Consumer Groups | How far the consumer offset lags behind |
| rocketmq_client_consume_fail_msg_count | Consumer | Number of messages that failed to be consumed in one hour |
| rocketmq_client_consume_fail_msg_tps | Consumer | Number of messages that fail to be consumed per second |
| rocketmq_client_consume_ok_msg_tps | Consumer | Number of messages consumed successfully per second |
| rocketmq_client_consume_rt | Consumer | Average time to consume each message |
| rocketmq_client_consumer_pull_rt | Consumer | Average time to pull each message |
| rocketmq_client_consumer_pull_tps | Consumer | Number of messages pulled by the client per second |
| rocketmq_brokeruntime_pmdt_0to10ms | Broker | Number of put-message requests the broker responds to within 0 to 10 ms |
| rocketmq_brokeruntime_pmdt_10to50ms | Broker | Number of put-message requests the broker responds to within 10 to 50 ms |
| rocketmq_brokeruntime_pmdt_50to100ms | Broker | Number of put-message requests the broker responds to within 50 to 100 ms |
| rocketmq_brokeruntime_pmdt_100to200ms | Broker | Number of put-message requests the broker responds to within 100 to 200 ms |
| rocketmq_brokeruntime_pmdt_200to500ms | Broker | Number of put-message requests the broker responds to within 200 to 500 ms |
| rocketmq_brokeruntime_pmdt_500to1s | Broker | Number of put-message requests the broker responds to within 500 ms to 1 s |
| rocketmq_brokeruntime_pmdt_1to2s | Broker | Number of put-message requests the broker responds to within 1 to 2 s |
| rocketmq_brokeruntime_pmdt_2to3s | Broker | Number of put-message requests the broker responds to within 2 to 3 s |
| rocketmq_brokeruntime_pmdt_3to4s | Broker | Number of put-message requests the broker responds to within 3 to 4 s |
| rocketmq_brokeruntime_pmdt_4to5s | Broker | Number of put-message requests the broker responds to within 4 to 5 s |
| rocketmq_brokeruntime_pmdt_5to10s | Broker | Number of put-message requests the broker responds to within 5 to 10 s |
| rocketmq_brokeruntime_pmdt_10stomore | Broker | Number of put-message requests the broker takes 10 s or more to respond to |
| rocketmq_brokeruntime_query_threadpoolqueue_headwait_timemills | Broker | Wait time of the head element in the query thread pool queue |
| rocketmq_brokeruntime_pull_threadpoolqueue_headwait_timemill | Broker | Wait time of the head element in the pull thread pool queue |
| rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills | Broker | Wait time of the head element in the send thread pool queue |
| rocketmq_brokeruntime_commitlog_disk_ratio | Broker | Broker commit log disk ratio |
| rocketmq_brokeruntime_consumequeue_disk_ratio | Broker | Broker consume queue disk ratio |
| rocketmq_brokeruntime_commitlogdir_capacity_free | Broker | Free capacity of the broker commit log directory |
| rocketmq_brokeruntime_commitlogdir_capacity_total | Broker | Total capacity of the broker commit log directory |
| rocketmq_brokeruntime_commitlog_maxoffset | Broker | Broker commit log max offset |
| rocketmq_brokeruntime_msg_put_total_today_now | Broker | Total messages put to the broker today, up to now |
| rocketmq_brokeruntime_msg_gettotal_today_now | Broker | Total messages fetched from the broker today, up to now |
| rocketmq_brokeruntime_dispatch_behind_bytes | Broker | Bytes by which broker dispatch is behind |
| rocketmq_brokeruntime_put_message_size_total | Broker | Total size of the messages put to the broker |
| rocketmq_brokeruntime_put_message_average_size | Broker | Average size of the messages put to the broker |
| rocketmq_brokeruntime_msg_gettotal_yesterdaymorning | Broker | Broker msg get total, yesterday morning |
| rocketmq_brokeruntime_msg_gettotal_todaymorning | Broker | Broker msg get total, today morning |
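The `rocketmq_brokeruntime_pmdt_*` metrics split put-message response times into latency buckets, so one way to read them is as a share of slow writes. The sketch below is only an illustration under assumptions: it treats the bucket values as counts observed over the same window, and the function name, the `counts` dictionary and the 1 s threshold are all hypothetical.

```python
# Bucket metric names from the table above, paired with the lower bound of
# each bucket in seconds (derived from the names themselves).
PMDT_BUCKETS = [
    ("rocketmq_brokeruntime_pmdt_0to10ms", 0.0),
    ("rocketmq_brokeruntime_pmdt_10to50ms", 0.01),
    ("rocketmq_brokeruntime_pmdt_50to100ms", 0.05),
    ("rocketmq_brokeruntime_pmdt_100to200ms", 0.1),
    ("rocketmq_brokeruntime_pmdt_200to500ms", 0.2),
    ("rocketmq_brokeruntime_pmdt_500to1s", 0.5),
    ("rocketmq_brokeruntime_pmdt_1to2s", 1.0),
    ("rocketmq_brokeruntime_pmdt_2to3s", 2.0),
    ("rocketmq_brokeruntime_pmdt_3to4s", 3.0),
    ("rocketmq_brokeruntime_pmdt_4to5s", 4.0),
    ("rocketmq_brokeruntime_pmdt_5to10s", 5.0),
    ("rocketmq_brokeruntime_pmdt_10stomore", 10.0),
]


def slow_put_fraction(counts, threshold_seconds):
    """Share of put-message responses in buckets starting at or above the threshold."""
    total = sum(counts.get(name, 0) for name, _ in PMDT_BUCKETS)
    if total == 0:
        return 0.0
    slow = sum(
        counts.get(name, 0)
        for name, lower_bound in PMDT_BUCKETS
        if lower_bound >= threshold_seconds
    )
    return slow / total


# Hypothetical bucket counts for one broker.
counts = {
    "rocketmq_brokeruntime_pmdt_0to10ms": 900,
    "rocketmq_brokeruntime_pmdt_10to50ms": 90,
    "rocketmq_brokeruntime_pmdt_1to2s": 8,
    "rocketmq_brokeruntime_pmdt_10stomore": 2,
}
print(slow_put_fraction(counts, 1.0))  # 0.01
```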


For more metrics, see the official source:

https://github.com/apache/rocketmq-exporter/blob/master/rocketmq_exporter_overview.json
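To see which metrics a dashboard file actually queries, one option is to walk the JSON and collect the metric names from its query expressions. The sketch below assumes the dashboard stores its Prometheus queries under `expr` keys, as Grafana dashboard exports commonly do; the inline sample stands in for the downloaded file and is hypothetical, as are the function names.

```python
import json
import re

# Matches identifiers that start with the exporter's "rocketmq_" prefix.
METRIC_RE = re.compile(r"\b(rocketmq_[a-zA-Z0-9_]+)")


def collect_exprs(node):
    """Yield every string stored under an "expr" key anywhere in the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                yield value
            else:
                yield from collect_exprs(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_exprs(item)


def metric_names(dashboard):
    """Return the sorted, de-duplicated rocketmq_* metric names used in a dashboard."""
    names = set()
    for expr in collect_exprs(dashboard):
        names.update(METRIC_RE.findall(expr))
    return sorted(names)


# Hypothetical stand-in for the contents of the downloaded dashboard JSON.
sample = json.loads("""
{"panels": [
  {"targets": [{"expr": "sum(rocketmq_producer_tps)"}]},
  {"panels": [{"targets": [{"expr": "rocketmq_broker_tps + rocketmq_broker_qps"}]}]}
]}
""")
print(metric_names(sample))
# ['rocketmq_broker_qps', 'rocketmq_broker_tps', 'rocketmq_producer_tps']
```

For a real file, replace the inline sample with `json.load(open(path))`, where `path` points at the downloaded dashboard.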

Someone has built a more practical dashboard; just download the JSON:

The main alert rules are:

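  • A broker node is down
  • Disk space is running low
  • Broker busy alert
  • Message backlog
  • The broker takes too long to write messages (possibly heavy I/O pressure on the broker, or insufficient memory)
  • The cluster's send TPS spikes (someone may have suddenly load-tested the cluster or bulk-written a batch of messages)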
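The six rules above can be sketched as plain threshold checks over one snapshot of values. Everything specific here is an assumption made for illustration: the snapshot keys, the mapping of each rule to a metric (named in the comments) and every threshold value are hypothetical, and a real setup would express these as Prometheus alerting rules tuned to the cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All defaults are placeholder values, not recommendations.
    max_disk_ratio: float = 0.8            # rocketmq_brokeruntime_commitlog_disk_ratio
    max_send_queue_wait_ms: float = 200.0  # rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills
    max_accumulation: float = 10000.0      # rocketmq_message_accumulation
    max_slow_put_count: float = 0.0        # sum of the slow rocketmq_brokeruntime_pmdt_* buckets
    max_tps_growth: float = 3.0            # current rocketmq_broker_tps divided by a baseline


def evaluate_alerts(sample, thresholds=Thresholds()):
    """Return the names of the alert rules that fire for one snapshot."""
    alerts = []
    if not sample["broker_up"]:
        alerts.append("broker down")
    if sample["commitlog_disk_ratio"] > thresholds.max_disk_ratio:
        alerts.append("disk space low")
    if sample["send_queue_headwait_ms"] > thresholds.max_send_queue_wait_ms:
        alerts.append("broker busy")
    if sample["message_accumulation"] > thresholds.max_accumulation:
        alerts.append("message backlog")
    if sample["slow_put_count"] > thresholds.max_slow_put_count:
        alerts.append("slow message writes")
    baseline = sample["baseline_tps"]
    if baseline > 0 and sample["broker_tps"] / baseline > thresholds.max_tps_growth:
        alerts.append("send TPS surge")
    return alerts


# Hypothetical snapshot: backlog and a TPS spike, everything else healthy.
snapshot = {
    "broker_up": True,
    "commitlog_disk_ratio": 0.55,
    "send_queue_headwait_ms": 12.0,
    "message_accumulation": 250000.0,
    "slow_put_count": 0.0,
    "broker_tps": 8000.0,
    "baseline_tps": 2000.0,
}
print(evaluate_alerts(snapshot))  # ['message backlog', 'send TPS surge']
```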

Below are some practical notes for Alibaba Cloud.

On Alibaba Cloud, you need to add the RocketMQ monitoring component in Prometheus monitoring.


Note that for an Alibaba Cloud private cloud or a similar deployment, you should check whether the image has been pushed.

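Once configured, a default monitoring dashboard is included. It is essentially the official template, just using Alibaba Cloud's environment variables.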
