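We have been running Kafka for quite a while, but I have yet to find an open-source tool that monitors it at a fine enough granularity. That need not stop us from summarizing how to monitor Kafka. The most official approach is to watch the metrics that Kafka itself exposes, and that is what we do today: we use jmxtrans to collect the metric values. Kafka monitoring falls into three areas: broker monitoring, consumer monitoring, and producer monitoring. For all three, you can browse the concrete MBeans with jconsole.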

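    Before looking at the MBeans themselves, a few words on Metrics (Kafka uses Yammer Metrics). Metrics is a measurement toolkit that provides several metric types for tracking a program's indicators; by embedding Metrics calls in Java code, you can easily monitor any aspect of your business logic. Metrics supports JMX and exposes its data that way by default, so developers can read the measurements over JMX with little effort. It also supports other reporters, such as CSV and console, and integrates with Ehcache, Apache HttpClient, JDBI, Jersey, Jetty, Log4J, Logback, the JVM, and others. Metrics can also be shipped to Ganglia or Graphite for graphical display. Metrics provides five metric types, plus a separate health-check module: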

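  1.  gauge: the simplest metric. It typically reports an instantaneous value, such as the number of jobs currently in the pending state.
  2.  counter: a special case of a gauge that maintains a count, which you change with inc() and dec(). It is typically used to record how many times an event has occurred or how many requests have been received.
  3.  meters: measure the average rate of events over a period of time (for example, requests per second). The results include the total count, the mean rate per second, and the average rates over the last 1, 5, and 15 minutes.
  4.  histograms: describe the distribution of a stream of values: maximum, minimum, mean, median, and percentiles (75%, 90%, 95%, 98%, 99%, and 99.9%). For example, use a histogram to track the distribution of response times for a page.
  5.  timers: measure how long a block of code takes to run and how those durations are distributed. They are implemented on top of histograms and meters.
  6.  health checks: verify that the application, its submodules, or the modules it depends on are running properly. This module is separate from metrics-core; to use it, import the metrics-healthchecks package.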
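
To make the metric types above more concrete, here is a minimal Python sketch of a counter, a histogram, and a timer. It is an illustration only, not the Yammer Metrics API: the class and method names are hypothetical, the histogram keeps every value in memory instead of sampling, and the timer tracks a call count where the real library uses a meter.

```python
import math
import time


class Counter:
    """Counter: an integer that callers increment or decrement."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Histogram:
    """Histogram: records values and reports their distribution."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, q):
        # Nearest-rank percentile over everything recorded so far.
        ordered = sorted(self.values)
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self):
        return {
            "min": min(self.values),
            "max": max(self.values),
            "mean": sum(self.values) / len(self.values),
            "median": self.percentile(50),
            "p95": self.percentile(95),
        }


class Timer:
    """Timer: a histogram of durations plus a count of timed calls."""

    def __init__(self):
        self.durations = Histogram()
        self.calls = Counter()

    def time(self, func, *args):
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            # Record the duration even if func raises.
            self.durations.update(time.perf_counter() - start)
            self.calls.inc()
```

A caller would wrap the code to be measured, for example `timer.time(handle_request, request)`, and read `timer.durations.snapshot()` periodically.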


Broker monitoring

    broker metrics

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions	Number of under-replicated partitions (|ISR| < |all replicas|). Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=OfflinePartitionsCount	Number of partitions that don’t have an active leader and are hence not writable or readable. Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=ActiveControllerCount	Number of active controllers in the cluster. Alert if value is anything other than 1.
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec	Aggregate incoming message rate.
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec	Aggregate incoming byte rate.
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec	Aggregate outgoing byte rate.
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce or FetchConsumer or FetchFollower}	Request rate.
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs	Log flush rate and time.
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs	Leader election rate and latency.
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec	Unclean leader election rate.
kafka.server:type=ReplicaManager,name=PartitionCount	Number of partitions on this broker. This should be mostly even across all brokers.
kafka.server:type=ReplicaManager,name=LeaderCount	Number of leaders on this broker. This should be mostly even across all brokers. If not, set auto.leader.rebalance.enable to true on all brokers in the cluster.
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec	If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec	When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR.
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica	Maximum lag in messages between the follower and leader replicas. This is controlled by the replica.lag.max.messages config.
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)	Lag in number of messages per follower replica. This is useful to know if the replica is slow or has stopped replicating from the leader.
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce or FetchConsumer or FetchFollower}	Total time in ms to serve the specified request.
kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize	Number of requests waiting in the producer purgatory. This should be non-zero if acks=-1 is used on the producer.
kafka.server:type=FetchRequestPurgatory,name=PurgatorySize	Number of requests waiting in the fetch purgatory. This is high if consumers use a large value for fetch.wait.max.ms.
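
The alert conditions stated in the table can be sketched as a small rule check. This is a sketch under assumptions: the `broker_alerts` function, the `broker` label, and the dict layout are hypothetical, and the values are assumed to have been collected already. It also assumes each broker reports 0 or 1 for ActiveControllerCount, so the cluster-wide check is done on the sum.

```python
def broker_alerts(samples):
    """Return alert strings for a list of per-broker metric samples.

    Each sample is a dict holding a hypothetical "broker" label plus the
    gauge values named in the table above.
    """
    alerts = []
    for s in samples:
        if s["UnderReplicatedPartitions"] > 0:
            alerts.append("%s: %d under-replicated partitions"
                          % (s["broker"], s["UnderReplicatedPartitions"]))
        if s["OfflinePartitionsCount"] > 0:
            alerts.append("%s: %d offline partitions"
                          % (s["broker"], s["OfflinePartitionsCount"]))
    # Assumption: each broker reports 0 or 1, so the sum should be exactly 1.
    controllers = sum(s["ActiveControllerCount"] for s in samples)
    if controllers != 1:
        alerts.append("cluster: %d active controllers, expected 1" % controllers)
    return alerts
```

An empty list means none of the three alert conditions fired for this collection cycle.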

    Sample jmxtrans-agent.xml from production

<jmxtrans-agent>
    <queries>

        <!-- Message in rate -->
        <query objectName="kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec" attributes="MeanRate,OneMinuteRate" resultAlias="Kafka.BrokerTopicMetrics.MessagesInPerSec.#attribute#"/>

        <!-- Byte in rate -->
        <query objectName="kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec" attributes="MeanRate,OneMinuteRate" resultAlias="Kafka.BrokerTopicMetrics.BytesInPerSec.#attribute#"/>

        <!-- Byte out rate -->
        <query objectName="kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec" attributes="MeanRate,OneMinuteRate" resultAlias="Kafka.BrokerTopicMetrics.BytesOutPerSec.#attribute#"/>

        <!-- Log flush rate and time -->
        <query objectName="kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs" attributes="OneMinuteRate" resultAlias="Kafka.LogFlushStats.FlushRateAndTimeMs.#attribute#"/>

        <!-- Request rate -->
        <query objectName="kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce" attributes="OneMinuteRate" resultAlias="Kafka.RequestsPerSec.Produce.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer" attributes="OneMinuteRate" resultAlias="Kafka.RequestsPerSec.FetchConsumer.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower" attributes="OneMinuteRate" resultAlias="Kafka.RequestsPerSec.FetchFollower.#attribute#"/>

        <!-- Log flush rate and time -->
        <query objectName="kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs" resultAlias="Kafka.LogFlushStats.LogFlushRateAndTimeMs.#attribute#"/>

        <!-- Partition counts -->
        <query objectName="kafka.server:type=ReplicaManager,name=PartitionCount" attribute="Value" resultAlias="Kafka.Topic.PartitionCount.#attribute#"/>

        <!-- Number of under-replicated partitions (|ISR| < |all replicas|) -->
        <query objectName="kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions" resultAlias="Kafka.ReplicaManager.UnderReplicatedPartitions.#attribute#"/>

        <!-- Is controller active on broker -->
        <query objectName="kafka.controller:type=KafkaController,name=ActiveControllerCount" resultAlias="Kafka.KafkaController.ActiveControllerCount.#attribute#"/>

        <!-- Leader election rate -->
        <query objectName="kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.ControllerStats.LeaderElectionRateAndTimeMs.#attribute#"/>

        <!-- Leader replica counts -->
        <query objectName="kafka.server:type=ReplicaManager,name=LeaderCount" resultAlias="Kafka.ReplicaManager.LeaderCount.#attribute#"/>

        <!-- Max lag in messages btw follower and leader replicas -->
        <query objectName="kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica" resultAlias="Kafka.ReplicaFetcherManager.MaxLag.#attribute#"/>
        <!-- Lag in messages per follower replica. JMX object names take * wildcards, not regular expressions; %key% in resultAlias expands to that key's value. -->
        <query objectName="kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=*,topic=*,partition=*" resultAlias="Kafka.ConsumerLag.%clientId%.%topic%.%partition%.#attribute#"/>

        <!-- Requests waiting in the producer purgatory -->
        <query objectName="kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize" resultAlias="Kafka.ProducerRequestPurgatory.PurgatorySize.#attribute#"/>
        <!-- Requests waiting in the fetch purgatory -->
        <query objectName="kafka.server:type=FetchRequestPurgatory,name=PurgatorySize" resultAlias="Kafka.FetchRequestPurgatory.PurgatorySize.#attribute#"/>

        <!-- Request total time -->
        <query objectName="kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.TotalTimeMs.Produce.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.TotalTimeMs.FetchConsumer.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.TotalTimeMs.FetchFollower.#attribute#"/>

        <!--Time the request waiting in the request queue -->
        <query objectName="kafka.network:type=RequestMetrics,name=QueueTimeMs,request=Produce" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.QueueTimeMs.Produce.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=QueueTimeMs,request=FetchConsumer" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.QueueTimeMs.FetchConsumer.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=QueueTimeMs,request=FetchFollower" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.QueueTimeMs.FetchFollower.#attribute#"/>

        <!-- Time the request being processed at the leader -->
        <query objectName="kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.LocalTimeMs.Produce.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.LocalTimeMs.FetchConsumer.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.LocalTimeMs.FetchFollower.#attribute#"/>

        <!-- Time the request waits for the follower -->
        <query objectName="kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.RemoteTimeMs.Produce.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.RemoteTimeMs.FetchConsumer.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.RemoteTimeMs.FetchFollower.#attribute#"/>

        <!-- Time to send the response -->
        <query objectName="kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.ResponseSendTimeMs.Produce.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.ResponseSendTimeMs.FetchConsumer.#attribute#"/>
        <query objectName="kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.ResponseSendTimeMs.FetchFollower.#attribute#"/>

        <!-- Number of messages the consumer lags behind the producer by -->
        <query objectName="kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=*" resultAlias="Kafka.ConsumerFetcherManager.MaxLag.%clientId%.#attribute#"/>

        <!-- The average fraction of time the network processors are idle -->
        <query objectName="kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent" resultAlias="Kafka.SocketServer.NetworkProcessorAvgIdlePercent.#attribute#"/>

        <!-- The average fraction of time the request handler threads are idle -->
        <query objectName="kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent" attributes="OneMinuteRate" resultAlias="Kafka.KafkaRequestHandlerPool.RequestHandlerAvgIdlePercent.#attribute#"/>

        <!-- Quota metrics per client-id -->
        <query objectName="kafka.server:type=Produce,client-id=*" resultAlias="Kafka.Produce.%client-id%.#attribute#"/>
        <query objectName="kafka.server:type=Fetch,client-id=*" resultAlias="Kafka.Fetch.%client-id%.#attribute#"/>

        </queries>

        <!--
                <outputWriter class="org.jmxtrans.agent.RollingFileOutputWriter">
                <fileName>/tmp/roll-jmxing.log</fileName>
                <maxFileSize>1024</maxFileSize>
                <maxBackupIndex>10</maxBackupIndex>
        </outputWriter>
        -->

        <outputWriter class="org.jmxtrans.agent.FileOverwriterOutputWriter">
                <fileName>/tmp/jmxing.log</fileName>
                <showTimeStamp>false</showTimeStamp>
        </outputWriter>

        <collectIntervalInSeconds>60</collectIntervalInSeconds>
</jmxtrans-agent>
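
Once the agent is writing to /tmp/jmxing.log, a downstream script can read the values back. The sketch below assumes that, with showTimeStamp set to false, each line of the file has the form `<resultAlias> <value>` separated by a single space; that format is an assumption to verify against your own output file, and both function names are hypothetical.

```python
def parse_jmxtrans_output(lines):
    """Parse lines assumed to look like '<metric name> <value>' into a dict."""
    metrics = {}
    for line in lines:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        name, raw = parts
        try:
            metrics[name] = float(raw)
        except ValueError:
            continue  # non-numeric value, such as a string attribute
    return metrics


def load_metrics(path="/tmp/jmxing.log"):
    """Read the file written by FileOverwriterOutputWriter in the sample config."""
    with open(path) as handle:
        return parse_jmxtrans_output(handle)
```

Because the writer overwrites the file on every collection cycle (60 seconds in the sample), polling it at the same interval yields one fresh snapshot per cycle.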

 



Consumer monitoring

    Fetch Metrics: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

records-lag-max			The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
fetch-size-avg			The average number of bytes fetched per request.
fetch-size-max			The max number of bytes fetched per request.
bytes-consumed-rate		The average number of bytes consumed per second.
records-per-request-avg	The average number of records in each request.
records-consumed-rate	The average number of records consumed per second.
fetch-rate				The number of fetch requests per second.
fetch-latency-avg		The average time taken for a fetch request.
fetch-latency-max		The max time taken for a fetch request.
fetch-throttle-time-avg	The average throttle time in ms. When quotas are enabled, the broker may delay fetch requests in order to throttle a consumer which has exceeded its limit. This metric indicates how much throttling time has been added to fetch requests on average.
fetch-throttle-time-max	The maximum throttle time in ms.
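
Since a steadily increasing records-lag-max is the signal to watch, one simple way to detect it is to check whether the most recent samples rise without interruption. The function name and the window of five samples are assumptions chosen for illustration; a production check would likely smooth the series or fit a trend instead.

```python
def lag_keeps_growing(samples, window=5):
    """Return True if the last `window` lag samples are strictly increasing."""
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough history to judge a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

With one sample per collection cycle, `window=5` corresponds to five consecutive cycles of rising lag.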

    Topic-level Fetch Metrics: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

fetch-size-avg			The average number of bytes fetched per request for a specific topic.
fetch-size-max			The maximum number of bytes fetched per request for a specific topic.
bytes-consumed-rate		The average number of bytes consumed per second for a specific topic.
records-per-request-avg	The average number of records in each request for a specific topic.
records-consumed-rate	The average number of records consumed per second for a specific topic.
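
The MBean name patterns in this section are regular expressions whose capture groups hold the client id and topic, with `\w` being the word-character class. A minimal sketch of applying the topic-level pattern to an object name string follows; the function name is hypothetical, and it assumes the key properties appear in the order shown in the pattern.

```python
import re

# Topic-level fetch metrics pattern; group 1 is the client id, group 2 the topic.
TOPIC_FETCH_MBEAN = re.compile(
    r"kafka\.consumer:type=consumer-fetch-manager-metrics,"
    r"client-id=([-.\w]+),topic=([-.\w]+)$"
)


def parse_topic_fetch_mbean(object_name):
    """Return the client id and topic from a matching MBean name, else None."""
    match = TOPIC_FETCH_MBEAN.match(object_name)
    if match is None:
        return None
    client_id, topic = match.groups()
    return {"client-id": client_id, "topic": topic}
```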

    Consumer Group Metrics: kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

assigned-partitions		The number of partitions currently assigned to this consumer.
commit-latency-avg		The average time taken for a commit request.
commit-latency-max		The max time taken for a commit request.
commit-rate				The number of commit calls per second.
join-rate				The number of group joins per second. Group joining is the first phase of the rebalance protocol. A large value indicates that the consumer group is unstable and will likely be coupled with increased lag.
join-time-avg			The average time taken for a group rejoin. This value can get as high as the configured session timeout for the consumer, but should usually be lower.
join-time-max			The max time taken for a group rejoin. This value should not get much higher than the configured session timeout for the consumer.
sync-rate				The number of group syncs per second. Group synchronization is the second and last phase of the rebalance protocol. Similar to join-rate, a large value indicates group instability.
sync-time-avg			The average time taken for a group sync.
sync-time-max			The max time taken for a group sync.
heartbeat-rate			The average number of heartbeats per second. After a rebalance, the consumer sends heartbeats to the coordinator to keep itself active in the group. You can control this using the heartbeat.interval.ms setting for the consumer. You may see a lower rate than configured if the processing loop is taking more time to handle message batches. Usually this is OK as long as you see no increase in the join rate.
heartbeat-response-time-max	The max time taken to receive a response to a heartbeat request.
last-heartbeat-seconds-ago	The number of seconds since the last coordinator heartbeat.

    Global Request Metrics: kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)

request-latency-avg   The average request latency in ms.
request-latency-max	  The maximum request latency in ms.
request-rate		  The average number of requests sent per second.
response-rate		  The average number of responses received per second.
incoming-byte-rate	  The average number of incoming bytes received per second from all servers.
outgoing-byte-rate	  The average number of outgoing bytes sent per second to all servers.

    Global Connection Metrics: kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)

connection-count		    The current number of active connections.
connection-creation-rate	New connections established per second in the window.
connection-close-rate		Connections closed per second in the window.
io-ratio					The fraction of time the I/O thread spent doing I/O.
io-time-ns-avg				The average length of time for I/O per select call in nanoseconds.
io-wait-ratio				The fraction of time the I/O thread spent waiting.
select-rate					Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg			The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.

    Per-Broker Metrics: kafka.consumer:type=consumer-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-size-max	The maximum size of any request sent in the window for a broker.
request-size-avg	The average size of all requests in the window for a broker.
request-rate		The average number of requests sent per second to the broker.
response-rate		The average number of responses received per second from the broker.
incoming-byte-rate	The average number of bytes received per second from the broker.
outgoing-byte-rate	The average number of bytes sent per second to the broker.

 



Producer monitoring

    Global Request Metrics: kafka.producer:type=producer-metrics,client-id=([-.\w]+)

request-latency-avg		The average request latency in ms.
request-latency-max		The maximum request latency in ms.
request-rate			The average number of requests sent per second.
response-rate			The average number of responses received per second.
incoming-byte-rate		The average number of incoming bytes received per second from all servers.
outgoing-byte-rate		The average number of outgoing bytes sent per second to all servers.

    Global Connection Metrics: kafka.producer:type=producer-metrics,client-id=([-.\w]+)

connection-count			The current number of active connections.
connection-creation-rate	New connections established per second in the window.
connection-close-rate		Connections closed per second in the window.
io-ratio					The fraction of time the I/O thread spent doing I/O.
io-time-ns-avg				The average length of time for I/O per select call in nanoseconds.
io-wait-ratio				The fraction of time the I/O thread spent waiting.
select-rate					Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg			The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.

    Per-Broker Metrics: kafka.producer:type=producer-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

request-size-max	The maximum size of any request sent in the window for a broker.
request-size-avg	The average size of all requests in the window for a broker.
request-rate		The average number of requests sent per second to the broker.
response-rate		The average number of responses received per second from the broker.
incoming-byte-rate	The average number of bytes received per second from the broker.
outgoing-byte-rate	The average number of bytes sent per second to the broker.

    Per-Topic Metrics: kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)

byte-rate			The average number of bytes sent per second for a topic.
record-send-rate	The average number of records sent per second for a topic.
compression-rate	The average compression rate of record batches for a topic.
record-retry-rate	The average per-second number of retried record sends for a topic.
record-error-rate	The average per-second number of record sends that resulted in errors for a topic.
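
The per-topic rates above are easier to alert on when combined into a ratio. The sketch below divides record-error-rate by record-send-rate for one topic; the function name, the dict layout, and the choice of this ratio as a health signal are assumptions for illustration, not something Kafka defines.

```python
def producer_error_ratio(topic_metrics):
    """Return record-error-rate as a fraction of record-send-rate for one topic."""
    send_rate = topic_metrics["record-send-rate"]
    error_rate = topic_metrics["record-error-rate"]
    if send_rate <= 0:
        return 0.0  # nothing was sent in the window, so no ratio to report
    return error_rate / send_rate
```

A threshold on this ratio (the exact value is a local decision) stays meaningful as traffic grows, whereas a fixed threshold on the raw error rate does not.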