Apache Spark 1.3.0 introduced the Direct API, which uses Kafka's low-level (simple consumer) API to read data directly from the Kafka cluster and keeps track of the consumed offsets inside Spark Streaming itself. This design achieves zero data loss and is more efficient than the Receiver-based approach. However, because Spark Streaming maintains the Kafka read offsets on its own and never writes them back to Zookeeper, offset-based Kafka monitoring tools (for example Kafka Web Console and KafkaOffsetMonitor) stop working. This article addresses that problem, so that a Spark Streaming program automatically updates the Kafka offsets stored in Zookeeper after each batch of data is received.

  From the official Spark documentation we know that the Kafka offset information Spark maintains internally is exposed through the offsetRanges field of the HasOffsetRanges trait, and we can read it inside the Spark Streaming program:

val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
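
Each element of offsetRanges is an OffsetRange that records, for one partition of one topic, the range of offsets consumed by the current batch. As a minimal sketch (the println is purely illustrative), the ranges can be inspected like this inside foreachRDD:

// Inside foreachRDD of the direct stream: print what this batch consumed,
// one line per topic partition.
rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { o =>
  println(s"topic=${o.topic} partition=${o.partition} " +
    s"from=${o.fromOffset} until=${o.untilOffset}")
}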


This gives us the consumption information for every partition. We only need to iterate over offsetsList and push that information to Zookeeper in order to update the Kafka consumer offsets. The complete code snippet is as follows:


val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

messages.foreachRDD(rdd => {
  // Offsets consumed by this batch, one OffsetRange per topic partition.
  val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val kc = new KafkaCluster(kafkaParams)
  for (offsets <- offsetsList) {
    val topicAndPartition = TopicAndPartition("test-topic", offsets.partition)
    // Commit the end offset of this batch for the consumer group (here taken from args(0)).
    val o = kc.setConsumerOffsets(args(0), Map((topicAndPartition, offsets.untilOffset)))
    if (o.isLeft) {
      println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
    }
  }
})
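
The snippet above calls setConsumerOffsets once per partition. Since the method accepts a Map with several TopicAndPartition entries, the commits can also be batched into a single call per RDD. The following is a minimal sketch of that variation; the group id "spark-group" is only a placeholder:

messages.foreachRDD(rdd => {
  val kc = new KafkaCluster(kafkaParams)
  // Build one map containing the end offset of every partition in this batch.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges.map { o =>
    TopicAndPartition(o.topic, o.partition) -> o.untilOffset
  }.toMap
  // A single commit per batch instead of one per partition.
  val result = kc.setConsumerOffsets("spark-group", offsets)
  if (result.isLeft) {
    println(s"Error updating the offsets to Zookeeper: ${result.left.get}")
  }
})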



The KafkaCluster class is a utility class that handles the connection to the Kafka cluster. For each partition of a Kafka topic we build its target offset with Map((topicAndPartition, offsets.untilOffset)) and then call KafkaCluster's setConsumerOffsets method to write that information into Zookeeper. Once the offsets are updated there, tools such as KafkaOffsetMonitor can again track the consumption of the corresponding topics. The figure below shows what KafkaOffsetMonitor displays:



[Figure: KafkaOffsetMonitor showing the consumer offsets of the topic's Kafka partitions]

As the figure shows, KafkaOffsetMonitor can now see the consumption progress of the relevant Kafka partitions, which is very useful for monitoring the whole Spark Streaming application: at any moment we can tell how fast Spark is reading. The complete code of the KafkaCluster utility class is listed below:

package org.apache.spark.streaming.kafka

import kafka.api.OffsetCommitRequest
import kafka.common.{ErrorMapping, OffsetMetadataAndError, TopicAndPartition}
import kafka.consumer.SimpleConsumer
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster.SimpleConsumerConfig

import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import scala.util.control.NonFatal

class KafkaCluster(val kafkaParams: Map[String, String]) extends Serializable {
  type Err = ArrayBuffer[Throwable]

  @transient private var _config: SimpleConsumerConfig = null

  def config: SimpleConsumerConfig = this.synchronized {
    if (_config == null) {
      _config = SimpleConsumerConfig(kafkaParams)
    }
    _config
  }

  // Commit the given offsets (without metadata) for a consumer group.
  def setConsumerOffsets(groupId: String,
                         offsets: Map[TopicAndPartition, Long]
                          ): Either[Err, Map[TopicAndPartition, Short]] = {
    setConsumerOffsetMetadata(groupId, offsets.map { kv =>
      kv._1 -> OffsetMetadataAndError(kv._2)
    })
  }

  def setConsumerOffsetMetadata(groupId: String,
                                metadata: Map[TopicAndPartition, OffsetMetadataAndError]
                                 ): Either[Err, Map[TopicAndPartition, Short]] = {
    var result = Map[TopicAndPartition, Short]()
    val req = OffsetCommitRequest(groupId, metadata)
    val errs = new Err
    val topicAndPartitions = metadata.keySet
    withBrokers(Random.shuffle(config.seedBrokers), errs) { consumer =>
      val resp = consumer.commitOffsets(req)
      val respMap = resp.requestInfo
      val needed = topicAndPartitions.diff(result.keySet)
      needed.foreach { tp: TopicAndPartition =>
        respMap.get(tp).foreach { err: Short =>
          if (err == ErrorMapping.NoError) {
            result += tp -> err
          } else {
            errs.append(ErrorMapping.exceptionFor(err))
          }
        }
      }
      // Stop trying further brokers once every partition has been committed.
      if (result.keys.size == topicAndPartitions.size) {
        return Right(result)
      }
    }
    val missing = topicAndPartitions.diff(result.keySet)
    errs.append(new SparkException(s"Couldn't set offsets for ${missing}"))
    Left(errs)
  }

  // Try the given brokers in turn, collecting any failures into errs.
  def withBrokers(brokers: Iterable[(String, Int)], errs: Err)
                         (fn: SimpleConsumer => Any): Unit = {
    brokers.foreach { hp =>
      var consumer: SimpleConsumer = null
      try {
        consumer = connect(hp._1, hp._2)
        fn(consumer)
      } catch {
        case NonFatal(e) =>
          errs.append(e)
      } finally {
        if (consumer != null) {
          consumer.close()
        }
      }
    }
  }

  def connect(host: String, port: Int): SimpleConsumer =
    new SimpleConsumer(host, port, config.socketTimeoutMs,
      config.socketReceiveBufferBytes, config.clientId)
}
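
For reference, here is a minimal sketch of how this class can be driven outside of a streaming job; the broker list, topic name, group id and offset value are placeholder assumptions to be replaced with values from your own cluster:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.KafkaCluster

// Placeholder connection settings; adjust to your cluster.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "group.id"             -> "spark-group"
)
val kc = new KafkaCluster(kafkaParams)

// Commit offset 12345 for partition 0 of "test-topic" under the group "spark-group".
kc.setConsumerOffsets("spark-group", Map(TopicAndPartition("test-topic", 0) -> 12345L)) match {
  case Right(_)   => println("Offsets committed to Zookeeper")
  case Left(errs) => errs.foreach(e => println(s"Offset commit failed: $e"))
}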