As most people know, since Spark 1.3 KafkaUtils offers two ways to create a DStream: the older createStream method and the newer createDirectStream method. The official documentation (http:///docs/latest/streaming-kafka-integration.html) already covers the trade-offs in detail. In short, createDirectStream performs better: the RDD partitions of the resulting DStream map one-to-one to the partitions of the Kafka topic, and messages are consumed directly from Kafka via the low-level API. The downside is that it no longer updates consumer offsets in ZooKeeper, so any monitoring tool that relies on ZooKeeper-based consumer offsets stops working.
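For reference, here is a minimal sketch of the two creation paths against the Spark 1.3 API. The ZooKeeper quorum, broker list, group id and topic name are placeholders, and ssc is assumed to be an existing StreamingContext:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def buildStreams(ssc: StreamingContext) = {
  // receiver-based API: offsets are tracked in ZooKeeper by the high-level consumer
  val receiverStream = KafkaUtils.createStream(
    ssc, "zk1:2181,zk2:2181", "my-consumer-group", Map("my-topic" -> 1))

  // direct API: one RDD partition per Kafka partition, no ZooKeeper offset updates
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("my-topic"))

  (receiverStream, directStream)
}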
The official docs only touch briefly on the fact that you can update the offsets in ZooKeeper yourself inside foreachRDD:
directKafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // offsetRanges.length = # of Kafka partitions being consumed
  ...
}
The corresponding exactly-once semantics are left for you to implement. The rough idea is: when the driver starts, first read the consumer offsets from ZooKeeper. createDirectStream has two overloads, and one of them lets you start consuming from arbitrary offsets. Part of the code looks like this:
def createDirectStream(implicit streamingConfig: StreamingConfig, kc: KafkaCluster) = {

  val extractors = streamingConfig.getExtractors()
  // read the offsets stored in ZooKeeper and start consuming from there
  val messages = {
    val kafkaPartitionsE = kc.getPartitions(streamingConfig.topicSet)
    if (kafkaPartitionsE.isLeft) throw new SparkException("get kafka partition failed:")
    val kafkaPartitions = kafkaPartitionsE.right.get
    val consumerOffsetsE = kc.getConsumerOffsets(streamingConfig.group, kafkaPartitions)
    if (consumerOffsetsE.isLeft) throw new SparkException("get kafka consumer offsets failed:")
    val consumerOffsets = consumerOffsetsE.right.get
    consumerOffsets.foreach {
      case (tp, n) => println("===================================" + tp.topic + "," + tp.partition + "," + n)
    }
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, consumerOffsets, (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
  }
  messages
}
There are a couple of problems here. If the consumer group is new, i.e. it is consuming for the first time, the group's offsets node does not yet exist in ZooKeeper, so it has to be initialized first. Alternatively, the offsets recorded in ZooKeeper may already be stale: because Kafka periodically cleans up old log segments, consuming directly from those offsets throws an OffsetOutOfRange exception, meaning the index files those offsets belong to can no longer be found. Both cases are handled as follows:
def setOrUpdateOffsets(implicit streamingConfig: StreamingConfig, kc: KafkaCluster): Unit = {
  streamingConfig.topicSet.foreach(topic => {
    println("current topic:" + topic)
    var hasConsumed = true
    val kafkaPartitionsE = kc.getPartitions(Set(topic))
    if (kafkaPartitionsE.isLeft) throw new SparkException("get kafka partition failed:")
    val kafkaPartitions = kafkaPartitionsE.right.get
    val consumerOffsetsE = kc.getConsumerOffsets(streamingConfig.group, kafkaPartitions)
    if (consumerOffsetsE.isLeft) hasConsumed = false
    if (hasConsumed) {
      // The group has consumed before. If the streaming job hits kafka.common.OffsetOutOfRangeException,
      // the offsets saved in ZooKeeper are stale: Kafka's retention cleanup has already deleted the
      // log segments containing them.
      // To detect this, compare the ZooKeeper consumerOffsets against leaderEarliestOffsets; if the
      // consumer offsets are smaller than leaderEarliestOffsets they are stale, and we overwrite the
      // consumer offsets with leaderEarliestOffsets.
      val leaderEarliestOffsets = kc.getEarliestLeaderOffsets(kafkaPartitions).right.get
      println(leaderEarliestOffsets)
      val consumerOffsets = consumerOffsetsE.right.get
      val flag = consumerOffsets.forall {
        case (tp, n) => n < leaderEarliestOffsets(tp).offset
      }
      if (flag) {
        println("consumer group:" + streamingConfig.group + " offsets are stale, resetting to leaderEarliestOffsets")
        val offsets = leaderEarliestOffsets.map {
          case (tp, offset) => (tp, offset.offset)
        }
        kc.setConsumerOffsets(streamingConfig.group, offsets)
      }
      else {
        println("consumer group:" + streamingConfig.group + " offsets are valid, no update needed")
      }
    }
    else {
      // The group has never consumed this topic: start from the latest offsets.
      val leaderLatestOffsets = kc.getLatestLeaderOffsets(kafkaPartitions).right.get
      println(leaderLatestOffsets)
      println("consumer group:" + streamingConfig.group + " has not consumed yet, initializing to leaderLatestOffsets")
      val offsets = leaderLatestOffsets.map {
        case (tp, offset) => (tp, offset.offset)
      }
      kc.setConsumerOffsets(streamingConfig.group, offsets)
    }
  })
}
Another problem came up here: a large backlog of messages can build up between the consumer offsets and the leader's latest offsets, so on the next start the first batch has to process a huge number of messages, and the resources given to spark-submit cannot keep up, causing the job to crash. So an extra configuration was added to spark-submit: --conf spark.streaming.kafka.maxRatePerPartition=10000. This caps the number of messages consumed per second from each partition of the topic, which spreads the first batch's backlog over multiple batches. To drain the delayed messages faster, increase the compute resources and raise this value.
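The same limit can also be set programmatically on the SparkConf instead of on the command line; a minimal sketch, where the app name and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("kafka-direct-demo")
  // cap messages pulled per second from each Kafka partition, same effect as the --conf flag above
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")

val ssc = new StreamingContext(sparkConf, Seconds(10))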
OK, with driver startup sorted out, the next task is updating the ZooKeeper offsets after the messages have been processed. Note that the update must happen after processing. Suppose you consumed the messages, updated the ZooKeeper offsets first, and only then processed the messages and saved the results elsewhere: if the processing step fails because of a bug, ZooKeeper has already been updated, so those messages were consumed but never processed, and after you fix the bug and resubmit, they will not be consumed again because the ZooKeeper offsets have already moved past them.
Thinking it through, this still does not achieve exactly-once semantics: the write to MongoDB and the ZooKeeper update are not one transaction, so if the MongoDB write succeeds but the ZooKeeper update fails, that batch will be recomputed on the next start. For UV this is harmless because addToSet deduplicates, but PV uses inc and would count that batch twice. With a reasonably short batch interval this is usually acceptable.
With these considerations in mind, part of the implementation looks like this:
def updateZKOffsets(rdd: RDD[(String, String)])(implicit streamingConfig: StreamingConfig, kc: KafkaCluster): Unit = {
  println("rdd not empty, update zk offset")
  val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  for (offsets <- offsetsList) {
    val topicAndPartition = TopicAndPartition(offsets.topic, offsets.partition)
    val o = kc.setConsumerOffsets(streamingConfig.group, Map((topicAndPartition, offsets.untilOffset)))
    if (o.isLeft) {
      println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
    }
  }
}

def processData(messages: InputDStream[(String, String)])(implicit streamingConfig: StreamingConfig, kc: KafkaCluster): Unit = {
  messages.foreachRDD(rdd => {
    if (!rdd.isEmpty()) {

      val datamodelRDD = streamingConfig.relation match {
        case "1" =>
          val (topic, _) = streamingConfig.topic_table_mapping
          val extractor = streamingConfig.getExtractor(topic)
          // single topic: filter and convert each message with its extractor
          val topicsSet = Set(topic)
          val datamodel = rdd.filter(msg => {
            extractor.filter(msg)
          }).map(msg => extractor.msgToRow(msg))
          datamodel
        case "2" =>
          val (topics, _) = streamingConfig.topic_table_mapping
          val extractors = streamingConfig.getExtractors(topics)
          val topicsSet = topics.split(",").toSet

          // Each Kafka message is a key-value pair; the key is only used for partitioning. To spread
          // messages across partitions, the collector uses keys of the form "<topic>|<random number>",
          // e.g. "rd_e_pal|20". Splitting on "|" and taking index 0 recovers the topic name, so messages
          // from the unioned topics can be told apart.
          val datamodel = rdd.filter(msg => {
            val keyValid = msg != null && msg._1 != null && msg._1.split("\\|").length == 2
            if (keyValid) {
              val topic = msg._1.split("\\|")(0)
              val (_, extractor) = extractors.find(p => {
                p._1.equalsIgnoreCase(topic)
              }).getOrElse(throw new RuntimeException("no extractor configured for topic: " + topic))
              // trim the trailing newline, otherwise the last field would carry a "\n"
              extractor.filter(msg._2.trim)
            }
            else {
              false
            }

          }).map {
            case (key, msgContent) =>
              val topic = key.split("\\|")(0)
              val (_, extractor) = extractors.find(p => {
                p._1.equalsIgnoreCase(topic)
              }).getOrElse(throw new RuntimeException("no extractor configured for topic: " + topic))
              extractor.msgToRow((key, msgContent))
          }
          datamodel
      }
      // process the messages first
      processRDD(datamodelRDD)
      // then update the offsets
      updateZKOffsets(rdd)
    }
  })
}

def processRDD(rdd: RDD[Row])(implicit streamingConfig: StreamingConfig) = {
  if (streamingConfig.targetType == "mongo") {
    val target = streamingConfig.getTarget().asInstanceOf[MongoTarget]
    if (!MongoDBClient.db.collectionExists(target.collection)) {
      println("create collection:" + target.collection)
      MongoDBClient.db.createCollection(target.collection, MongoDBObject("storageEngine" -> MongoDBObject("wiredTiger" -> MongoDBObject())))
      val coll = MongoDBClient.db(target.collection)
      // create the TTL index if configured
      if (target.ttlIndex) {
        val indexs = coll.getIndexInfo
        if (indexs.find(p => p.get("name") == "ttlIndex") == None) {
          coll.createIndex(MongoDBObject(target.ttlColumn -> 1), MongoDBObject("expireAfterSeconds" -> target.ttlExpire, "name" -> "ttlIndex"))
        }
      }
    }

  }

  val (_, table) = streamingConfig.topic_table_mapping
  val schema = streamingConfig.getTableSchema(table)

  // Get the singleton instance of SQLContext
  val sqlContext = HIVEContextSingleton.getInstance(rdd.sparkContext)

  // Convert RDD[Row] to a DataFrame using the configured schema
  val dataFrame = sqlContext.createDataFrame(rdd, schema)

  // Register as a temporary table
  dataFrame.registerTempTable(table)

  // Run the configured SQL against the table, e.g.:
  // select dt, hh(vtm) as hr, app_key, collect_set(device_id) as deviceids from rd_e_app_header where dt=20150401 and hh(vtm)='01' group by dt, hh(vtm), app_key limit 100;
  val results = sqlContext.sql(streamingConfig.sql)
  // results.show()
  streamingConfig.targetType match {
    case "mongo" => saveToMongo(results)
    case "show" => results.show()
  }

}

def saveToMongo(df: DataFrame)(implicit streamingConfig: StreamingConfig) = {
  val target = streamingConfig.getTarget().asInstanceOf[MongoTarget]
  val coll = MongoDBClient.db(target.collection)
  val result = df.collect()
  if (result.size > 0) {
    val bulkWrite = coll.initializeUnorderedBulkOperation
    result.foreach(row => {
      val id = row(target.pkIndex)
      // `name` is assumed here to be the Mongo field name configured on each column
      val setFields = target.columns.filter(p => p.op == "set").map(f => (f.name, row(f.index))).toArray
      val incFields = target.columns.filter(p => p.op == "inc").map(f => {
        (f.name, row(f.index).asInstanceOf[Long])
      }).toArray
      // obj = obj.++($addToSet(MongoDBObject("test" -> MongoDBObject("$each" -> Array(3, 4)), "test1" -> MongoDBObject("$each" -> Array(1, 2)))))
      var obj = MongoDBObject()
      var addToSetObj = MongoDBObject()
      target.columns.filter(p => p.op == "addToSet").foreach(col => {
        col.mType match {
          case "Int" =>
            addToSetObj = addToSetObj.++(col.name -> MongoDBObject("$each" -> row(col.index).asInstanceOf[ArrayBuffer[Int]]))
          case "Long" =>
            addToSetObj = addToSetObj.++(col.name -> MongoDBObject("$each" -> row(col.index).asInstanceOf[ArrayBuffer[Long]]))
          case "String" =>
            addToSetObj = addToSetObj.++(col.name -> MongoDBObject("$each" -> row(col.index).asInstanceOf[ArrayBuffer[String]]))
        }

      })
      if (addToSetObj.size > 0) obj = obj.++($addToSet(addToSetObj))
      if (incFields.size > 0) obj = obj.++($inc(incFields: _*))
      if (setFields.size > 0) obj = obj.++($set(setFields: _*))
      bulkWrite.find(MongoDBObject("_id" -> id)).upsert().updateOne(obj)
    })
    bulkWrite.execute()
  }
}
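For completeness, here is a rough sketch of how the pieces above might be wired together in the driver. StreamingConfig, KafkaCluster and the helper methods are the ones from this post; the construction of streamingConfig, kc and ssc is assumed to happen elsewhere:

// hypothetical driver wiring, under the assumptions stated above
def run(ssc: StreamingContext)(implicit streamingConfig: StreamingConfig, kc: KafkaCluster): Unit = {
  // initialize or repair the consumer offsets in ZooKeeper before building the stream
  setOrUpdateOffsets(streamingConfig, kc)
  // build the direct stream starting from the (now valid) ZooKeeper offsets
  val messages = createDirectStream(streamingConfig, kc)
  // process each batch, then commit the offsets back to ZooKeeper
  processData(messages)
  ssc.start()
  ssc.awaitTermination()
}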