Using Spark to write Kafka data to Hive: reading Kafka data with Spark

Reposted

小屁孩 2023-09-24 20:39:25

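Spark Streaming can consume Kafka data in two ways: the receiver-based approach, which uses Kafka's high-level consumer API, and the direct approach, which uses Kafka's low-level (simple) consumer API.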

The sections below describe both approaches and compare their strengths and weaknesses.

I. Receiver-based Approach

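This approach uses Kafka's high-level consumer API and a receiver to receive Kafka messages. The receiver stores the data it receives in Spark executors, and the jobs launched by Spark Streaming then process that data.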

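With the default configuration, this approach can lose data when a failure occurs. To guarantee that no data is lost, enable the write-ahead log (WAL) in Spark Streaming: every message received from Kafka is also written synchronously to a log on a distributed file system such as HDFS, so all data can be recovered after a failure.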

Code example:

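// Receiver-based API from the Kafka 0.8 integration (package org.apache.spark.streaming.kafka).
// The bracketed arguments are placeholders to fill in.
import org.apache.spark.streaming.kafka._

val kafkaStream = KafkaUtils.createStream(streamingContext,
    [ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])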
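The write-ahead log described above is switched on through configuration. The following is a minimal sketch, assuming the Kafka 0.8 integration is on the classpath; the ZooKeeper address, group id, topic name, and checkpoint path are hypothetical placeholders, not values from the original article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverWithWriteAheadLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverWithWriteAheadLog")
      // Write received data to the write-ahead log before it is processed.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(2))
    // The log is kept under the checkpoint directory (hypothetical path).
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",                   // hypothetical ZooKeeper quorum
      "example-group",                  // hypothetical consumer group id
      Map("example-topic" -> 1),        // hypothetical topic -> number of consumer threads
      StorageLevel.MEMORY_AND_DISK_SER) // no in-memory replication; the log already keeps a copy

    kafkaStream.map(_._2).print()       // elements are (key, message) pairs
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The storage level is set to a non-replicated one here on the assumption that the log on HDFS already provides the second copy of the data.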

II. Direct Approach

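Spark 1.3 introduced this receiver-less "direct" approach to provide stronger end-to-end guarantees. Instead of using a receiver, it periodically queries Kafka for the latest offsets of each topic partition and uses them to define the offset range that each batch processes.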
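The idea of defining one offset range per partition for each batch can be sketched in plain Scala. This is an illustration only, not Spark's implementation: `OffsetRange` here is a hypothetical local class that merely mirrors the shape of the library's class, and `OffsetPlanner` and its parameters are made up for the example.

```scala
// Hypothetical stand-in for the library's OffsetRange; defined locally for illustration.
final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

object OffsetPlanner {
  /** Build one [fromOffset, untilOffset) range per partition for the next batch.
    *
    * @param consumed next offset to read for each partition
    * @param latest   latest offset reported by Kafka for each partition
    */
  def nextBatch(topic: String, consumed: Map[Int, Long], latest: Map[Int, Long]): Seq[OffsetRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, until) =>
      // Assumption for this sketch: partitions not seen before start at offset 0.
      val from = consumed.getOrElse(partition, 0L)
      OffsetRange(topic, partition, from, math.max(from, until))
    }
}
```

A partition with no new messages gets an empty range, where `fromOffset` equals `untilOffset`.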

Compared with approach I, approach II has the following advantages:

1) Simplified parallelism

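There is no need to create multiple Kafka input streams and union them. Spark Streaming creates as many RDD partitions as there are Kafka partitions and reads from them in parallel, so Kafka partitions and RDD partitions map one to one, which is easier to understand and tune.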

2) Efficiency

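To guarantee zero data loss, approach I needs the write-ahead log, which replicates the data a second time. This is inefficient because the data is effectively replicated twice: once by Kafka and once by the write-ahead log. Approach II removes this overhead: it has no receiver and therefore needs no write-ahead log. As long as Kafka still retains the messages, they can be re-read from Kafka.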

3) Exactly-once semantics

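Approach I uses Kafka's high-level API and stores consumed offsets in Zookeeper. Combined with the write-ahead log this prevents data loss, but after a failure the data Spark Streaming has reliably received can be inconsistent with the offsets recorded in Zookeeper, so some messages may be consumed twice. Approach II uses Kafka's simple API and does not rely on Zookeeper for offsets; Spark Streaming tracks them itself in its checkpoints. This removes that inconsistency, so each record is received by Spark Streaming effectively exactly once even when failures occur. Exactly-once results end to end still require the output operation to be idempotent or transactional.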

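Note that this approach does not update the offsets in Zookeeper, so Zookeeper-based Kafka monitoring tools cannot show consumption progress. To keep them working, read the offsets of each batch yourself and write them to Zookeeper manually.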
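Exactly-once delivery into Spark Streaming does not by itself make the output exactly-once. One common way to make replays harmless is an idempotent write keyed by each message's position in Kafka. The following plain-Scala sketch simulates that with an in-memory map; `Record`, `IdempotentSink`, and the map standing in for an external store are all hypothetical names introduced for illustration.

```scala
import scala.collection.mutable

// Hypothetical record type: a Kafka message identified by its position in the log.
final case class Record(topic: String, partition: Int, offset: Long, value: String)

object IdempotentSink {
  type Key = (String, Int, Long)

  /** Upsert keyed by (topic, partition, offset): writing the same record again changes nothing. */
  def write(store: mutable.Map[Key, String], batch: Seq[Record]): Unit =
    batch.foreach(r => store((r.topic, r.partition, r.offset)) = r.value)

  /** Simulate a batch that is written and then replayed after a failure; returns the row count. */
  def replayDemo(): Int = {
    val store = mutable.Map.empty[Key, String]
    val batch = Seq(Record("t", 0, 0L, "a"), Record("t", 0, 1L, "b"))
    write(store, batch)
    write(store, batch) // replay of the same offset range
    store.size
  }
}
```

Because the key is derived from the offset range of the batch, reprocessing the same range overwrites the same rows instead of adding duplicates.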

Code example:

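// Direct API from the Kafka 0.8 integration (package org.apache.spark.streaming.kafka).
// The bracketed arguments are placeholders to fill in.
import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[
    [key class], [value class], [key decoder class], [value decoder class] ](
    streamingContext, [map of Kafka parameters], [set of topics to consume])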
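The note above about reading the offsets of each batch manually relies on the `HasOffsetRanges` interface of the Kafka 0.8 integration. A minimal sketch, assuming a direct stream of `(String, String)` pairs; the `OffsetLogging` object and `logOffsets` method are hypothetical names, and the sketch only prints the ranges where a real job would write them to Zookeeper.

```scala
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object OffsetLogging {
  /** Print the offset range covered by each partition of every batch. */
  def logOffsets(directKafkaStream: InputDStream[(String, String)]): Unit =
    directKafkaStream.foreachRDD { rdd =>
      // The cast only works on the RDD produced directly by the direct stream,
      // before any transformation that changes the partitioning.
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }
}
```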

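In addition, you can pass a messageHandler to createDirectStream to access the MessageAndMetadata of each message, which carries metadata about the current message, and convert it to any type you need.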
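As a sketch of the messageHandler variant, assuming the Kafka 0.8 integration and string keys and values: this overload takes explicit starting offsets per partition. The topic name, starting offset, and the `DirectStreamWithMetadata` object are hypothetical, chosen only for illustration.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamWithMetadata {
  // Each element becomes (topic, partition, offset, message).
  type Enriched = (String, Int, Long, String)

  def create(ssc: StreamingContext, brokers: String): InputDStream[Enriched] = {
    val kafkaParams = Map("metadata.broker.list" -> brokers)
    // Hypothetical starting point: partition 0 of "example-topic" from offset 0.
    val fromOffsets = Map(TopicAndPartition("example-topic", 0) -> 0L)
    // Convert each message and its metadata into the desired element type.
    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.topic, mmd.partition, mmd.offset, mmd.message())

    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Enriched](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }
}
```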

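Example code (note that it uses the newer spark-streaming-kafka-0-10 integration, package kafka010, where createDirectStream takes a LocationStrategy and a ConsumerStrategy instead of decoder classes):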

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._

/**
 * Consumes messages from one or more topics in Kafka and does wordcount.
 * Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
 *   <brokers> is a list of one or more Kafka brokers
 *   <groupId> is a consumer group name to consume from topics
 *   <topics> is a list of one or more kafka topics to consume from
 *
 * Example:
 *    $ bin/run-example streaming.DirectKafkaWordCount broker1-host:port,broker2-host:port \
 *    consumer-group topic1,topic2
 */
object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <groupId> is a consumer group name to consume from topics
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }

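    // StreamingExamples is a helper from Spark's examples module
    // (org.apache.spark.examples.streaming); drop this call when building outside it.
    StreamingExamples.setStreamingLogLevels()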

    val Array(brokers, groupId, topics) = args

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_.value)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
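The per-batch logic of the example (split lines into words, pair each with 1L, sum by key) can be checked without a cluster. The following plain-Scala sketch mirrors that transformation on an ordinary collection; `WordCountLogic` is a hypothetical helper, and `groupBy` plus `sum` stands in for `reduceByKey`.

```scala
object WordCountLogic {
  /** Count words across lines, following the same steps as the streaming pipeline. */
  def count(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split(" "))          // lines -> words
      .map(word => (word, 1L))        // words -> (word, 1)
      .groupBy(_._1)                  // stand-in for reduceByKey
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

In the streaming job this runs once per 2-second batch, so the counts are per batch rather than cumulative.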

That concludes the introduction to the two ways Spark Streaming consumes Kafka data.
