This article assumes basic familiarity with Flume, Kafka, and Spark Streaming.

1. Setting up the Flume and Kafka environment

For choosing versions, refer to http://spark.apache.org/docs/latest/streaming-kafka-integration.html:

|  | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10 |
| --- | --- | --- |
| Broker Version | 0.8.2.1 or higher | 0.10.0 or higher |
| Api Stability | Stable | Experimental |
| Language Support | Scala, Java, Python | Scala, Java |
| Receiver DStream | Yes | No |
| Direct DStream | Yes | Yes |
| SSL / TLS Support | No | Yes |
| Offset Commit Api | No | Yes |
| Dynamic Topic Subscription | No | Yes |

Make sure the Kafka version you install matches the integration version chosen above. Spark 2.0 supports Kafka 0.8.2.1 and above, and Flume 1.6.0.

The Maven coordinates for the Flume integration are therefore given here; for the Kafka integration, readers can consult the official page linked above.

groupId = org.apache.spark
 artifactId = spark-streaming-flume_2.11
 version = 2.0.2
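
For reference, the Kafka 0-10 integration used in section 3 below has analogous coordinates (assuming the same Scala and Spark versions as above):

groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.0.2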

2. Flume architecture design

We reuse the architecture described in "Meituan's Flume-based log collection system, part 1: architecture and design" (http://www.aboutyun.com/thread-8317-1-1.html, source: aboutyun; thanks to the author for sharing it). The overall architecture is shown in the figure below:

(Figure: overall log collection architecture: client agents forward events over Avro to collector agents, which write the data to HDFS and Kafka)

2-1) Client configuration

Reference: http://flume.apache.org/FlumeUserGuide.html

For easier observation, the client agent is given three sinks: the first two forward data to the flume_collector tier, and the third prints the current log locally (logger sink). The configuration is as follows:

agent.sources = source1
agent.channels = channel1 channel3
agent.sinks = sink1 sink2 sink3
agent.sinkgroups = group1
# source configuration
agent.sources.source1.channels = channel1 channel3
agent.sources.source1.type = exec
agent.sources.source1.shell = /bin/bash -c
agent.sources.source1.restart = true
agent.sources.source1.batchTimeout = 3000
agent.sources.source1.batchSize = 10000
agent.sources.source1.threads = 5
agent.sources.source1.selector.type = replicating
agent.sources.source1.command = tail -n +0 -F /usr/tomcat/tomcat8/logs/localhost_access_log.2016-11-16.txt

# channel configuration
agent.channels.channel1.capacity = 100000
# Each channel's type is defined.
agent.channels.channel1.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.channel1.transactionCapacity = 10000

agent.channels.channel3.capacity = 100000
agent.channels.channel3.type = memory
agent.channels.channel3.transactionCapacity = 10000
# sink configuration
agent.sinks.sink1.channel = channel1
agent.sinks.sink1.type = avro
agent.sinks.sink1.hostname = lenovo1
agent.sinks.sink1.port = 12343
agent.sinks.sink1.batch-size = 0

agent.sinks.sink2.channel = channel1
agent.sinks.sink2.type = avro
agent.sinks.sink2.hostname = lenovo2
agent.sinks.sink2.port = 12343
agent.sinks.sink2.batch-size = 0

agent.sinks.sink3.type = logger
agent.sinks.sink3.channel = channel3
agent.sinks.sink3.sink.maxBytesToLog = 1000

# clientMainAgent sinks group
agent.sinkgroups.group1.sinks = sink1 sink2
# load_balance type
agent.sinkgroups.group1.processor.type = load_balance
agent.sinkgroups.group1.processor.backoff   = true
agent.sinkgroups.group1.processor.selector  = random
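
Assuming this configuration is saved as, say, conf/client.conf (the path and file name are placeholders), the client agent can be started with the standard flume-ng command; the name passed with --name must match the "agent" prefix used in the properties file:

bin/flume-ng agent --conf conf --conf-file conf/client.conf --name agent -Dflume.root.logger=INFO,console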

Note that two channels are used here. The reason becomes clear from the figure below:

(Figure: when several sinks drain a single channel, the events are split among them)


The source produces only one copy of the data, and a channel is essentially a buffer pool. If all sinks drew from the same channel, that single copy of the data would be split across the sinks (each event would reach only one of them); the replicating selector with a separate channel per destination avoids this.


2-2) Collector (central server) configuration

agent.sources = source1
agent.channels = channel1 channel2
agent.sinks = sink1 sink2

# source configuration
agent.sources.source1.type = avro
agent.sources.source1.selector.type = replicating
agent.sources.source1.channels = channel1 channel2
agent.sources.source1.bind = 0.0.0.0
agent.sources.source1.port = 12343

# channel configuration
agent.channels.channel1.capacity = 1000
# Each channel's type is defined.
agent.channels.channel1.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.channel1.transactionCapacity = 100


agent.channels.channel2.capacity = 1000
agent.channels.channel2.type = memory
agent.channels.channel2.transactionCapacity = 100

# sink configuration
# HDFS sink
agent.sinks.sink1.channel = channel1
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.hdfs.path = hdfs://lenovo1:9000/flume/events/%Y/%m/%d/%H/%M
agent.sinks.sink1.hdfs.filePrefix = l2_log
agent.sinks.sink1.hdfs.minBlockReplicas = 1
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.writeFormat = Text
agent.sinks.sink1.hdfs.rollInterval = 60
agent.sinks.sink1.hdfs.rollSize = 102400
agent.sinks.sink1.hdfs.rollCount = 0 
agent.sinks.sink1.hdfs.idleTimeout = 0
agent.sinks.sink1.hdfs.batchSize = 100


# Kafka sink
agent.sinks.sink2.channel = channel2
agent.sinks.sink2.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.sink2.kafka.topic = mytopic
agent.sinks.sink2.kafka.bootstrap.servers = lenovo1:9092,lenovo2:9092,lenovo3:9092,lenovo4:9092,lenovo6:9092,lenovo7:9092
agent.sinks.sink2.kafka.flumeBatchSize = 20
agent.sinks.sink2.kafka.producer.acks = 0
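
The Kafka sink writes to the topic mytopic, which has to exist (unless auto topic creation is enabled on the brokers). A sketch of creating it by hand, assuming ZooKeeper runs at lenovo1:2181 and with partition/replication counts chosen only as an example:

bin/kafka-topics.sh --create --zookeeper lenovo1:2181 --replication-factor 2 --partitions 6 --topic mytopic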



A few parameters of the HDFS sink deserve a closer look:

hdfs.rollInterval, hdfs.rollSize, hdfs.rollCount, hdfs.batchSize

hdfs.rollInterval: Number of seconds to wait before rolling current file (0 = never roll based on time interval)

From my own tests: counted from the moment the .tmp file is created, this is how long until the file is rolled (closed and renamed), as the log below shows:

2016-11-16 10:30:53,058 INFO hdfs.BucketWriter: Closing hdfs://lenovo1:9000/flume/events/2016/11/16/10/29/l1_log.1479263392993.tmp
2016-11-16 10:30:53,093 INFO hdfs.BucketWriter: Renaming hdfs://lenovo1:9000/flume/events/2016/11/16/10/29/l1_log.1479263392993.tmp to hdfs://lenovo1:9000/flume/events/2016/11/16/10/29/l1_log.1479263392993


hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)

In other words: the file is rolled once its size exceeds this value.


hdfs.rollCount: Number of events written to file before it rolled (0 = never roll based on number of events)

In other words: the file is rolled once the number of events written to it exceeds this value.


These three parameters are combined with OR semantics: the file is rolled as soon as any one of the conditions is met. One caveat:

If hdfs.path uses time-based escapes such as %Y/%m/%d/%H/%M to create files, make sure at least one of the conditions above can be triggered within the smallest time unit of the path; otherwise the roll will never happen.


hdfs.batchSize: number of events written to file before it is flushed to HDFS

In other words: the number of events taken from the channel per write; batchSize directly affects the throughput of the whole pipeline.


Based on practice, I suggest the following settings (a configuration sketch follows this list):

rollInterval: set it to the number of seconds needed for the smallest time unit in the path to roll over into the next larger one; for example, if files are created per minute, rollInterval should be set to 60.

rollCount: set it to 0, i.e. do not use the event count as a roll trigger (assuming the data is being stored in HDFS).

rollSize: set it to an integer multiple of the HDFS block size (ideally one block per file); if it is too small, a large number of small HDFS files will be produced, which severely wastes storage.

batchSize: set it to the channel's transactionCapacity, so that one channel transaction corresponds to one write to HDFS.
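
A minimal sketch of an HDFS sink tuned along these lines, assuming a 128 MB HDFS block size, a path that rolls over per minute, and transactionCapacity = 100 as in the collector configuration above:

agent.sinks.sink1.hdfs.rollInterval = 60
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.rollSize = 134217728
agent.sinks.sink1.hdfs.batchSize = 100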


Suggested channel parameters:

capacity :The maximum number of events stored in the channel

In other words: the maximum number of events the channel can hold.

transactionCapacity: The maximum number of events the channel will take from a source or give to a sink per transaction

In other words: the maximum number of events handled in a single transaction.

byteCapacity: Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line).

In other words: the maximum byte capacity, by default 80% of the JVM heap; only the event body is counted, not the headers.
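
For example, if the agent JVM were started with -Xmx512m (a value used purely for illustration), byteCapacity would default to roughly 410 MB; byteCapacityBufferPercentage (20% by default) then reserves headroom for the headers that are not counted.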

These capacity-related parameters should be set according to the memory actually available.


Suggested source parameters:

batchSize :The max number of lines to read and send to the channel at a time

In other words: the number of lines read from the file at a time.

batchTimeout :Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream

In other words: how long to wait before handing data downstream when the buffer has not filled up.

Leaving buffering aside, the value of batchSize / batchTimeout should be greater than the rate at which data is produced; otherwise data will not be processed in time.
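
With the client settings above (batchSize = 10000, batchTimeout = 3000 ms), this ratio works out to roughly 3,300 lines per second, so by this rule the log should not grow faster than that.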


3. Spark Streaming + Kafka code; see http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html. The code is as follows:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies}

object StreamingKafka {
  // quiet down Spark and Jetty logging
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.WARN)

  def main(args: Array[String]) {

    val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "lenovo1:9092,lenovo2:9092,lenovo3:9092,lenovo4:9092,lenovo6:9092,lenovo7:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "groups",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
    val topics = Array("mytopic")

    val conf = new SparkConf()
      .setMaster("spark://lenovo1:7077")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://cluster/spark/directory")
      .setAppName(this.getClass.getSimpleName)
      .setJars(Array("E:\\ScalaSpace\\Spark2_Streaming\\out\\artifacts\\Spark2.0_Archetype.jar"))
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://cluster/spark/checkpoint/streaming")

    // direct stream on the Kafka 0-10 consumer API, subscribing to the Flume-fed topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // collect the record values to the driver and print them (demo only; avoid collect() on large batches)
    stream.map(record => record.value).foreachRDD(rdd => {
      println("start ---------------")
      rdd.collect().foreach(println)
    })
    // stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
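
Since enable.auto.commit is false, the code above never commits offsets back to Kafka, so after a restart it simply falls back to auto.offset.reset. The 0-10 integration exposes a commit API for this; a minimal sketch following the pattern in the official guide (it additionally needs HasOffsetRanges and CanCommitOffsets imported from org.apache.spark.streaming.kafka010):

stream.foreachRDD { rdd =>
  // capture the offset ranges of this batch before any shuffle or repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the rdd here ...

  // asynchronously commit the offsets back to Kafka once the batch has been handled
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}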

When you submit the job to the cluster you may get a ClassNotFoundException; one fix is to copy the missing jars into Spark's jars directory:

(screenshot: the missing jars copied into Spark's jars directory)
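
Alternatively, instead of copying jars by hand, the dependency can be pulled in at submit time via spark-submit --packages (assuming the machines can reach a Maven repository), for example:

bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.0.2 <other options> <application jar>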

That completes a simple log collection pipeline. Note, however, that the exec source is not recommended by the official documentation (it provides no delivery guarantees if the agent dies), so readers may want to replace it with another source, such as the Spooling Directory source.