This article assumes basic familiarity with Flume, Kafka, and Spark Streaming.

1. Setting up the Flume and Kafka environment

For choosing versions, refer to http://spark.apache.org/docs/latest/streaming-kafka-integration.html:

|  | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10 |
| --- | --- | --- |
| Broker Version | 0.8.2.1 or higher | 0.10.0 or higher |
| Api Stability | Stable | Experimental |
| Language Support | Scala, Java, Python | Scala, Java |
| Receiver DStream | Yes | No |
| Direct DStream | Yes | Yes |
| SSL / TLS Support | No | Yes |
| Offset Commit Api | No | Yes |
| Dynamic Topic Subscription | No | Yes |

Make sure the Kafka version you install matches the integration version chosen above. Spark 2.0 supports Kafka 0.8.2.1 and above, and Flume 1.6.0.

The Maven coordinates for the Flume integration are therefore given here; for the Kafka integration, readers can consult the official page linked above.

groupId = org.apache.spark
 artifactId = spark-streaming-flume_2.11
 version = 2.0.2
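
For reference, the Kafka 0-10 integration used in section 3 below has analogous coordinates (assuming the same Scala and Spark versions as above):

groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.0.2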

2. Flume architecture design

We reuse the architecture described in "Meituan's Flume-based log collection system, part 1: architecture and design" (http://www.aboutyun.com/thread-8317-1-1.html, source: aboutyun; thanks to the author for sharing it). The overall architecture is shown in the figure below:

(Figure: overall log collection architecture: client agents forward events over Avro to collector agents, which write the data to HDFS and Kafka)

2-1) Client configuration

Reference: http://flume.apache.org/FlumeUserGuide.html

For easier observation, the client agent is given three sinks: the first two forward data to the flume_collector tier, and the third prints the current log locally (logger sink). The configuration is as follows:

agent.sources = source1
agent.channels = channel1 channel3
agent.sinks = sink1 sink2 sink3
agent.sinkgroups = group1
# source configuration
agent.sources.source1.channels = channel1 channel3
agent.sources.source1.type = exec
agent.sources.source1.shell = /bin/bash -c
agent.sources.source1.restart = true
agent.sources.source1.batchTimeout = 3000
agent.sources.source1.batchSize = 10000
agent.sources.source1.threads = 5
agent.sources.source1.selector.type = replicating
agent.sources.source1.command = tail -n +0 -F /usr/tomcat/tomcat8/logs/localhost_access_log.2016-11-16.txt

# channel configuration
agent.channels.channel1.capacity = 100000
# Each channel's type is defined.
agent.channels.channel1.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.channel1.transactionCapacity = 10000

agent.channels.channel3.capacity = 100000
agent.channels.channel3.type = memory
agent.channels.channel3.transactionCapacity = 10000
# sink configuration
agent.sinks.sink1.channel = channel1
agent.sinks.sink1.type = avro
agent.sinks.sink1.hostname = lenovo1
agent.sinks.sink1.port = 12343
agent.sinks.sink1.batch-size = 0

agent.sinks.sink2.channel = channel1
agent.sinks.sink2.type = avro
agent.sinks.sink2.hostname = lenovo2
agent.sinks.sink2.port = 12343
agent.sinks.sink2.batch-size = 0

agent.sinks.sink3.type = logger
agent.sinks.sink3.channel = channel3
agent.sinks.sink3.sink.maxBytesToLog = 1000

# clientMainAgent sinks group
agent.sinkgroups.group1.sinks = sink1 sink2
# load_balance type
agent.sinkgroups.group1.processor.type = load_balance
agent.sinkgroups.group1.processor.backoff   = true
agent.sinkgroups.group1.processor.selector  = random
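
Assuming this configuration is saved as, say, conf/client.conf (the path and file name are placeholders), the client agent can be started with the standard flume-ng command; the name passed with --name must match the "agent" prefix used in the properties file:

bin/flume-ng agent --conf conf --conf-file conf/client.conf --name agent -Dflume.root.logger=INFO,console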

Note that two channels are used here. The reason becomes clear from the figure below:

(Figure: when several sinks drain a single channel, the events are split among them)


The source produces only one copy of the data, and a channel is essentially a buffer pool. If all sinks drew from the same channel, that single copy of the data would be split across the sinks (each event would reach only one of them); the replicating selector with a separate channel per destination avoids this.


2-2) Collector (central server) configuration

agent.sources = source1
agent.channels = channel1 channel2
agent.sinks = sink1 sink2

# source configuration
agent.sources.source1.type = avro
agent.sources.source1.selector.type = replicating
agent.sources.source1.channels = channel1 channel2
agent.sources.source1.bind = 0.0.0.0
agent.sources.source1.port = 12343

# channel configuration
agent.channels.channel1.capacity = 1000
# Each channel's type is defined.
agent.channels.channel1.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.channel1.transactionCapacity = 100


agent.channels.channel2.capacity = 1000
agent.channels.channel2.type = memory
agent.channels.channel2.transactionCapacity = 100

# sink configuration
# HDFS sink
agent.sinks.sink1.channel = channel1
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
agent.sinks.sink1.hdfs.path = hdfs://lenovo1:9000/flume/events/%Y/%m/%d/%H/%M
agent.sinks.sink1.hdfs.filePrefix = l2_log
agent.sinks.sink1.hdfs.minBlockReplicas = 1
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.hdfs.writeFormat = Text
agent.sinks.sink1.hdfs.rollInterval = 60
agent.sinks.sink1.hdfs.rollSize = 102400
agent.sinks.sink1.hdfs.rollCount = 0 
agent.sinks.sink1.hdfs.idleTimeout = 0
agent.sinks.sink1.hdfs.batchSize = 100


# Kafka sink
agent.sinks.sink2.channel = channel2
agent.sinks.sink2.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.sink2.kafka.topic = mytopic
agent.sinks.sink2.kafka.bootstrap.servers = lenovo1:9092,lenovo2:9092,lenovo3:9092,lenovo4:9092,lenovo6:9092,lenovo7:9092
agent.sinks.sink2.kafka.flumeBatchSize = 20
agent.sinks.sink2.kafka.producer.acks = 0
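
The Kafka sink writes to the topic mytopic, which has to exist (unless auto topic creation is enabled on the brokers). A sketch of creating it by hand, assuming ZooKeeper runs at lenovo1:2181 and with partition/replication counts chosen only as an example:

bin/kafka-topics.sh --create --zookeeper lenovo1:2181 --replication-factor 2 --partitions 6 --topic mytopic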



A few parameters of the HDFS sink deserve a closer look:

hdfs.rollInterval, hdfs.rollSize, hdfs.rollCount, hdfs.batchSize

hdfs.rollInterval: Number of seconds to wait before rolling current file (0 = never roll based on time interval)

From my own tests: counted from the moment the .tmp file is created, this is how long until the file is rolled (closed and renamed), as the log below shows:

2016-11-16 10:30:53,058 INFO hdfs.BucketWriter: Closing hdfs://lenovo1:9000/flume/events/2016/11/16/10/29/l1_log.1479263392993.tmp
2016-11-16 10:30:53,093 INFO hdfs.BucketWriter: Renaming hdfs://lenovo1:9000/flume/events/2016/11/16/10/29/l1_log.1479263392993.tmp to hdfs://lenovo1:9000/flume/events/2016/11/16/10/29/l1_log.1479263392993


hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)

In other words: the file is rolled once its size exceeds this value.


hdfs.rollCount: Number of events written to file before it rolled (0 = never roll based on number of events)

In other words: the file is rolled once the number of events written to it exceeds this value.


These three parameters are combined with OR semantics: the file is rolled as soon as any one of the conditions is met. One caveat:

If hdfs.path uses time-based escapes such as %Y/%m/%d/%H/%M to create files, make sure at least one of the conditions above can be triggered within the smallest time unit of the path; otherwise the roll will never happen.


hdfs.batchSize: number of events written to file before it is flushed to HDFS

In other words: the number of events taken from the channel per write; batchSize directly affects the throughput of the whole pipeline.


Based on practice, I suggest the following settings (a configuration sketch follows this list):

rollInterval: set it to the number of seconds needed for the smallest time unit in the path to roll over into the next larger one; for example, if files are created per minute, rollInterval should be set to 60.

rollCount: set it to 0, i.e. do not use the event count as a roll trigger (assuming the data is being stored in HDFS).

rollSize: set it to an integer multiple of the HDFS block size (ideally one block per file); if it is too small, a large number of small HDFS files will be produced, which severely wastes storage.

batchSize: set it to the channel's transactionCapacity, so that one channel transaction corresponds to one write to HDFS.
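
A minimal sketch of an HDFS sink tuned along these lines, assuming a 128 MB HDFS block size, a path that rolls over per minute, and transactionCapacity = 100 as in the collector configuration above:

agent.sinks.sink1.hdfs.rollInterval = 60
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.rollSize = 134217728
agent.sinks.sink1.hdfs.batchSize = 100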


Suggested channel parameters:

capacity :The maximum number of events stored in the channel

In other words: the maximum number of events the channel can hold.

transactionCapacity: The maximum number of events the channel will take from a source or give to a sink per transaction

In other words: the maximum number of events handled in a single transaction.

byteCapacity: Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line).

In other words: the maximum byte capacity, by default 80% of the JVM heap; only the event body is counted, not the headers.
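
For example, if the agent JVM were started with -Xmx512m (a value used purely for illustration), byteCapacity would default to roughly 410 MB; byteCapacityBufferPercentage (20% by default) then reserves headroom for the headers that are not counted.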

These capacity-related parameters should be set according to the memory actually available.


Suggested source parameters:

batchSize :The max number of lines to read and send to the channel at a time

In other words: the number of lines read from the file at a time.

batchTimeout :Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream

In other words: how long to wait before handing data downstream when the buffer has not filled up.

Leaving buffering aside, the value of batchSize / batchTimeout should be greater than the rate at which data is produced; otherwise data will not be processed in time.
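
With the client settings above (batchSize = 10000, batchTimeout = 3000 ms), this ratio works out to roughly 3,300 lines per second, so by this rule the log should not grow faster than that.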


3. Spark Streaming + Kafka code; see http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html. The code is as follows:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies}

object StreamingKafka {
  // quiet down Spark and Jetty logging
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.WARN)

  def main(args: Array[String]) {

    val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "lenovo1:9092,lenovo2:9092,lenovo3:9092,lenovo4:9092,lenovo6:9092,lenovo7:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "groups",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
    val topics = Array("mytopic")

    val conf = new SparkConf()
      .setMaster("spark://lenovo1:7077")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://cluster/spark/directory")
      .setAppName(this.getClass.getSimpleName)
      .setJars(Array("E:\\ScalaSpace\\Spark2_Streaming\\out\\artifacts\\Spark2.0_Archetype.jar"))
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://cluster/spark/checkpoint/streaming")

    // direct stream on the Kafka 0-10 consumer API, subscribing to the Flume-fed topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // collect the record values to the driver and print them (demo only; avoid collect() on large batches)
    stream.map(record => record.value).foreachRDD(rdd => {
      println("start ---------------")
      rdd.collect().foreach(println)
    })
    // stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
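
Since enable.auto.commit is false, the code above never commits offsets back to Kafka, so after a restart it simply falls back to auto.offset.reset. The 0-10 integration exposes a commit API for this; a minimal sketch following the pattern in the official guide (it additionally needs HasOffsetRanges and CanCommitOffsets imported from org.apache.spark.streaming.kafka010):

stream.foreachRDD { rdd =>
  // capture the offset ranges of this batch before any shuffle or repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the rdd here ...

  // asynchronously commit the offsets back to Kafka once the batch has been handled
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}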

When you submit the job to the cluster you may get a ClassNotFoundException; one fix is to copy the missing jars into Spark's jars directory:

(screenshot: the missing jars copied into Spark's jars directory)
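
Alternatively, instead of copying jars by hand, the dependency can be pulled in at submit time via spark-submit --packages (assuming the machines can reach a Maven repository), for example:

bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.0.2 <other options> <application jar>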

That completes a simple log collection pipeline. Note, however, that the exec source is not recommended by the official documentation (it provides no delivery guarantees if the agent dies), so readers may want to replace it with another source, such as the Spooling Directory source.