1. Description

      In CDH 5.2 and higher, Flume includes a Kafka source and sink. Use these to stream data from Kafka to Hadoop, or from any Flume source to Kafka.

2. Prerequisites

      In CDH 5.7 and higher, the Flume connectors for Kafka only work with Kafka 2.0 and higher.

3. Important Notes

      Do not configure a Kafka source to send data to a Kafka sink. If you do, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop that sends messages back and forth between the source and the sink.

      If you need to use both a source and a sink, use an interceptor to modify the event header and set a different topic, as in the sketch below.
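      A minimal sketch of that workaround, reusing the tier1/source1 names from the examples below; the target topic name is hypothetical. The static interceptor overwrites the topic header that the Kafka source sets on every event:

tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = static
# allow overwriting the topic header the Kafka source already set
tier1.sources.source1.interceptors.i1.preserveExisting = false
tier1.sources.source1.interceptors.i1.key = topic
# hypothetical topic name; must differ from the topic the source reads
tier1.sources.source1.interceptors.i1.value = some-other-topic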

4. Practice

     (1) Kafka Source

         Use the Kafka source to stream data from a Kafka topic into Hadoop. The Kafka source can be combined with any Flume sink, making it easy to write Kafka data to HDFS, HBase, and Solr.

     1. The following Flume configuration example uses a Kafka source to send data to an HDFS sink:

flume_get_kafka_then_send_data_to_hdfs.txt

tier1.sources  = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
# ZooKeeper quorum used by Kafka
tier1.sources.source1.zookeeperConnect = hadoop1:2181,hadoop2:2181,hadoop3:2181
# Kafka topic to read from
tier1.sources.source1.topic = 20161121a
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 100

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = hdfs
# HDFS output path; %{topic} comes from the event's topic header and the
# %y-%m-%d escapes use the timestamp header added by the timestamp interceptor
tier1.sinks.sink1.hdfs.path = /user/root/testspark/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

       Note:

            For higher throughput, configure multiple Kafka sources to read from the same topic.

            If you configure all the sources with the same groupID and the topic contains multiple partitions, each source reads data from a different set of partitions, improving the ingest rate. A sketch of this layout follows.
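      A minimal sketch of that layout (the agent, topic, and groupId reuse the example above; the second source is hypothetical). Because both sources share one groupId, Kafka assigns each of them a different subset of the topic's partitions:

tier1.sources = source1 source2

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = hadoop1:2181,hadoop2:2181,hadoop3:2181
tier1.sources.source1.topic = 20161121a
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1

tier1.sources.source2.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source2.zookeeperConnect = hadoop1:2181,hadoop2:2181,hadoop3:2181
tier1.sources.source2.topic = 20161121a
# same groupId, so this source reads a different set of partitions
tier1.sources.source2.groupId = flume
tier1.sources.source2.channels = channel1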

     2. The following describes the parameters supported by the Kafka source; required properties are marked (required):

type (required)
    Must be set to org.apache.flume.source.kafka.KafkaSource.

zookeeperConnect (required)
    The URI of the ZooKeeper server or quorum used by Kafka. This can be a single host (for example, zk01.example.com:2181) or a comma-separated list of hosts in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).

topic (required)
    The Kafka topic from which this source reads messages. Flume supports only one topic per source.

groupID (default: flume)
    The unique identifier of the Kafka consumer group. Set the same groupID in all sources to indicate that they belong to the same consumer group.

batchSize (default: 1000)
    The maximum number of messages that can be written to a channel in a single batch.

batchDurationMillis (default: 1000)
    The maximum time (in ms) before a batch is written to the channel. The batch is written when the batchSize limit or the batchDurationMillis limit is reached, whichever comes first.

Other properties supported by the Kafka consumer
    Used to configure the Kafka consumer used by the Kafka source. You can use any consumer properties supported by Kafka. Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Kafka documentation for the full list of Kafka consumer properties.

     3. Tuning notes:

          The Kafka source overrides two Kafka consumer parameters:

         (1) auto.commit.enable is set to false by the source, and every batch is committed. For improved performance, set this to true using the kafka.auto.commit.enable setting. Doing so can lead to data loss if the source goes down before committing.

         (2) consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for data to become available. Setting this to a higher value can lower CPU utilization because of less frequent polling, but introduces latency when writing batches to the channel. Both overrides are sketched below.
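      A hedged sketch of the two overrides, applied to source1 from the example above (the values are illustrative, not recommendations):

# commit offsets automatically instead of once per Flume batch (may lose data on failure)
tier1.sources.source1.kafka.auto.commit.enable = true
# wait up to 500 ms for new data instead of the 10 ms the source sets by default
tier1.sources.source1.kafka.consumer.timeout.ms = 500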

 

     (2) Kafka Sink

          1. Description: Use the Kafka sink to send data from a Flume source to Kafka. You can use the Kafka sink in addition to Flume sinks such as HBase or HDFS.

          2. The following Flume configuration example uses a Kafka sink with an exec source:

         

tier1.sources  = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka topic to publish to, and the broker list
tier1.sinks.sink1.topic = 20161121a
tier1.sinks.sink1.brokerList = hadoop3:9092,hadoop4:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20

          3. The following describes the parameters supported by the Kafka sink; required properties are marked (required).

          

type (required)
    Must be set to org.apache.flume.sink.kafka.KafkaSink.

brokerList (required)
    The brokers the Kafka sink uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability.

topic (default: default-flume-topic)
    The Kafka topic to which messages are published by default. If the event header contains a topic field, the event is published to the designated topic, overriding the configured topic.

batchSize (default: 100)
    The number of messages to process in a single batch. Specifying a larger batchSize can improve throughput and increase latency.

request.required.acks (default: 0)
    The number of replicas that must acknowledge a message before it is written successfully. Possible values are 0 (do not wait for an acknowledgement), 1 (wait for the leader to acknowledge only), and -1 (wait for all replicas to acknowledge). To avoid potential loss of data in case of a leader failure, set this to -1.

Other properties supported by the Kafka producer
    Used to configure the Kafka producer used by the Kafka sink. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties.
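      For illustration, a hedged sketch of these settings applied to sink1 from the example above (the compression codec is an arbitrary choice):

# wait for all replicas to acknowledge each message to avoid loss on leader failure
tier1.sinks.sink1.request.required.acks = -1
# any other Kafka producer property can be passed through with the kafka. prefix
tier1.sinks.sink1.kafka.compression.codec = snappy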

  Notes:

       The Kafka sink uses the topic and key attributes in the FlumeEvent headers to determine where to send events in Kafka.

       1. If the header contains a topic attribute, the event is sent to the designated topic, overriding the configured topic.

       2. If the header contains a key attribute, that key is used to partition events within the topic. Events with the same key are sent to the same partition.

       3. If no key attribute is specified, events are distributed to partitions at random. Use these attributes to control the topic and partition an event is sent to from a Flume source or interceptor; a sketch follows below.
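       A minimal, hypothetical sketch of header-based routing: a static interceptor on the exec source above gives every event the same key header, so all of them land in one partition (a custom interceptor would normally compute a per-event key instead):

tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = static
# write a constant "key" header; "host-a" is an arbitrary example value
tier1.sources.source1.interceptors.i1.key = key
tier1.sources.source1.interceptors.i1.value = host-a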

 

     (3) Kafka Channel

          1. Description: CDH 5.3 and higher includes a Kafka channel for Flume, in addition to the existing memory and file channels. You can use the Kafka channel:

                (1) To write to Hadoop directly from Kafka, without using a source.
                (2) To write to Kafka directly from Flume sources, without additional buffering.
                (3) As a reliable and highly available channel for any source/sink combination.

 

          2. The following Flume configuration uses a Kafka channel with an exec source and an HDFS sink:

             

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
# Kafka topic and broker list used by the channel
tier1.channels.channel1.brokerList = hadoop3:9092,hadoop4:9092
tier1.channels.channel1.topic = channel2
# ZooKeeper quorum used by Kafka
tier1.channels.channel1.zookeeperConnect = hadoop1:2181,hadoop2:2181,hadoop3:2181
tier1.channels.channel1.parseAsFlumeEvent = true

tier1.sinks.sink1.type = hdfs
# HDFS output path
tier1.sinks.sink1.hdfs.path = /user/root/testspark/20161121a
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

 

          3. The following describes the parameters supported by the Kafka channel; required properties are marked (required).

type (required)
    Must be set to org.apache.flume.channel.kafka.KafkaChannel.

brokerList (required)
    The brokers the Kafka channel uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability.

zookeeperConnect (required)
    The URI of the ZooKeeper server or quorum used by Kafka. This can be a single host (for example, zk01.example.com:2181) or a comma-separated list of hosts in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).

topic (default: flume-channel)
    The Kafka topic the channel will use.

groupID (default: flume)
    The unique identifier of the Kafka consumer group the channel uses to register with Kafka.

parseAsFlumeEvent (default: true)
    Set to true if a Flume source is writing to the channel and expects Avro datums with the FlumeEvent schema (org.apache.flume.source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using.

readSmallestOffset (default: false)
    If true, reads all data in the topic. If false, reads only data written after the channel has started. Only used when parseAsFlumeEvent is false.

kafka.consumer.timeout.ms (default: 100)
    Polling interval when writing to the sink.

Other properties supported by the Kafka producer
    Used to configure the Kafka producer. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties.
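          A minimal sketch of use case (1) from the description above: an agent with no source that drains an existing Kafka topic straight into HDFS. The topic and path are hypothetical, and parseAsFlumeEvent is set to false because the messages come from an ordinary Kafka producer rather than a Flume source:

tier1.channels = channel1
tier1.sinks = sink1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = hadoop3:9092,hadoop4:9092
tier1.channels.channel1.zookeeperConnect = hadoop1:2181,hadoop2:2181,hadoop3:2181
# hypothetical topic written by an external (non-Flume) producer
tier1.channels.channel1.topic = raw-events
# payloads are plain messages, not Avro-encoded FlumeEvents
tier1.channels.channel1.parseAsFlumeEvent = false

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /user/root/testspark/raw-events
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1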

 

English documentation: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html