Introduction to Flume Components


神谕03, 2020-08-27 12:47:18. Category: Flume


© Copyright belongs to the author. This is an original work by 神谕03 on the 51CTO blog; contact the author for permission before reposting.


Flume components:

In Flume, the agent is the smallest unit that runs independently. An agent consists of a Source, a Channel, and a Sink.
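To make the Source, Channel, Sink relationship concrete, here is a minimal sketch of a single-agent configuration. The component names (r1, c1, s1), the bind address, the port, and the capacity values are assumptions chosen for illustration, not values from the original post.

```properties
# Hypothetical agent "a1" with one source, one channel, and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = s1

# NetCat TCP source listening on an assumed address and port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel that buffers events between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Logger sink that prints events to the agent log
a1.sinks.s1.type = logger

# Wiring: a source can feed several channels, a sink drains exactly one
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
```

Assuming the file is saved as example.conf, an agent like this would typically be started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.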

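Source: 1) NetCat Source: available in TCP and UDP variants, which are configured in much the same way. It listens on a specified ***IP address and port*** and turns each line of data it receives into an Event, which it writes to the Channel.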

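2) Avro Source (pronounced roughly [ævrə]) (https://blog.csdn.net/zhouleilei/article/details/8537831): the avro-client tool can send a specified file to a Flume agent's Avro Source. The Avro Source uses the ***Avro RPC mechanism*** and is Flume's main RPC source. It communicates over the Netty-Avro inter-process communication (IPC) protocol, so data can be sent to an Avro Source from Java or any other JVM language.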

3) Exec Source: runs a ***command*** and turns its output into events. The command to execute is set with a1.sources.r1.command=tail -f /tmp/err.log

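4) Taildir Source: monitors the ***multiple files*** you specify. As soon as new data is appended to any of them, it writes that data to the configured channel. It records its read position in each file, so it can resume after a restart without losing data. This makes it highly reliable and the recommended choice.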

5) Spooling Directory Source: files to be ingested are placed in a "spooling" directory. Flume keeps watching this ***directory*** and treats each new file as source data.

6) Kafka Source: reads data from the specified topics in ***Kafka***.

7) Custom Source (implemented by extending Flume's source classes)
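The file-based and Kafka sources above could be configured roughly as follows. This is a sketch: every path, topic name, broker address, and consumer group id is an assumed placeholder, and the Kafka property names follow the style used by Flume 1.7 and later (older releases used different keys).

```properties
# Three hypothetical sources on agent a1, all writing to channel c1
a1.sources = r1 r2 r3

# Taildir Source: tails several files and stores read offsets in a position file
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/app/app.log
a1.sources.r1.filegroups.f2 = /var/log/web/.*log.*
a1.sources.r1.channels = c1

# Spooling Directory Source: ingests files dropped into the watched directory
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /var/log/flume/spool
a1.sources.r2.fileHeader = true
a1.sources.r2.channels = c1

# Kafka Source: consumes the listed topics as a Kafka consumer group
a1.sources.r3.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r3.kafka.bootstrap.servers = localhost:9092
a1.sources.r3.kafka.topics = topic1, topic2
a1.sources.r3.kafka.consumer.group.id = flume-source-group
a1.sources.r3.channels = c1
```

Files placed in the spooling directory should be complete and should not be modified afterwards, which is why writers usually move finished files into it rather than writing there directly.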

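Channel: 1) Memory Channel: events are stored in an in-***memory*** queue with a configurable maximum size. It is an ideal choice for flows that need high throughput and can afford to lose data if the agent fails.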

2) File Channel (persisted to disk): the File Channel is a durable channel. It stores events on ***disk***, so the data is safe as long as there is enough disk space.

3) JDBC Channel (database): events are stored in a persistent ***database***.

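4) Kafka Channel: events are stored in a ***Kafka*** cluster. Kafka provides high availability and reliability, so if an agent or a Kafka broker crashes, the events are immediately available to other sinks.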

5) Custom channel
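As a rough illustration of the trade-off between these channels, the sketch below declares a memory channel, a file channel, and a Kafka channel side by side. The capacities, directories, broker addresses, and topic name are assumptions for illustration only.

```properties
# Three hypothetical channels on agent a1
a1.channels = c1 c2 c3

# Memory Channel: fast, but buffered events are lost if the agent dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# File Channel: events survive an agent restart because they are kept on disk
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /mnt/flume/checkpoint
a1.channels.c2.dataDirs = /mnt/flume/data

# Kafka Channel: events are kept in a Kafka topic (Flume 1.7+ property names)
a1.channels.c3.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c3.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092
a1.channels.c3.kafka.topic = flume-channel
a1.channels.c3.kafka.consumer.group.id = flume-channel-group
```

A source or sink picks one of these by name, for example `a1.sinks.s1.channel = c2`.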

Sink: 1) File Roll: stores events on the ***local file system***.

a1.sinks.s1.type=file_roll
a1.sinks.s1.sink.directory=/home/work/rolldata
a1.sinks.s1.sink.rollInterval=60

2) Avro: the basis for ***multi-hop flows***, fan-out flows (one to many), and fan-in flows (many to one).

a1.sinks.s1.type=avro
a1.sinks.s1.hostname=192.168.234.212
a1.sinks.s1.port=9999

3) HDFS: this sink writes events to the Hadoop Distributed File System (***HDFS***). It currently supports creating text files and sequence files.
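Following the pattern of the File Roll and Avro examples, an HDFS sink could be configured along these lines. This is a sketch: the target path, file prefix, and roll interval are assumed values, not taken from the original post.

```properties
# Hypothetical HDFS sink s1 on agent a1, draining channel c1
a1.sinks.s1.type = hdfs
# Time escapes in the path need a timestamp on each event (see useLocalTimeStamp)
a1.sinks.s1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.s1.hdfs.filePrefix = events-
# DataStream writes plain text; the default fileType is SequenceFile
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text
# Start a new file every 60 seconds
a1.sinks.s1.hdfs.rollInterval = 60
# Use the agent's local time instead of requiring a timestamp header
a1.sinks.s1.hdfs.useLocalTimeStamp = true
a1.sinks.s1.channel = c1
```

If `useLocalTimeStamp` is left at its default of false, the events need a timestamp header, which the Timestamp interceptor can supply.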

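Interceptors: an interceptor sits between a source and a channel. Before the events received by a source are written to the channel, interceptors can transform or drop them. Each interceptor only processes the events received by the source it is attached to. Custom interceptors are also supported.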

1) Timestamp (***timestamp*** interceptor)

a1.sources.r1.interceptors = timestamp 
a1.sources.r1.interceptors.timestamp.type=timestamp 
a1.sources.r1.interceptors.timestamp.preserveExisting=false

2) Host (***host*** interceptor)

a1.sources.r1.interceptors = host 
a1.sources.r1.interceptors.host.type=host 
a1.sources.r1.interceptors.host.useIP=false 
a1.sources.r1.interceptors.host.preserveExisting=true

3) Static (***static*** interceptor)

a1.sources.r1.interceptors = static 
a1.sources.r1.interceptors.static.type=static 
a1.sources.r1.interceptors.static.key=logs 
a1.sources.r1.interceptors.static.value=logFlume 
a1.sources.r1.interceptors.static.preserveExisting=false

4) REGEX_FILTER (***regex filtering*** interceptor)

a1.sources.r1.interceptors = regex 
a1.sources.r1.interceptors.regex.type=REGEX_FILTER 
a1.sources.r1.interceptors.regex.regex=(rm)|(kill) 
a1.sources.r1.interceptors.regex.excludeEvents=false

5) Regex_Extractor (***regex extraction*** interceptor)

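a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# Sample event body: hostname is bigdata111 ip is 192.168.212.111
a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)
# One serializer per capture group; the header names below are assumed
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = hostname
a1.sources.r1.interceptors.i1.serializers.s2.name = ip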

6) UUID (***UUID*** interceptor)

a1.sources.sources1.interceptors = i1
a1.sources.sources1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.sources1.interceptors.i1.headerName = uuid
a1.sources.sources1.interceptors.i1.preserveExisting = true
a1.sources.sources1.interceptors.i1.prefix = UUID-

7) Custom interceptor (Java)
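Each example above attaches a single interceptor, but a source can also run several in sequence. The sketch below chains three built-in interceptors and one custom one. The names i1 to i4, the header name, and the regex are assumptions, and `com.example.flume.MyInterceptor` is a hypothetical class used only to show how a custom interceptor is referenced.

```properties
# Interceptors run in the order they are listed
a1.sources.r1.interceptors = i1 i2 i3 i4

# Add a timestamp header to each event
a1.sources.r1.interceptors.i1.type = timestamp

# Add the agent's host under an assumed header name
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname

# Drop events whose body matches the regex
a1.sources.r1.interceptors.i3.type = regex_filter
a1.sources.r1.interceptors.i3.regex = ^DEBUG
a1.sources.r1.interceptors.i3.excludeEvents = true

# Custom interceptor: referenced by the fully qualified name of its Builder class
a1.sources.r1.interceptors.i4.type = com.example.flume.MyInterceptor$Builder
```

A custom interceptor class implements Flume's `Interceptor` interface together with a nested `Interceptor.Builder`, and its jar has to be on the agent's classpath.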