1. Flume Core Components
1.1 Source Types
A source receives events (or generates them through a special mechanism) and puts them, in batches, into one or more channels.
1.1.1 netcat
Listens on a network port of a host; whatever data arrives on the specified host and port is collected as the data source.
a1.sources.r1.type = netcat
a1.sources.r1.bind = the host to listen on
a1.sources.r1.port = the port to listen on
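A minimal, self-contained sketch of a netcat source feeding a logger sink (the host name, port and file name below are placeholders of my choosing, not part of the original):
# netcat_demo.conf (hypothetical file name)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
With the agent running, push test lines into it from another terminal, e.g. telnet hadoop01 44444 (or nc hadoop01 44444), and type a few lines; each line should show up as an event on the logger sink.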
1.1.2 exec
The data source is the output of a Unix command: Flume runs the command and keeps collecting whatever it writes to standard output. Commonly used commands are cat, tail, and head.
a1.sources.r1.type = exec
a1.sources.r1.command = the command to run, e.g. tail -F /path/to/file
Example (create the 1.txt file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: exec, tail the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/1.txt
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex01_exec.conf --name a1 -Dflume.root.logger=INFO,console
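To see events flow, append a line to the tailed file from another shell (the path matches the config above):
echo "hello exec source" >> /home/hadoop/datas/1.txt
The new line should be printed by the logger sink almost immediately.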
1.1.3 Spooling Directory
The data source is every file dropped into a directory, e.g. /datas containing 1.txt, 2.txt, 3.txt.
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = the directory to collect files from
Example (create the datas directory yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: spooldir
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/datas
a1.sources.r1.fileSuffix = .finished
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex02_spool.conf --name a1 -Dflume.root.logger=INFO,console
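To test it, copy a file (any existing text file of yours) into the spooled directory; once Flume has fully ingested it, the file is renamed with the configured suffix. Do not modify a file after dropping it into the directory.
cp /home/hadoop/1.txt /home/hadoop/datas/
# after ingestion the file becomes /home/hadoop/datas/1.txt.finished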
1.1.4 avro
The data source comes from an Avro port. This is typically used to chain agents together, i.e. one agent sends data to another:
agent1 -- avro port --> agent2
a1.sources.r1.type = avro
a1.sources.r1.bind = the host to bind; must match the hostname configured on the upstream avro sink
a1.sources.r1.port = the port to bind; must match the port configured on the upstream avro sink
Example (create the conf files yourself):
Plan the agents first:
- agent1 on hadoop01: source = netcat, channel = memory, sink = avro
- agent2 on hadoop02: source = avro, channel = memory, sink = logger
agent1:
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44455
# channel
a1.channels.c1.type = memory
# sink type: avro (forward to agent2)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 44466
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
agent2:
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: avro
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop02
a1.sources.r1.port = 44466
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start agent2 on hadoop02 first:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent02_avrosource.conf --name a1 -Dflume.root.logger=INFO,console
Then start agent1 on hadoop01:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent01_avrosink.conf --name a1 -Dflume.root.logger=INFO,console
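With both agents up, send a test line into agent1's netcat port; it should travel through the avro hop and be printed by agent2's logger sink:
telnet hadoop01 44455
# type any line and press Enter; it should appear on the agent2 console on hadoop02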
1.2 Channel Types
A channel is the buffer that sits between sources and sinks; events stay in the channel until a sink removes them. Common implementations are memory, file, and jdbc.
1.2.1 memory channel
# events are stored in memory
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity = 10000
# maximum number of events per transaction (each put/take batch)
a1.channels.c1.transactionCapacity = 10000
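Besides the two properties above, the memory channel can also be capped by size. A hedged sketch (the values are illustrative, not from the original):
# optional: limit the total memory used by event bodies, in bytes
a1.channels.c1.byteCapacity = 800000
# optional: percentage of byteCapacity reserved for event headers
a1.channels.c1.byteCapacityBufferPercentage = 20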
1.2.2 file channel
Backed by disk, so events survive an agent restart at the cost of throughput.
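A minimal sketch of a file channel configuration (the directories are placeholders I chose; make sure they exist and are writable by the Flume user):
a1.channels.c1.type = file
# where the channel stores its checkpoint
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
# comma-separated list of directories that hold the event data
a1.channels.c1.dataDirs = /home/hadoop/flume/data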
1.2.3 jdbc
Stores events in a database.
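As far as I know the JDBC channel ships with an embedded Derby database, so the minimal configuration is just the type:
a1.channels.c1.type = jdbc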
1.3 Sink Types
The sink delivers events to the next hop or to their final destination and, on success, removes them from the channel.
1.3.1 avro
Sends events to the Avro port of another agent.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = the host to send to, normally the host of the next agent
a1.sinks.k1.port = the port the next agent's avro source listens on
1.3.2 logger
Prints the events to the console.
1.3.3 hdfs
Writes the collected events to HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = the HDFS path to write to
Example (create the 2.txt file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: exec, tail the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/2.txt
# channel
a1.channels.c1.type = memory
# sink type: hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/data
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex04_hdfs_sink.conf --name a1 -Dflume.root.logger=INFO,console
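Append a few lines to 2.txt and then check HDFS. By default the output files use the FlumeData prefix (files still being written carry a .tmp suffix):
echo "hello hdfs sink" >> /home/hadoop/datas/2.txt
hdfs dfs -ls /flume/data
hdfs dfs -cat /flume/data/FlumeData.*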
2. Interceptors
An interceptor sits on a source and pre-processes incoming events, typically by adding a marker to each event's header; the sink side can then treat events with different markers differently. An event looks like event{header:{k=v}, body}.
2.1 Timestamp interceptor
Timestamp Interceptor: adds a timestamp to the header of every intercepted event, e.g. header{timestamp=142526273}.
# declare the interceptor alias
a1.sources.r1.interceptors = i1
# set the interceptor type
a1.sources.r1.interceptors.i1.type = timestamp
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex05_intc01_time.conf --name a1 -Dflume.root.logger=INFO,console
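Send a line into the netcat port and watch the logger output; each event's header should now carry a timestamp field:
telnet hadoop01 44477
# type a line; on the agent console the event header should contain timestamp=<epoch millis>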
2.2 Host interceptor
Host Interceptor: adds the agent's host name (or IP) to the event header.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# use the host name instead of the IP address
a1.sources.r1.interceptors.i1.useIP = false
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# a1.sources.r1.interceptors.i1.useIP = false
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex06_host.conf --name a1 -Dflume.root.logger=INFO,console
2.3 Static interceptor
Static Interceptor: lets you attach a fixed key=value pair to every event header (the most commonly used interceptor).
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex07_static.conf --name a1 -Dflume.root.logger=INFO,console
2.4 Multiple interceptors
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptors on the source, applied in the order listed
a1.sources.r1.interceptors = i1 i2 i3
# first interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477
# second interceptor
a1.sources.r1.interceptors.i2.type = host
# third interceptor
a1.sources.r1.interceptors.i3.type = timestamp
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex08.conf --name a1 -Dflume.root.logger=INFO,console
Comprehensive case: machines A and B both produce logs in real time, mainly access.log, nginx.log and web.log.
Requirement: collect access.log, nginx.log and web.log from A and B, aggregate them on machine C, and write everything to HDFS.
agent1 and agent2 (the same configuration on A and B):
# a1 is the name of this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# sources
# r1: access.log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/log/access.log
# interceptor for r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = logname
a1.sources.r1.interceptors.i1.value = access
# r2: nginx.log
a1.sources.r2.type = exec
a1.sources.r2.command = tail -f /home/hadoop/datas/log/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = logname
a1.sources.r2.interceptors.i2.value = nginx
# r3: web.log
a1.sources.r3.type = exec
a1.sources.r3.command = tail -f /home/hadoop/datas/log/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = logname
a1.sources.r3.interceptors.i3.value = web
# channel
a1.channels.c1.type = memory
# sink type: avro (forward to the aggregating agent on hadoop03)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 55566
# bind the sources and the sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
agent3
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: avro
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 55566
# channel
a1.channels.c1.type = memory
# sink type: hdfs; %{logname} is taken from the event header set by the static interceptors
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/log/%{logname}/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# roll conditions: roll by size only (10 KB), never by time or event count
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
# close idle files after 30 seconds
a1.sinks.k1.hdfs.idleTimeout = 30
# output format: plain text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start on hadoop03:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03_zh.conf --name a1 -Dflume.root.logger=INFO,console
Start on hadoop02:
../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02_zh.conf --name a1 -Dflume.root.logger=INFO,console
Start on hadoop01:
../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01_zh.conf --name a1 -Dflume.root.logger=INFO,console
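After some log lines have been appended on A and B, the aggregated output can be checked on HDFS. Because of the %{logname} and %Y%m%d escapes in the sink path, each log type should land in its own dated directory (the dates and file names below are only an illustration):
hdfs dfs -ls -R /source/log
# expected layout, roughly:
# /source/log/access/20240101/FlumeData.xxxx
# /source/log/nginx/20240101/FlumeData.xxxx
# /source/log/web/20240101/FlumeData.xxxx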
3. High-Availability Setup
Architecture:
host      | agents
hadoop01  | agent1
hadoop02  | agent2, agent4
hadoop03  | agent3, agent5
Description: agent1, agent2 and agent3 collect data; agent4 aggregates it; agent5 is the standby collector.
agent1, agent2, agent3
# agent name: a1 (the same configuration is used on agent1, agent2 and agent3)
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/datas/log/web.log
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Type
a1.sources.r1.interceptors.i1.value = LOGIN
a1.sources.r1.interceptors.i2.type = timestamp
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sinks
# put both sinks into one sink group
a1.sinkgroups = g1
# k1: primary collector on hadoop02
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 52020
# k2: standby collector on hadoop03
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 52020
# sink group with failover for high availability
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# priorities: the higher the number, the higher the priority
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
# maximum backoff (ms) for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
# bindings
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
agent4
# agent name: a2
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# source
a2.sources.r1.type = avro
# bind to whichever host this collector runs on
a2.sources.r1.bind = hadoop02
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# set to whichever host this collector runs on
a2.sources.r1.interceptors.i1.value = hadoop02
# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# sink: logger (print to the console)
a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1
agent5
# agent name: a2
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# source
a2.sources.r1.type = avro
# bind to whichever host this collector runs on
a2.sources.r1.bind = hadoop03
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# set to whichever host this collector runs on
a2.sources.r1.interceptors.i1.value = hadoop03
# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# sink: logger (print to the console)
a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1
Start agent4 on hadoop02 first:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent04.conf --name a2 -Dflume.root.logger=INFO,console
Then start agent5 on hadoop03:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent05.conf --name a2 -Dflume.root.logger=INFO,console
Finally start agent1, agent2 and agent3:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03.conf --name a1 -Dflume.root.logger=INFO,console
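For reference, the same sink group can be switched from failover to load balancing. A hedged sketch of the alternative processor settings (the rest of the agent1 configuration stays the same; not part of the original setup):
a1.sinkgroups.g1.sinks = k1 k2
# spread events across both collectors instead of using a primary/standby pair
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin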