1. Flume Core Components
1.1 Source Types
A source receives events (or generates them through a special mechanism) and puts them, in batches, into one or more channels.
1.1.1 netcat
Listens on a network port of a host; whatever data arrives on the specified host and port is collected as the data source.
a1.sources.r1.type = netcat
a1.sources.r1.bind = the host to listen on
a1.sources.r1.port = the port to listen on
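A minimal, self-contained sketch of a netcat source feeding a logger sink (the host name, port and file name below are placeholders of my choosing, not part of the original):
# netcat_demo.conf (hypothetical file name)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
With the agent running, push test lines into it from another terminal, e.g. telnet hadoop01 44444 (or nc hadoop01 44444), and type a few lines; each line should show up as an event on the logger sink.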
1.1.2 exec
The data source is the output of a Unix command: Flume runs the command and keeps collecting whatever it writes to standard output. Commonly used commands are cat, tail, and head.
a1.sources.r1.type = exec
a1.sources.r1.command = the command to run, e.g. tail -F /path/to/file
Example (create the 1.txt file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: exec, tail the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/1.txt
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex01_exec.conf --name a1 -Dflume.root.logger=INFO,console
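To see events flow, append a line to the tailed file from another shell (the path matches the config above):
echo "hello exec source" >> /home/hadoop/datas/1.txt
The new line should be printed by the logger sink almost immediately.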
1.1.3 Spooling Directory
The data source is every file dropped into a directory, e.g. /datas containing 1.txt, 2.txt, 3.txt.
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = the directory to collect files from
Example (create the datas directory yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: spooldir
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/datas
a1.sources.r1.fileSuffix = .finished
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex02_spool.conf --name a1 -Dflume.root.logger=INFO,console
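To test it, copy a file (any existing text file of yours) into the spooled directory; once Flume has fully ingested it, the file is renamed with the configured suffix. Do not modify a file after dropping it into the directory.
cp /home/hadoop/1.txt /home/hadoop/datas/
# after ingestion the file becomes /home/hadoop/datas/1.txt.finished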
1.1.4 avro
The data source comes from an Avro port. This is typically used to chain agents together, i.e. one agent sends data to another:
agent1 -- avro port --> agent2
a1.sources.r1.type = avro
a1.sources.r1.bind = the host to bind; must match the hostname configured on the upstream avro sink
a1.sources.r1.port = the port to bind; must match the port configured on the upstream avro sink
Example (create the conf files yourself):
Plan the agents first:
- agent1 on hadoop01: source = netcat, channel = memory, sink = avro
- agent2 on hadoop02: source = avro, channel = memory, sink = logger
agent1:
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44455
# channel
a1.channels.c1.type = memory
# sink type: avro (forward to agent2)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 44466
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
agent2:
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: avro
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop02
a1.sources.r1.port = 44466
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start agent2 on hadoop02 first:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent02_avrosource.conf --name a1 -Dflume.root.logger=INFO,console
Then start agent1 on hadoop01:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent01_avrosink.conf --name a1 -Dflume.root.logger=INFO,console
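With both agents up, send a test line into agent1's netcat port; it should travel through the avro hop and be printed by agent2's logger sink:
telnet hadoop01 44455
# type any line and press Enter; it should appear on the agent2 console on hadoop02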
1.2 Channel Types
A channel is the buffer that sits between sources and sinks; events stay in the channel until a sink removes them. Common implementations are memory, file, and jdbc.
1.2.1 memory channel
# events are stored in memory
a1.channels.c1.type = memory
# maximum number of events the channel can hold
a1.channels.c1.capacity = 10000
# maximum number of events per transaction (each put/take batch)
a1.channels.c1.transactionCapacity = 10000
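Besides the two properties above, the memory channel can also be capped by size. A hedged sketch (the values are illustrative, not from the original):
# optional: limit the total memory used by event bodies, in bytes
a1.channels.c1.byteCapacity = 800000
# optional: percentage of byteCapacity reserved for event headers
a1.channels.c1.byteCapacityBufferPercentage = 20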
1.2.2 file channel
Backed by disk, so events survive an agent restart at the cost of throughput.
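A minimal sketch of a file channel configuration (the directories are placeholders I chose; make sure they exist and are writable by the Flume user):
a1.channels.c1.type = file
# where the channel stores its checkpoint
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
# comma-separated list of directories that hold the event data
a1.channels.c1.dataDirs = /home/hadoop/flume/data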
1.2.3 jdbc
Stores events in a database.
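As far as I know the JDBC channel ships with an embedded Derby database, so the minimal configuration is just the type:
a1.channels.c1.type = jdbc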
1.3 Sink Types
The sink delivers events to the next hop or to their final destination and, on success, removes them from the channel.
1.3.1 avro
Sends events to the Avro port of another agent.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = the host to send to, normally the host of the next agent
a1.sinks.k1.port = the port the next agent's avro source listens on
1.3.2 logger
Prints the events to the console.
1.3.3 hdfs
Writes the collected events to HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = the HDFS path to write to
Example (create the 2.txt file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: exec, tail the file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/2.txt
# channel
a1.channels.c1.type = memory
# sink type: hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/data
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex04_hdfs_sink.conf --name a1 -Dflume.root.logger=INFO,console
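Append a few lines to 2.txt and then check HDFS. By default the output files use the FlumeData prefix (files still being written carry a .tmp suffix):
echo "hello hdfs sink" >> /home/hadoop/datas/2.txt
hdfs dfs -ls /flume/data
hdfs dfs -cat /flume/data/FlumeData.*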
2. Interceptors
An interceptor sits on a source and pre-processes incoming events, typically by adding a marker to each event's header; the sink side can then treat events with different markers differently. An event looks like event{header:{k=v}, body}.
2.1 Timestamp interceptor
Timestamp Interceptor: adds a timestamp to the header of every intercepted event, e.g. header{timestamp=142526273}.
# declare the interceptor alias
a1.sources.r1.interceptors = i1
# set the interceptor type
a1.sources.r1.interceptors.i1.type = timestamp
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex05_intc01_time.conf --name a1 -Dflume.root.logger=INFO,console
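Send a line into the netcat port and watch the logger output; each event's header should now carry a timestamp field:
telnet hadoop01 44477
# type a line; on the agent console the event header should contain timestamp=<epoch millis>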
2.2 Host interceptor
Host Interceptor: adds the agent's host name (or IP) to the event header.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# use the host name instead of the IP address
a1.sources.r1.interceptors.i1.useIP = false
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# a1.sources.r1.interceptors.i1.useIP = false
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex06_host.conf --name a1 -Dflume.root.logger=INFO,console
2.3 Static interceptor
Static Interceptor: lets you attach a fixed key=value pair to every event header (the most commonly used interceptor).
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex07_static.conf --name a1 -Dflume.root.logger=INFO,console
2.4 Multiple interceptors
Example (write the conf file yourself):
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: netcat, listening on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477
# interceptors on the source, applied in the order listed
a1.sources.r1.interceptors = i1 i2 i3
# first interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477
# second interceptor
a1.sources.r1.interceptors.i2.type = host
# third interceptor
a1.sources.r1.interceptors.i3.type = timestamp
# channel
a1.channels.c1.type = memory
# sink type: logger (print to the console)
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start it:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex08.conf --name a1 -Dflume.root.logger=INFO,console
Comprehensive case: machines A and B both produce logs in real time, mainly access.log, nginx.log and web.log.
Requirement: collect access.log, nginx.log and web.log from A and B, aggregate them on machine C, and write everything to HDFS.
agent1 and agent2 (the same configuration on A and B):
# a1 is the name of this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# sources
# r1: access.log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/log/access.log
# interceptor for r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = logname
a1.sources.r1.interceptors.i1.value = access
# r2: nginx.log
a1.sources.r2.type = exec
a1.sources.r2.command = tail -f /home/hadoop/datas/log/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = logname
a1.sources.r2.interceptors.i2.value = nginx
# r3: web.log
a1.sources.r3.type = exec
a1.sources.r3.command = tail -f /home/hadoop/datas/log/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = logname
a1.sources.r3.interceptors.i3.value = web
# channel
a1.channels.c1.type = memory
# sink type: avro (forward to the aggregating agent on hadoop03)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 55566
# bind the sources and the sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
agent3
# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source: avro
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 55566
# channel
a1.channels.c1.type = memory
# sink type: hdfs; %{logname} is taken from the event header set by the static interceptors
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/log/%{logname}/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# roll conditions: roll by size only (10 KB), never by time or event count
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
# close idle files after 30 seconds
a1.sinks.k1.hdfs.idleTimeout = 30
# output format: plain text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start on hadoop03:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03_zh.conf --name a1 -Dflume.root.logger=INFO,console
Start on hadoop02:
../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02_zh.conf --name a1 -Dflume.root.logger=INFO,console
Start on hadoop01:
../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01_zh.conf --name a1 -Dflume.root.logger=INFO,console
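After some log lines have been appended on A and B, the aggregated output can be checked on HDFS. Because of the %{logname} and %Y%m%d escapes in the sink path, each log type should land in its own dated directory (the dates and file names below are only an illustration):
hdfs dfs -ls -R /source/log
# expected layout, roughly:
# /source/log/access/20240101/FlumeData.xxxx
# /source/log/nginx/20240101/FlumeData.xxxx
# /source/log/web/20240101/FlumeData.xxxx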
3. High-Availability Setup
Architecture:
host      | agents
hadoop01  | agent1
hadoop02  | agent2, agent4
hadoop03  | agent3, agent5
Description: agent1, agent2 and agent3 collect data; agent4 aggregates it; agent5 is the standby collector.
agent1, agent2, agent3
# agent name: a1 (the same configuration is used on agent1, agent2 and agent3)
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/datas/log/web.log
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Type
a1.sources.r1.interceptors.i1.value = LOGIN
a1.sources.r1.interceptors.i2.type = timestamp
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# sinks
# put both sinks into one sink group
a1.sinkgroups = g1
# k1: primary collector on hadoop02
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 52020
# k2: standby collector on hadoop03
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 52020
# sink group with failover for high availability
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# priorities: the higher the number, the higher the priority
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
# maximum backoff (ms) for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
# bindings
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
agent4
# agent name: a2
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# source
a2.sources.r1.type = avro
# bind to whichever host this collector runs on
a2.sources.r1.bind = hadoop02
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# set to whichever host this collector runs on
a2.sources.r1.interceptors.i1.value = hadoop02
# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# sink: logger (print to the console)
a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1
agent5
# agent name: a2
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# source
a2.sources.r1.type = avro
# bind to whichever host this collector runs on
a2.sources.r1.bind = hadoop03
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# set to whichever host this collector runs on
a2.sources.r1.interceptors.i1.value = hadoop03
# channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# sink: logger (print to the console)
a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1
Start agent4 on hadoop02 first:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent04.conf --name a2 -Dflume.root.logger=INFO,console
Then start agent5 on hadoop03:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent05.conf --name a2 -Dflume.root.logger=INFO,console
Finally start agent1, agent2 and agent3:
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03.conf --name a1 -Dflume.root.logger=INFO,console
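For reference, the same sink group can be switched from failover to load balancing. A hedged sketch of the alternative processor settings (the rest of the agent1 configuration stays the same; not part of the original setup):
a1.sinkgroups.g1.sinks = k1 k2
# spread events across both collectors instead of using a primary/standby pair
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin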