1. Flume Core Components

1.1 Source Types


A source receives events (or generates them through a special mechanism) and puts them, in batches, into one or more channels.


1.1.1 netcat

Data comes from a network port on a host: as soon as data arrives on the specified host and port, it is collected as a data source.

a1.sources.r1.type = netcat
a1.sources.r1.bind = <host to listen on>
a1.sources.r1.port = <port to listen on>
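A quick way to test a netcat source (a minimal sketch; the host and port are placeholders and the nc tool must be installed) is to connect to the listening port and type a few lines, each of which becomes one event:

nc <host> <port>
hello flume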

1.1.2 exec

The data source is the output of a Unix command: Flume listens to the result of a command run against a file. Commonly used commands are cat, tail and head.

a1.sources.r1.type = exec
a1.sources.r1.command = <command whose output to follow, e.g. tail -F /path/to/file>

Example (create the file 1.txt yourself first):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: exec, tail a file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/1.txt

# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex01_exec.conf --name a1 -Dflume.root.logger=INFO,console
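To verify the exec source (assuming the paths from the example above), append a line to 1.txt from another terminal; the logger sink should print it on the agent's console:

echo "hello exec source $(date)" >> /home/hadoop/datas/1.txt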

1.1.3 Spooling Directory

The data source is all files inside a directory, e.g. /datas containing 1.txt, 2.txt, 3.txt.

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = <directory to collect files from>

Example (create the datas directory yourself first):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: spooldir, watch a directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/datas
a1.sources.r1.fileSuffix = .finished

# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex02_spool.conf --name a1 -Dflume.root.logger=INFO,console
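To verify the spooling directory source (assuming the directory above; note that files must not be modified after they are placed in the spool directory), create a file elsewhere and move it in; once it has been ingested, Flume renames it with the configured suffix:

echo "hello spooldir" > /tmp/3.txt
mv /tmp/3.txt /home/hadoop/datas/
# after ingestion the file shows up as 3.txt.finished
ls /home/hadoop/datas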

1.1.4 avro

The data source is an Avro port. This is usually used to chain multiple agents, e.g. one agent sending data to another:

agent1 --avro port--> agent2

a1.sources.r1.type = avro
a1.sources.r1.bind = <host; must match the host configured in the upstream avro sink>
a1.sources.r1.port = <port; must match the port configured in the upstream avro sink>

Example (create the conf files yourself):

First, plan the agents:


  • agent1 on hadoop01: netcat source, memory channel, avro sink
  • agent2 on hadoop02: avro source, memory channel, logger sink

agent1:

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44455

# channel
a1.channels.c1.type = memory

# sink: avro, forwards events to agent2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 44466

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

agent2:

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: avro, receives events from agent1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop02
a1.sources.r1.port = 44466

# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

First start agent2 on hadoop02:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent02_avrosource.conf --name a1 -Dflume.root.logger=INFO,console

Then start agent1 on hadoop01:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex03_agent01_avrosink.conf --name a1 -Dflume.root.logger=INFO,console
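To verify the two-agent chain (assuming the hosts and ports above and that nc is installed), send a few lines to the netcat source on hadoop01; the events should appear on agent2's console on hadoop02:

nc hadoop01 44455
hello avro chain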

1.2 Channel Types


A channel buffers events between the source and the sink; there are two kinds: event-driven and polling.


1.2.1 memory channel

# events are stored in memory
a1.channels.c1.type = memory
# maximum number of events held in the channel
a1.channels.c1.capacity = 10000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity = 10000

1.2.2 file channel

A disk-based channel: events are persisted to local disk.
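A minimal file channel sketch (the directories are placeholders; adapt them to your environment):

# events are persisted to local disk, so they survive an agent restart
a1.channels.c1.type = file
# directory for checkpoint metadata
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
# comma-separated list of directories for event data
a1.channels.c1.dataDirs = /home/hadoop/flume/data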

1.2.3 jdbc

A database-backed channel.
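A minimal sketch; in its default configuration the JDBC channel stores events in an embedded Derby database, so only the type has to be set:

a1.channels.c1.type = jdbc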

1.3 Sink Types


A sink delivers events to the next hop or to the final destination and, on success, removes them from the channel.


1.3.1 avro

Sends events over an Avro port to another agent, typically to that agent's avro source.

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = <host to send to, usually the host of the next agent>
a1.sinks.k1.port = <port to send to>

1.3.2 logger

Prints events to the console.

1.3.3 hdfs

Writes events to HDFS.

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = <HDFS path to write to>
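Besides the path, the HDFS sink has roll and format properties that are commonly tuned; a sketch with illustrative values (a full configuration appears in the comprehensive example later on):

# use the local time when substituting escape sequences such as %Y%m%d in the path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# roll a new file every 128 MB, never by time or by event count
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
# write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text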

Example (create the file 2.txt yourself first):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: exec, tail a file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/2.txt

# channel
a1.channels.c1.type = memory

# sink: hdfs, writes events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/data

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex04_hdfs_sink.conf --name a1 -Dflume.root.logger=INFO,console
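To verify (assuming HDFS is running and the path from the example above), append to 2.txt and then list the target directory:

echo "hello hdfs sink $(date)" >> /home/hadoop/datas/2.txt
hdfs dfs -ls /flume/data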

2. Interceptors


Interceptors sit on the source and do preliminary processing on the incoming events, e.g. adding a tag to each event; the sink side can then handle events with different tags differently. event{header:{k=v}}


2.1 Timestamp Interceptor

The Timestamp Interceptor adds a timestamp to the event header, e.g. header{timestamp=142526273}.

# interceptor alias
a1.sources.r1.interceptors = i1
# interceptor type
a1.sources.r1.interceptors.i1.type = timestamp

Example (write the conf file contents yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptor for this source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex05_intc01_time.conf --name a1 -Dflume.root.logger=INFO,console

2.2 Host Interceptor

The Host Interceptor adds the agent's hostname (or its IP address, when useIP is true) to the event header.

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false

Example (write the conf file contents yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptor for this source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# a1.sources.r1.interceptors.i1.useIP = false

# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex06_host.conf --name a1 -Dflume.root.logger=INFO,console

2.3 Static Interceptor

The Static Interceptor lets you manually add a fixed key=value pair to the event header (the most commonly used interceptor).

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
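A header set by the static interceptor can be referenced downstream, for example to split events into different HDFS directories; a minimal sketch assuming the datacenter key above:

# %{datacenter} is replaced by the header value added by the static interceptor
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/data/%{datacenter}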

Example (write the conf file contents yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptor for this source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477

# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex07_static.conf --name a1 -Dflume.root.logger=INFO,console

2.4 Multiple Interceptors

Example (write the conf file contents yourself):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: netcat, listens on a port
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44477

# interceptors for this source
a1.sources.r1.interceptors = i1 i2 i3
# first interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = netcat-hadoop01-44477

# second interceptor
a1.sources.r1.interceptors.i2.type = host

# third interceptor
a1.sources.r1.interceptors.i3.type = timestamp


# channel
a1.channels.c1.type = memory

# sink: logger, prints events to the console
a1.sinks.k1.type = logger

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/ex08.conf --name a1 -Dflume.root.logger=INFO,console

Comprehensive example: log servers A and B produce logs in real time, mainly access.log, nginx.log and web.log.

Requirement: collect access.log, nginx.log and web.log from machines A and B, aggregate them on machine C, and then write everything to HDFS.

agent1 and agent2 (identical configuration on A and B):

# name of this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# sources
# r1: access.log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/datas/log/access.log
# interceptor for r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = logname
a1.sources.r1.interceptors.i1.value = access

# r2: nginx.log
a1.sources.r2.type = exec
a1.sources.r2.command = tail -f /home/hadoop/datas/log/nginx.log

a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = logname
a1.sources.r2.interceptors.i2.value = nginx

# r3: web.log
a1.sources.r3.type = exec
a1.sources.r3.command = tail -f /home/hadoop/datas/log/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = logname
a1.sources.r3.interceptors.i3.value = web

# channel
a1.channels.c1.type = memory

# sink: avro, forwards events to agent3
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 55566


# bind sources and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1

agent3 (on machine C, i.e. hadoop03):

# a1 is the name of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: avro, receives events from agent1 and agent2
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 55566

# channel
a1.channels.c1.type = memory

# sink: hdfs, one directory per logname and day
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/log/%{logname}/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# roll conditions
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 30

# output file format
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start on hadoop03:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03_zh.conf --name a1 -Dflume.root.logger=INFO,console

Start on hadoop02:

../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02_zh.conf --name a1 -Dflume.root.logger=INFO,console

Start on hadoop01:

../bin/flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01_zh.conf --name a1 -Dflume.root.logger=INFO,console
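To verify the whole pipeline (assuming the log paths above and that HDFS is reachable), generate a few test lines on hadoop01/hadoop02 and then list the target directory on HDFS; one subdirectory per logname value should appear:

# run on hadoop01 / hadoop02 to produce test data
echo "access $(date)" >> /home/hadoop/datas/log/access.log
echo "nginx $(date)" >> /home/hadoop/datas/log/nginx.log
echo "web $(date)" >> /home/hadoop/datas/log/web.log

# check the result on HDFS
hdfs dfs -ls /source/log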

3. High Availability Configuration

Architecture:


  • hadoop01: agent1
  • hadoop02: agent2, agent4
  • hadoop03: agent3, agent5

Description: agent1, agent2 and agent3 collect data, agent4 aggregates it, and agent5 is the standby (backup) collector.

agent1, agent2, agent3:

#agent name: agent1
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/datas/log/web.log
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Type
a1.sources.r1.interceptors.i1.value = LOGIN
a1.sources.r1.interceptors.i2.type = timestamp

# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sinks
# put both sinks into one sink group
a1.sinkgroups = g1
# set k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 52020

# set k2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 52020

# sink priorities for high availability
a1.sinkgroups.g1.sinks = k1 k2
# failover sink processor
a1.sinkgroups.g1.processor.type = failover
# priority 1-10, a higher value wins
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
# maximum backoff (ms) for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
# bindings
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

agent4 (on hadoop02):

#set agent name
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# source
a2.sources.r1.type = avro
## set this to the hostname of the machine this agent runs on
a2.sources.r1.bind = hadoop02
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# set this to the hostname of the machine this agent runs on
a2.sources.r1.interceptors.i1.value = hadoop02

#set channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

#set sink to logger
a2.sinks.k1.type=logger

a2.sinks.k1.channel=c1
a2.sources.r1.channels = c1

agent5 (on hadoop03):

#set agent name
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# source
a2.sources.r1.type = avro
## set this to the hostname of the machine this agent runs on
a2.sources.r1.bind = hadoop03
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
# set this to the hostname of the machine this agent runs on
a2.sources.r1.interceptors.i1.value = hadoop03

#set channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

#set sink to logger
a2.sinks.k1.type=logger

a2.sinks.k1.channel=c1
a2.sources.r1.channels = c1

First start agent4 on hadoop02:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent04.conf --name a2 -Dflume.root.logger=INFO,console

Then start agent5 on hadoop03:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent05.conf --name a2 -Dflume.root.logger=INFO,console

Finally start agent1, agent2 and agent3:

./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent01.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent02.conf --name a1 -Dflume.root.logger=INFO,console
./flume-ng agent --conf conf --conf-file /home/hadoop/apps/apache-flume-1.8.0-bin/conf/agent03.conf --name a1 -Dflume.root.logger=INFO,console
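To verify failover (a sketch assuming the setup above), append to web.log and confirm the events are printed by agent4 on hadoop02; then kill agent4 (Ctrl+C), append again, and the events should now be printed by agent5 on hadoop03; after restarting agent4, traffic switches back to it because k1 has the higher priority:

echo "failover test $(date)" >> /home/hadoop/datas/log/web.log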