Basic usage of Flume
- 1 Basic ways to run Flume
- 1.1 Getting a stream from a port
- 1.2 Getting a stream by monitoring a file
- 1.3 Monitoring a directory
- 2 Many-to-one and one-to-many in Flume
- 2.1 Many-to-one (ManyToOne)
- flume1
- flume2
- flume3
- 2.2 One-to-many (OneToMany)
- flume1
- flume2
- flume3
- 3 Flume interceptors
- 3.1 Adding a timestamp to file names
- 3.2 Adding the hostname to file names
- 3.3 Adding a UUID to event headers
- 3.4 Search-and-replace (regex) interceptor
- 3.5 Regular expressions (Regex)
- 3.6 Custom interceptor
1 Basic ways to run Flume
1.1 Getting a stream from a port
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name NetCat \
--conf-file $FLUME_HOME/jobconf/flume-telnet.conf \
-Dflume.root.logger=INFO,console
Input type: netcat (listens on a port)
Output type: logger (prints to the console; the commented-out lines show a file_roll sink writing to a local directory instead)
sources ---> type=netcat bind=bigdata112 port=44445
sinks ---> type=logger (# file_roll)
channels---> type = memory capacity = 1000 transactionCapacity = 100
#1. Define the agent => NetCat
NetCat.sources = netcatSources
NetCat.sinks = netcatSinks
NetCat.channels = netcatChannels
#2. Define the source
NetCat.sources.netcatSources.type = netcat
NetCat.sources.netcatSources.bind = bigdata112
NetCat.sources.netcatSources.port = 44445
#3. Define the sink
NetCat.sinks.netcatSinks.type = logger
# Write to a local directory instead:
# NetCat.sinks.netcatSinks.type = file_roll
# Note: this directory must be created in advance
# NetCat.sinks.netcatSinks.sink.directory = /opt/test/flume3
#4. Define the channel
NetCat.channels.netcatChannels.type = memory
NetCat.channels.netcatChannels.capacity = 1000
NetCat.channels.netcatChannels.transactionCapacity = 100
#5. Bind the source and sink to the channel
NetCat.sources.netcatSources.channels = netcatChannels
NetCat.sinks.netcatSinks.channel = netcatChannels
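Once the agent is running you can do a quick sanity check (sample commands; assumes the agent was started as shown above):
# From another terminal, connect to the listening port and type anything; the agent console should print it as an event
nc bigdata112 44445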
1.2 Getting a stream by monitoring a file
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name Exec \
--conf-file $FLUME_HOME/jobconf/flume-hdfs.conf \
-Dflume.root.logger=INFO,console
Input type: exec (monitors a file)
Output type: HDFS
sources ---> type=exec command = tail -F /opt/Andy shell = /bin/bash -c
sinks ---> type=hdfs path = hdfs://bigdata111:9000/flume/%H ...
channels---> type = memory capacity = 1000 transactionCapacity = 1000
#1. Define the agent => Exec
Exec.sources = execSources
Exec.sinks = execSinks
Exec.channels = execChannels
#2. Define the source
Exec.sources.execSources.type = exec
Exec.sources.execSources.command = tail -F /opt/Andy
Exec.sources.execSources.shell = /bin/bash -c
#3. Define the sink
Exec.sinks.execSinks.type = hdfs
Exec.sinks.execSinks.hdfs.path = hdfs://bigdata111:9000/flume/%H
#Prefix for uploaded files
Exec.sinks.execSinks.hdfs.filePrefix = logs-
#Whether to roll directories based on time
Exec.sinks.execSinks.hdfs.round = true
#How many time units before creating a new directory
Exec.sinks.execSinks.hdfs.roundValue = 1
#Redefine the time unit
Exec.sinks.execSinks.hdfs.roundUnit = hour
#Whether to use the local timestamp
Exec.sinks.execSinks.hdfs.useLocalTimeStamp = true
#How many events to accumulate before flushing to HDFS
Exec.sinks.execSinks.hdfs.batchSize = 100
#File type; compression is supported
Exec.sinks.execSinks.hdfs.fileType = DataStream
#How long (seconds) before rolling a new file
Exec.sinks.execSinks.hdfs.rollInterval = 600
#Roll size for each file
Exec.sinks.execSinks.hdfs.rollSize = 134217728
#File rolling is independent of the number of events
Exec.sinks.execSinks.hdfs.rollCount = 0
#Minimum number of block replicas
Exec.sinks.execSinks.hdfs.minBlockReplicas = 1
#4. Define the channel
Exec.channels.execChannels.type = memory
Exec.channels.execChannels.capacity = 1000
Exec.channels.execChannels.transactionCapacity = 1000
#5. Bind the source and sink to the channel
Exec.sources.execSources.channels = execChannels
Exec.sinks.execSinks.channel = execChannels
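After starting the agent you can verify it by appending to the monitored file (sample commands; assumes /opt/Andy exists and HDFS is reachable):
# Append a log line; shortly afterwards a logs- prefixed file should appear under /flume/<hour> on HDFS
echo "hello flume" >> /opt/Andy
hdfs dfs -ls /flume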
1.3 Monitoring a directory
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name Spooldir \
--conf-file $FLUME_HOME/jobconf/flume-spooldir.conf \
-Dflume.root.logger=INFO,console
Input type: spooldir (monitors a directory)
Output type: HDFS
sources ---> type=spooldir
sinks ---> type=hdfs path = hdfs://bigdata111:9000/flume/%H ...
channels---> type = memory capacity = 1000 transactionCapacity = 1000
#1. Define the agent => Spooldir
Spooldir.sources = spooldirSources
Spooldir.sinks = spooldirSinks
Spooldir.channels = spooldirChannels
#2. Define the source
Spooldir.sources.spooldirSources.type = spooldir
Spooldir.sources.spooldirSources.spoolDir = /opt/module/apache-flume-1.9.0-bin/flume_logs
Spooldir.sources.spooldirSources.fileSuffix = .COMPLETED
Spooldir.sources.spooldirSources.fileHeader = true
#Ignore (do not upload) any file ending in .tmp
Spooldir.sources.spooldirSources.ignorePattern = ([^ ]*\.tmp)
#3. Define the sink
Spooldir.sinks.spooldirSinks.type = hdfs
Spooldir.sinks.spooldirSinks.hdfs.path = hdfs://bigdata111:9000/flume/%H
#Prefix for uploaded files
Spooldir.sinks.spooldirSinks.hdfs.filePrefix = upload-
#Whether to roll directories based on time
Spooldir.sinks.spooldirSinks.hdfs.round = true
#How many time units before creating a new directory
Spooldir.sinks.spooldirSinks.hdfs.roundValue = 1
#Redefine the time unit
Spooldir.sinks.spooldirSinks.hdfs.roundUnit = hour
#Whether to use the local timestamp
Spooldir.sinks.spooldirSinks.hdfs.useLocalTimeStamp = true
#How many events to accumulate before flushing to HDFS
Spooldir.sinks.spooldirSinks.hdfs.batchSize = 1000
#File type; compression is supported
Spooldir.sinks.spooldirSinks.hdfs.fileType = DataStream
#How long (seconds) before rolling a new file
Spooldir.sinks.spooldirSinks.hdfs.rollInterval = 600
#Roll size for each file
Spooldir.sinks.spooldirSinks.hdfs.rollSize = 134217728
#File rolling is independent of the number of events
Spooldir.sinks.spooldirSinks.hdfs.rollCount = 0
#Minimum number of block replicas
Spooldir.sinks.spooldirSinks.hdfs.minBlockReplicas = 1
#4. Define the channel
Spooldir.channels.spooldirChannels.type = memory
Spooldir.channels.spooldirChannels.capacity = 1000
Spooldir.channels.spooldirChannels.transactionCapacity = 1000
#5. Bind the source and sink to the channel
Spooldir.sources.spooldirSources.channels = spooldirChannels
Spooldir.sinks.spooldirSinks.channel = spooldirChannels
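To verify (sample commands; the file name is made up), drop a new file into the monitored directory; once it has been ingested it is renamed with the .COMPLETED suffix:
cp /opt/test.log /opt/module/apache-flume-1.9.0-bin/flume_logs/
ls /opt/module/apache-flume-1.9.0-bin/flume_logs/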
2 Many-to-one and one-to-many in Flume
2.1 Many-to-one (ManyToOne)
flume1 (monitors a file) and flume2 (listens on a port) send their data via avro to the same port --> flume3 (merges both streams and writes to HDFS)
flume1
Monitors a file
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name ManyToOne1 \
--conf-file $FLUME_HOME/jobconf/ManyToOne/flume1.conf \
-Dflume.root.logger=INFO,console
Input type: exec (monitors a file)
Output type: avro (connects to the next flume's avro source via the configured host and port)
# agent==>ManyToOne1
ManyToOne1.sources =execSources
ManyToOne1.sinks = execSinks
ManyToOne1.channels = execChannels
# Describe/configure the source
ManyToOne1.sources.execSources.type = exec
ManyToOne1.sources.execSources.command = tail -F /opt/Andy
ManyToOne1.sources.execSources.shell = /bin/bash -c
# Describe the sink
ManyToOne1.sinks.execSinks.type = avro
ManyToOne1.sinks.execSinks.hostname = bigdata111
ManyToOne1.sinks.execSinks.port = 4141
# Describe the channel
ManyToOne1.channels.execChannels.type = memory
ManyToOne1.channels.execChannels.capacity = 1000
ManyToOne1.channels.execChannels.transactionCapacity = 100
# Bind the source and sink to the channel
ManyToOne1.sources.execSources.channels = execChannels
ManyToOne1.sinks.execSinks.channel = execChannels
flume2
Listens on a port
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name ManyToOne2 \
--conf-file $FLUME_HOME/jobconf/ManyToOne/flume2.conf \
-Dflume.root.logger=INFO,console
Input type: netcat (port)
Output type: avro (connects to the next flume's avro source via the configured host and port)
#1. Define the agent => ManyToOne2
ManyToOne2.sources=netcatSources
ManyToOne2.sinks = netcatSinks
ManyToOne2.channels = netcatChannels
#2. Define the source
ManyToOne2.sources.netcatSources.type = netcat
ManyToOne2.sources.netcatSources.bind = bigdata111
ManyToOne2.sources.netcatSources.port = 44445
# Output the stream via an avro sink
ManyToOne2.sinks.netcatSinks.type = avro
ManyToOne2.sinks.netcatSinks.hostname = bigdata111
ManyToOne2.sinks.netcatSinks.port = 4141
# Describe the channel
ManyToOne2.channels.netcatChannels.type = memory
ManyToOne2.channels.netcatChannels.capacity = 1000
ManyToOne2.channels.netcatChannels.transactionCapacity = 100
# Bind the source and sink to the channel
ManyToOne2.sources.netcatSources.channels = netcatChannels
ManyToOne2.sinks.netcatSinks.channel = netcatChannels
flume3
Aggregation
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name ManyToOne3 \
--conf-file $FLUME_HOME/jobconf/ManyToOne/flume3.conf \
-Dflume.root.logger=INFO,console
Input type: avro (port)
Output type: HDFS
# 1. Define the agent => ManyToOne3
ManyToOne3.sources = many
ManyToOne3.sinks = one
ManyToOne3.channels = to
# Describe/configure the source
ManyToOne3.sources.many.type = avro
ManyToOne3.sources.many.bind = bigdata111
ManyToOne3.sources.many.port = 4141
# Describe the sink
ManyToOne3.sinks.one.type = hdfs
ManyToOne3.sinks.one.hdfs.path = hdfs://bigdata111:9000/flume3/%H
#Prefix for uploaded files
ManyToOne3.sinks.one.hdfs.filePrefix = flume3-
#Whether to roll directories based on time
ManyToOne3.sinks.one.hdfs.round = true
#How many time units before creating a new directory
ManyToOne3.sinks.one.hdfs.roundValue = 1
#Redefine the time unit
ManyToOne3.sinks.one.hdfs.roundUnit = hour
#Whether to use the local timestamp
ManyToOne3.sinks.one.hdfs.useLocalTimeStamp = true
#How many events to accumulate before flushing to HDFS
ManyToOne3.sinks.one.hdfs.batchSize = 100
#File type; compression is supported
ManyToOne3.sinks.one.hdfs.fileType = DataStream
#How long (seconds) before rolling a new file
ManyToOne3.sinks.one.hdfs.rollInterval = 600
#Roll size for each file (128M)
ManyToOne3.sinks.one.hdfs.rollSize = 134217728
#File rolling is independent of the number of events
ManyToOne3.sinks.one.hdfs.rollCount = 0
#Minimum number of block replicas
ManyToOne3.sinks.one.hdfs.minBlockReplicas = 1
# Describe the channel
ManyToOne3.channels.to.type = memory
ManyToOne3.channels.to.capacity = 1000
ManyToOne3.channels.to.transactionCapacity = 100
# Bind the source and sink to the channel
ManyToOne3.sources.many.channels = to
ManyToOne3.sinks.one.channel = to
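Mind the start-up order: an avro sink needs a reachable avro source, so start flume3 first, then flume1 and flume2. After that you can feed both paths to verify (sample commands):
# Goes through flume1 (file monitoring)
echo "from exec" >> /opt/Andy
# Goes through flume2 (port listening); type anything after connecting
nc bigdata111 44445
# Both streams should end up in the HDFS directory written by flume3
hdfs dfs -ls /flume3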
2.2 One-to-many (OneToMany)
flume1 fans out via avro to two ports --> flume2 (avro port 4141), flume3 (avro port 4142)
flume1
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name OneToMany1 \
--conf-file $FLUME_HOME/jobconf/OneToMany/flume1.conf \
-Dflume.root.logger=INFO,console
Input type: exec (monitors a file)
Output type: multiple avro ports
# agent==>OneToMany1
OneToMany1.sources = one
OneToMany1.sinks = many1 many2
OneToMany1.channels = to1 to2
# Replicate the data flow to every channel
OneToMany1.sources.one.selector.type = replicating
# Describe/configure the source
OneToMany1.sources.one.type = exec
OneToMany1.sources.one.command = tail -F /opt/Andy
OneToMany1.sources.one.shell = /bin/bash -c
# Output through avro serialization to the two ports
OneToMany1.sinks.many1.type = avro
OneToMany1.sinks.many1.hostname = bigdata111
OneToMany1.sinks.many1.port = 4141
OneToMany1.sinks.many2.type = avro
OneToMany1.sinks.many2.hostname = bigdata111
OneToMany1.sinks.many2.port = 4142
# Describe the channel
OneToMany1.channels.to1.type = memory
OneToMany1.channels.to1.capacity = 1000
OneToMany1.channels.to1.transactionCapacity = 100
OneToMany1.channels.to2.type = memory
OneToMany1.channels.to2.capacity = 1000
OneToMany1.channels.to2.transactionCapacity = 100
# Bind the sources and sinks via the channels
OneToMany1.sources.one.channels = to1 to2
OneToMany1.sinks.many1.channel = to1
OneToMany1.sinks.many2.channel = to2
flume2
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name OneToMany2 \
--conf-file $FLUME_HOME/jobconf/OneToMany/flume2.conf \
-Dflume.root.logger=INFO,console
Input type: avro (port)
Output type: HDFS
# agent==>OneToMany2
OneToMany2.sources = avroSource
OneToMany2.sinks = avroSinks
OneToMany2.channels = avroChannels
# avro source listening on port 4141
OneToMany2.sources.avroSource.type = avro
OneToMany2.sources.avroSource.bind = bigdata111
OneToMany2.sources.avroSource.port = 4141
# Describe the sink
OneToMany2.sinks.avroSinks.type = hdfs
OneToMany2.sinks.avroSinks.hdfs.path = hdfs://bigdata111:9000/flume2/%H
#Prefix for uploaded files
OneToMany2.sinks.avroSinks.hdfs.filePrefix = flume2-
#Whether to roll directories based on time
OneToMany2.sinks.avroSinks.hdfs.round = true
#How many time units before creating a new directory
OneToMany2.sinks.avroSinks.hdfs.roundValue = 1
#Redefine the time unit
OneToMany2.sinks.avroSinks.hdfs.roundUnit = hour
#Whether to use the local timestamp
OneToMany2.sinks.avroSinks.hdfs.useLocalTimeStamp = true
#How many events to accumulate before flushing to HDFS
OneToMany2.sinks.avroSinks.hdfs.batchSize = 100
#File type; compression is supported
OneToMany2.sinks.avroSinks.hdfs.fileType = DataStream
#How long (seconds) before rolling a new file
OneToMany2.sinks.avroSinks.hdfs.rollInterval = 600
#Roll size for each file (roughly 128M)
OneToMany2.sinks.avroSinks.hdfs.rollSize = 134217700
#File rolling is independent of the number of events
OneToMany2.sinks.avroSinks.hdfs.rollCount = 0
#Minimum number of block replicas
OneToMany2.sinks.avroSinks.hdfs.minBlockReplicas = 1
# Describe the channel
OneToMany2.channels.avroChannels.type = memory
OneToMany2.channels.avroChannels.capacity = 1000
OneToMany2.channels.avroChannels.transactionCapacity = 100
# Bind the source and sink to the channel
OneToMany2.sources.avroSource.channels = avroChannels
OneToMany2.sinks.avroSinks.channel = avroChannels
flume3
flume-ng agent \
--conf $FLUME_HOME/conf/ \
--name OneToMany3 \
--conf-file $FLUME_HOME/jobconf/OneToMany/flume3.conf \
-Dflume.root.logger=INFO,console
Input type: avro
Output type: straight to a local directory
# agent==>OneToMany3
OneToMany3.sources = avroSource
OneToMany3.sinks = avroSinks
OneToMany3.channels = avroChannels
# avro source listening on port 4142
OneToMany3.sources.avroSource.type = avro
OneToMany3.sources.avroSource.bind = bigdata111
OneToMany3.sources.avroSource.port = 4142
# Write to a local directory
OneToMany3.sinks.avroSinks.type = file_roll
#Note: this directory must be created in advance
OneToMany3.sinks.avroSinks.sink.directory = /opt/test/flume3
# Describe the channel
OneToMany3.channels.avroChannels.type = memory
OneToMany3.channels.avroChannels.capacity = 1000
OneToMany3.channels.avroChannels.transactionCapacity = 100
# Bind the source and sink to the channel
OneToMany3.sources.avroSource.channels = avroChannels
OneToMany3.sinks.avroSinks.channel = avroChannels
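Likewise, start the downstream flume2 and flume3 before flume1. After appending a line to the monitored file, each of the two outputs should receive a copy (sample commands):
echo "fan out" >> /opt/Andy
hdfs dfs -ls /flume2      # HDFS output of flume2
ls /opt/test/flume3       # local file output of flume3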
3 Flume interceptors
- Interceptors operate on the data: besides filtering and trimming events they can also add content, and they can manipulate the event headers and hence things like output file names
- Interceptors run between the source and the channel
3.1 Adding a timestamp to file names
flume-ng agent \
-n Timestamp \
-f $FLUME_HOME/jobconf/interceptors/flume-Timestamp.conf \
-c $FLUME_HOME/conf \
-Dflume.root.logger=INFO,console
Input: spooldir (monitors a directory)
Interceptor: timestamp; adds a timestamp to each event's headers, which drives the time escapes (such as %H) in the HDFS path and file names
Output: HDFS
#Define the agent and the names of its source, channel and sink
Timestamp.sources = spooldirSources
Timestamp.channels = spooldirChannels
Timestamp.sinks = spooldirSinks
#Configure the source
Timestamp.sources.spooldirSources.type = spooldir
Timestamp.sources.spooldirSources.spoolDir = /opt/module/apache-flume-1.9.0-bin/flume_logs
#Define the interceptor that adds a timestamp to each event's headers
Timestamp.sources.spooldirSources.interceptors = i1
Timestamp.sources.spooldirSources.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
#Configure the channel
Timestamp.channels.spooldirChannels.type = memory
Timestamp.channels.spooldirChannels.capacity = 10000
Timestamp.channels.spooldirChannels.transactionCapacity = 100
#Configure the sink
Timestamp.sinks.spooldirSinks.type = hdfs
Timestamp.sinks.spooldirSinks.hdfs.path = hdfs://bigdata111:9000/flume-interceptors/%H
Timestamp.sinks.spooldirSinks.hdfs.filePrefix = events-
Timestamp.sinks.spooldirSinks.hdfs.fileType = DataStream
#Do not roll files based on the number of events
Timestamp.sinks.spooldirSinks.hdfs.rollCount = 0
#Roll a new file once it reaches 128M on HDFS
Timestamp.sinks.spooldirSinks.hdfs.rollSize = 134217728
#Roll a new file every 60 seconds
Timestamp.sinks.spooldirSinks.hdfs.rollInterval = 60
#Wire up the source, channel and sink
Timestamp.sources.spooldirSources.channels = spooldirChannels
Timestamp.sinks.spooldirSinks.channel = spooldirChannels
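The timestamp interceptor puts the current time in milliseconds into each event's headers under the key timestamp; the %H escape in the HDFS sink path is resolved from exactly that header, so useLocalTimeStamp is not needed here. A quick check (sample commands; the file name is made up):
cp /opt/test.log /opt/module/apache-flume-1.9.0-bin/flume_logs/
# HDFS should show a directory named after the hour, e.g. /flume-interceptors/14
hdfs dfs -ls /flume-interceptors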
3.2 Adding the hostname to file names
flume-ng agent \
-n Host \
-f $FLUME_HOME/jobconf/interceptors/flume-HostName.conf \
-c $FLUME_HOME/conf \
-Dflume.root.logger=INFO,console
Input: spooldir (monitors a directory)
Interceptor: host; writes the hostname into each event's headers (agentHost), which is then referenced in the file name prefix
Output: HDFS
#Define the agent and the names of its source, channel and sink
Host.sources = spooldirSources
Host.channels = spooldirChannels
Host.sinks = spooldirSinks
#Configure the source
Host.sources.spooldirSources.type = spooldir
Host.sources.spooldirSources.spoolDir = /opt/module/apache-flume-1.9.0-bin/flume_logs
#Define the interceptor that adds the hostname to each event's headers
Host.sources.spooldirSources.interceptors = i1
Host.sources.spooldirSources.interceptors.i1.type = host
Host.sources.spooldirSources.interceptors.i1.useIP = false
Host.sources.spooldirSources.interceptors.i1.hostHeader = agentHost
#Configure the channel
Host.channels.spooldirChannels.type = memory
Host.channels.spooldirChannels.capacity = 1000
Host.channels.spooldirChannels.transactionCapacity = 100
#Configure the sink
Host.sinks.spooldirSinks.type = hdfs
Host.sinks.spooldirSinks.hdfs.path = hdfs://bigdata111:9000/flume-interceptors/%H
# File name prefix
Host.sinks.spooldirSinks.hdfs.filePrefix = events-%{agentHost}
# File name suffix
Host.sinks.spooldirSinks.hdfs.fileSuffix = .log
Host.sinks.spooldirSinks.hdfs.fileType = DataStream
Host.sinks.spooldirSinks.hdfs.writeFormat = Text
Host.sinks.spooldirSinks.hdfs.useLocalTimeStamp = true
#Do not roll files based on the number of events
Host.sinks.spooldirSinks.hdfs.rollCount = 0
#Roll a new file once it reaches 128M on HDFS
Host.sinks.spooldirSinks.hdfs.rollSize = 134217728
#Roll a new file every 60 seconds
Host.sinks.spooldirSinks.hdfs.rollInterval = 60
#Wire up the source, channel and sink
Host.sources.spooldirSources.channels = spooldirChannels
Host.sinks.spooldirSinks.channel = spooldirChannels
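To verify (sample commands; the file name is made up): after dropping in a file, the generated HDFS file names should carry the agent's hostname in their prefix (with useIP = false it is the hostname rather than the IP), e.g. events-bigdata111…:
cp /opt/test.log /opt/module/apache-flume-1.9.0-bin/flume_logs/
hdfs dfs -ls /flume-interceptors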
3.3 Adding a UUID to event headers
flume-ng agent \
-n UUID \
-f $FLUME_HOME/jobconf/interceptors/flume-UUID.conf \
-c $FLUME_HOME/conf \
-Dflume.root.logger=INFO,console
Input: exec (monitors a file)
Interceptor: adds a globally unique UUID to each event's headers, which is useful for later deduplication
Output: logger (so the headers can be inspected)
#Define the agent and the names of its source, channel and sink
UUID.sources = execSources
UUID.channels = execChannels
UUID.sinks = execSinks
#Configure the source
UUID.sources.execSources.type = exec
UUID.sources.execSources.command = tail -F /opt/Andy
UUID.sources.execSources.shell = /bin/bash -c
#Define the interceptor that generates the UUID header
UUID.sources.execSources.interceptors = i1
UUID.sources.execSources.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
UUID.sources.execSources.interceptors.i1.preserveExisting = true
UUID.sources.execSources.interceptors.i1.prefix = UUID_
#Define the sink
UUID.sinks.execSinks.type = logger
#Configure the channel
UUID.channels.execChannels.type = memory
UUID.channels.execChannels.capacity = 1000
UUID.channels.execChannels.transactionCapacity = 100
#Wire up the source, channel and sink
UUID.sources.execSources.channels = execChannels
UUID.sinks.execSinks.channel = execChannels
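The UUIDInterceptor writes the UUID into a header named id by default, and the configured prefix is prepended to the UUID value. After appending a line, the event headers printed by the logger sink should contain an id starting with UUID_ (sample command):
echo "uuid test" >> /opt/Andy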
3.4 Search-and-replace (regex) interceptor
flume-ng agent \
-n Search \
-f $FLUME_HOME/jobconf/interceptors/flume-Search.conf \
-c $FLUME_HOME/conf \
-Dflume.root.logger=INFO,console
Input: exec (monitors a file)
Interceptor: matches text and replaces it
Output: logger (for easy inspection)
#Define the agent and the names of its source, channel and sink
Search.sources = execSources
Search.channels = execChannels
Search.sinks = execSinks
#Configure the source
Search.sources.execSources.type = exec
Search.sources.execSources.command = tail -F /opt/Andy
Search.sources.execSources.shell = /bin/bash -c
#Define the interceptor: search_replace
Search.sources.execSources.interceptors = i1
Search.sources.execSources.interceptors.i1.type = search_replace
#Define the regex: replace every run of digits with itstar
Search.sources.execSources.interceptors.i1.searchPattern = [0-9]+
Search.sources.execSources.interceptors.i1.replaceString = itstar
Search.sources.execSources.interceptors.i1.charset = UTF-8
#Define the sink
Search.sinks.execSinks.type = logger
#Configure the channel
Search.channels.execChannels.type = memory
Search.channels.execChannels.capacity = 1000
Search.channels.execChannels.transactionCapacity = 100
#Wire up the source, channel and sink
Search.sources.execSources.channels = execChannels
Search.sinks.execSinks.channel = execChannels
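To verify (sample command): append a line containing digits; when the logger prints it, every run of digits has been replaced with itstar, e.g. a123b45 becomes aitstarbitstar:
echo "a123b45" >> /opt/Andy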
3.5 Regular expressions (Regex)
flume-ng agent \
-n Regex \
-f $FLUME_HOME/jobconf/interceptors/flume-RegexFilter.conf \
-c $FLUME_HOME/conf \
-Dflume.root.logger=INFO,console
(for the regex_extractor example, point -f at flume-RegexExtractor.conf instead)
Input: exec (monitors a file)
Interceptor: regex_filter (filters events by regex: with excludeEvents=true, matching events are dropped; with false, only matching events are kept)
Output: logger (for easy inspection)
#Define the agent and the names of its source, channel and sink
Regex.sources = execSources
Regex.channels = execChannels
Regex.sinks = execSinks
#Configure the source
Regex.sources.execSources.type = exec
Regex.sources.execSources.command = tail -F /opt/Andy
Regex.sources.execSources.shell = /bin/bash -c
#Define the interceptor: regex_filter
Regex.sources.execSources.interceptors = i1
Regex.sources.execSources.interceptors.i1.type = regex_filter
#Define the regex rule
Regex.sources.execSources.interceptors.i1.regex = ^A.*
Regex.sources.execSources.interceptors.i1.excludeEvents = true
#Define the sink
Regex.sinks.execSinks.type = logger
#Configure the channel
Regex.channels.execChannels.type = memory
Regex.channels.execChannels.capacity = 1000
Regex.channels.execChannels.transactionCapacity = 100
#Wire up the source, channel and sink
Regex.sources.execSources.channels = execChannels
Regex.sinks.execSinks.channel = execChannels
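To verify (sample commands): with excludeEvents = true, lines matching ^A.* are dropped and everything else is printed:
echo "Apple" >> /opt/Andy     # starts with A, filtered out
echo "Banana" >> /opt/Andy    # no match, kept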
Input: exec (monitors a file)
Interceptor: regex_extractor (uses regex groups to extract content into the event headers; the data itself is not filtered)
Output: logger (for easy inspection)
#Define the agent and the names of its source, channel and sink
Regex.sources = execSources
Regex.channels = execChannels
Regex.sinks = execSinks
#Configure the source
Regex.sources.execSources.type = exec
Regex.sources.execSources.command = tail -F /opt/Andy
Regex.sources.execSources.shell = /bin/bash -c
#Define the interceptor: regex_extractor
Regex.sources.execSources.interceptors = i1
Regex.sources.execSources.interceptors.i1.type = regex_extractor
#Define the regex rule
Regex.sources.execSources.interceptors.i1.regex = hostname is (.*?) ip is (.*)
Regex.sources.execSources.interceptors.i1.serializers = s1 s2
Regex.sources.execSources.interceptors.i1.serializers.s1.name = cookieid
Regex.sources.execSources.interceptors.i1.serializers.s2.name = ip
#Define the sink
Regex.sinks.execSinks.type = logger
#Configure the channel
Regex.channels.execChannels.type = memory
Regex.channels.execChannels.capacity = 1000
Regex.channels.execChannels.transactionCapacity = 100
#Wire up the source, channel and sink
Regex.sources.execSources.channels = execChannels
Regex.sinks.execSinks.channel = execChannels
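To verify (sample command; the hostname and IP are made-up values): append a line that matches the regex, and the two groups are written into the cookieid and ip headers while the body itself is unchanged:
echo "hostname is bigdata111 ip is 192.168.1.111" >> /opt/Andy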
3.6 Custom interceptor
Running it
$FLUME_HOME/bin/flume-ng agent \
-c conf/ \
-n myInterceptor \
-f $FLUME_HOME/jobconf/interceptors/flume-myInterceptor.conf \
-C /opt/test/sparkStudy-1.0-SNAPSHOT.jar \
-Dflume.root.logger=DEBUG,console
The jar
package bigdata12.homework11.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;

/**
 * Custom Flume interceptor:
 * converts the event body to upper case.
 * @date 2019/7/3
 * @author Fantome
 */
public class FlumeInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    /**
     * Rewrites a single event.
     * @param event the incoming event
     * @return the event with its body replaced
     */
    @Override
    public Event intercept(Event event) {
        // The body is a byte array (the headers are a Map<String, String>; they are left untouched here)
        byte[] bodyBytes = event.getBody();
        // Convert the bytes to a String and upper-case it
        String strBody = new String(bodyBytes).toUpperCase();
        // Convert the String back to bytes
        byte[] outBytes = strBody.getBytes();
        // Reload the event with the new body
        event.setBody(outBytes);
        // Pass the event on to the next stage
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        // Collect the rewritten events in a new list
        List<Event> list1 = new ArrayList<>();
        for (Event e : list) {
            list1.add(intercept(e));
        }
        return list1;
    }

    /**
     * This is the class referenced in the conf file:
     * bigdata12.homework11.flume.FlumeInterceptor$Builder
     */
    public static class Builder implements Interceptor.Builder {
        /**
         * Creates and returns the custom interceptor.
         */
        @Override
        public Interceptor build() {
            return new FlumeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }

    @Override
    public void close() {
    }
}
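Packaging and wiring it in (a sketch, assuming a Maven project whose jar matches the -C path in the run command):
mvn clean package
cp target/sparkStudy-1.0-SNAPSHOT.jar /opt/test/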
The conf file
Input: exec (monitors a file)
Interceptor: brought in from the jar; (important) type is the fully qualified class name plus $Builder, i.e. the inner class that implements Interceptor.Builder
Output: HDFS
#1. Define the agent => myInterceptor
myInterceptor.sources = execSources
myInterceptor.sinks =execSinks
myInterceptor.channels = execChannels
# Describe/configure the source
myInterceptor.sources.execSources.type = exec
myInterceptor.sources.execSources.command = tail -F /opt/Andy
myInterceptor.sources.execSources.interceptors = i1
#(Important) fully qualified class name plus $Builder
myInterceptor.sources.execSources.interceptors.i1.type = bigdata12.homework11.flume.FlumeInterceptor$Builder
# Describe the sink
myInterceptor.sinks.execSinks.type = hdfs
myInterceptor.sinks.execSinks.hdfs.path = /ToUpCase1
myInterceptor.sinks.execSinks.hdfs.filePrefix = events-
myInterceptor.sinks.execSinks.hdfs.round = true
myInterceptor.sinks.execSinks.hdfs.roundValue = 10
myInterceptor.sinks.execSinks.hdfs.roundUnit = minute
myInterceptor.sinks.execSinks.hdfs.rollInterval = 3
myInterceptor.sinks.execSinks.hdfs.rollSize = 20
myInterceptor.sinks.execSinks.hdfs.rollCount = 5
myInterceptor.sinks.execSinks.hdfs.batchSize = 1
myInterceptor.sinks.execSinks.hdfs.useLocalTimeStamp = true
#Generated file type; the default is SequenceFile, while DataStream produces plain text
myInterceptor.sinks.execSinks.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
myInterceptor.channels.execChannels.type = memory
myInterceptor.channels.execChannels.capacity = 1000
myInterceptor.channels.execChannels.transactionCapacity = 100
# Bind the source and sink to the channel
myInterceptor.sources.execSources.channels = execChannels
myInterceptor.sinks.execSinks.channel = execChannels
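To verify (sample commands): append a lower-case line; the file content under /ToUpCase1 on HDFS should come out upper-cased:
echo "hello interceptor" >> /opt/Andy
hdfs dfs -cat /ToUpCase1/*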