介绍

  1. This article gives a brief introduction to the Flume framework, covering the topics below
  2. How to install Flume on Linux
  3. How to read a log file dynamically
  4. How to use Flume to store files on HDFS
  5. How to use Flume to store files in a specified HDFS directory
  6. How to use Flume to store files on HDFS in a partitioned layout
  7. How to monitor the contents of a directory dynamically
  8. How to filter out files that should not be loaded into Flume
  9. How to monitor multiple files and directories dynamically

1: A Brief Introduction to Flume and Its Installation

1.1: Introduction to Flume

(1) Distributed:

Multiple Flume agents can run across many machines, because log files are usually spread over different machines.

(2) Collecting, aggregating, and moving:

Flume collects, aggregates, and moves log data.

(3) The agent and its components:

        source: reads data from a data source, turns it into a stream of events, and hands the events to the channel

        channel: acts like a queue, temporarily buffering the events sent over by the source

        sink: reads events from the channel and delivers them to the destination

(4) Flume is simple to use: an agent is defined entirely by one configuration file.

1.2: Flume Versions

flume-ng (next generation): the version in current use

flume-og (original generation): the old version, now obsolete

1.3: Flume Installation

Requirements: Linux, with a working Hadoop environment and the JDK already installed.

Installation and configuration:

(1) Rename the environment template and configure the JDK

mv flume-env.sh.template  flume-env.sh
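After renaming, point flume-env.sh at the JDK. A minimal sketch, assuming a JDK installed under /opt (the exact path is an assumption; use your own installation path):

# in conf/flume-env.sh -- the JDK path below is only an example
export JAVA_HOME=/opt/jdk1.8.0_151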


(2) Tell Flume where HDFS is:

Method 1: declare HADOOP_HOME as a global (system-wide) environment variable.
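For example, the variable could be exported system-wide (a sketch; the Hadoop path is taken from the cp command in method 2 and may differ on your machine):

# append to /etc/profile, then reload with: source /etc/profile
export HADOOP_HOME=/opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6
export PATH=$PATH:$HADOOP_HOME/bin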

Method 2: copy core-site.xml and hdfs-site.xml into Flume's conf directory (recommended). Run the copy from Flume's conf directory:

cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml  ./

Method 3: give the sink an absolute HDFS URI directly when you use it:

 hdfs://hostname:8020/aa/bb

(3) Add the HDFS-related JARs to Flume's lib directory: the HDFS API is needed at run time.
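A sketch of the copy, assuming the CDH tarball layout used elsewhere in this article; the exact JAR names, versions, and locations depend on your Hadoop distribution:

# run from the Flume home directory
cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/share/hadoop/common/hadoop-common-2.6.0-cdh5.7.6.jar lib/
cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/share/hadoop/hdfs/hadoop-hdfs-2.6.0-cdh5.7.6.jar lib/
cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/share/hadoop/common/lib/hadoop-auth-2.6.0-cdh5.7.6.jar lib/
cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/share/hadoop/common/lib/commons-configuration-1.6.jar lib/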

Test Case 1: Read the Hive log and print it to the console

flume-conf.properties configuration file:

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell=/bin/sh -c


# defined channel
a1.channels.c1.type = memory
# maximum number of events held in the channel
a1.channels.c1.capacity=1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity=100


# defined sink
a1.sinks.k1.type = logger

# bind the source and the sink to the channel
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Run the following command from the Flume home directory:

bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties -Dflume.root.logger=INFO,console

Result: the console shows each event body as raw bytes, so the log content is hard to read directly.


Test Case 2: Read the Hive log and write it to HDFS

flume-conf.properties configuration file:

# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c


# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/hdfs2/
# set the file type and write format to plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Command: the same as in Case 1.

Result on HDFS:
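The output can be inspected from the command line, for example (run from the Hadoop home directory; FlumeData is the HDFS sink's default file-name prefix):

bin/hdfs dfs -ls /flume/hdfs2/
bin/hdfs dfs -cat /flume/hdfs2/FlumeData.*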


File contents:

2019-07-15 05:34:11,930 INFO  [main]: ql.Driver (Driver.java:compile(500)) - Semantic Analysis Completed
2019-07-15 05:34:12,060 INFO  [main]: ql.Driver (Driver.java:getSchema(266)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
2019-07-15 05:34:12,646 INFO  [main]: ql.Driver (Driver.java:compile(607)) - Completed compiling command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812); Time taken: 1.595 seconds
2019-07-15 05:34:12,647 INFO  [main]: ql.Driver (Driver.java:checkConcurrency(186)) - Concurrency mode is disabled, not creating a lock manager
2019-07-15 05:34:12,647 INFO  [main]: ql.Driver (Driver.java:execute(1598)) - Executing command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812): show tables
2019-07-15 05:34:12,665 INFO  [main]: ql.Driver (Driver.java:launchTask(1968)) - Starting task [Stage-0:DDL] in serial mode
2019-07-15 05:34:12,830 INFO  [main]: ql.Driver (Driver.java:execute(1877)) - Completed executing command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812); Time taken: 0.183 seconds

Test Case 3: Controlling the size of files stored on HDFS (avoiding the small-file problem)

flume-conf.properties configuration file:

# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c


# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/hdfs1/
# set the file type and write format to plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

# how many seconds before rolling the current file; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written before rolling; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount=0

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result: the file sizes are clearly different now, since files are rolled by size (10240 bytes, about 10 KB each).
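To confirm, the file sizes can be listed in human-readable form (a sketch, run from the Hadoop home directory):

bin/hdfs dfs -ls -h /flume/hdfs1/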


Test Case 4: Write data to a specified HDFS directory laid out as Hive partitions

Partition by year, month, day, and minute.

flume-conf.properties configuration file:

# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c


# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
# path in HDFS where the files are stored
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# must be set because the path uses time escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format to plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

# how many seconds before rolling the current file; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written before rolling; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount=0

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result:

/flume/part/yearst=2019/monthstr=07/daystr=15/minutestr=50


Problems with importing into Hive:

Loading Flume's output files into Hive directly is fairly cumbersome, for two reasons.

Reason 1:

The data in the Hive table must be stored in ORC format (columnar storage).

Reason 2:

The Hive table must be a bucketed table, with the records clustered into buckets.
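For illustration only, a minimal sketch of the kind of table these two requirements imply; the table name, column, and bucket count are assumptions, not from the original article:

hive -e "
CREATE TABLE flume_logs (line STRING)
CLUSTERED BY (line) INTO 2 BUCKETS   -- reason 2: a bucketed table
STORED AS ORC;                       -- reason 1: ORC storage format
"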

Test Case 5: Monitoring a directory dynamically with the Spooling Directory Source

flume-conf.properties configuration file:

# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
# have Flume watch a directory for new files
a1.sources.s1.type = spooldir
# the directory to watch
a1.sources.s1.spoolDir = /opt/datas/flume/spool

# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data

# Each sink's type must be defined
a1.sinks.k1.type = hdfs
# path in HDFS where the files are stored
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# must be set because the path uses time escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format to plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

# how many seconds before rolling the current file; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written before rolling; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount=0

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result: the files placed in the Linux directory are loaded successfully.
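By default the Spooling Directory Source renames each file once it has been fully ingested, which makes it easy to see what was loaded. A sketch (the file name is hypothetical):

cp /opt/datas/access.log /opt/datas/flume/spool/
ls /opt/datas/flume/spool/
# access.log.COMPLETED   <- the default fileSuffix marks files that have already been ingested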


Test Case 6: Filtering out files that should not be loaded into Flume

In Case 5, each file is loaded only once, so any data appended to a file after it has been ingested is never read.

To work around this, add a filter: keep a file excluded from Flume while it is still being written to, and rename it once it is complete so that it gets loaded.
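A minimal sketch of this rename workflow (file names are hypothetical; the spool directory matches the configuration below):

echo "new log line" >> /opt/datas/flume/spool/app.log.tmp   # .tmp files are ignored (see ignorePattern below)
mv /opt/datas/flume/spool/app.log.tmp /opt/datas/flume/spool/app.log   # rename once complete: Flume now ingests it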

flume-conf.properties configuration file:

# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
# have Flume watch a directory for new files
a1.sources.s1.type = spooldir
# the directory to watch
a1.sources.s1.spoolDir = /opt/datas/flume/spool
# regex filter: files matching this pattern are ignored
a1.sources.s1.ignorePattern=([^ ]*\.tmp)

# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data



# Each sink's type must be defined
a1.sinks.k1.type = hdfs
# path in HDFS where the files are stored
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# must be set because the path uses time escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format to plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text

# how many seconds before rolling the current file; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written before rolling; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount=0

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result: files ending with the .tmp suffix are not loaded.


Test Case 7: Monitoring multiple files dynamically with the Taildir Source, buffered in a memory channel

flume-conf.properties configuration file:

# in this case called 'a1'

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined sources
# for a custom-built source class, give the fully qualified class name here
a1.sources.s1.type = TAILDIR
a1.sources.s1.positionFile =/opt/cdh5.7.6/flume-1.6.0-cdh5.7.6-bin/position/taildir_position.json
# absolute paths of the file groups; regular expressions (not filesystem globs) can be used only in the file name
a1.sources.s1.filegroups = f1 f2

a1.sources.s1.filegroups.f1 = /opt/datas/flume/taildir/test.txt
# header value set under the given header key; multiple headers can be specified per file group
a1.sources.s1.headers.f1.age = 17
a1.sources.s1.headers.f1.type = bb

a1.sources.s1.filegroups.f2 = /opt/datas/flume/taildir/huadian/.*
# header value set under the given header key; multiple headers can be specified per file group
a1.sources.s1.headers.f2.age = 18
a1.sources.s1.headers.f2.type = aa

# Each channel's type is defined.
a1.channels.c1.type = memory
# maximum number of events held in the channel
a1.channels.c1.capacity=1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity=100


# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/taildir
# set the file type and write format to plain text (avoids garbled Chinese characters)
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text


# how many seconds before rolling the current file; 0 disables time-based rolling
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written before rolling; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount=0

#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1

Result:


File contents:

i
am
a
chinese

i
love
my
country
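To verify that the Taildir source picks up appended data, append a line to one of the monitored files, for example (assuming test.txt is the file being appended to):

echo "test1" >> /opt/datas/flume/taildir/test.txt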

Contents collected by Flume after appending to the file:


i
am
a
chinese

i
love
my
country
test1