1 Official site

Apache link / CDH link

2 Background
For relational databases we can use Sqoop to move data between Hive, HDFS, MySQL and the like. But what about log data (from the outside in)? And how do we collect the logs produced by Nginx on a schedule and land them in HDFS?
The obvious first idea is to write a shell script and schedule it with crontab. That does work, but what happens once the log volume gets large and you have to choose storage formats and compression formats?
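
As a point of comparison, the naive approach might look roughly like this (a hedged sketch only; the script name, log locations, file-name pattern, and HDFS target directory are all made-up examples):

#!/bin/bash
# collect_ng_logs.sh -- hypothetical "shell script + crontab" log upload
LOG_DIR=/var/log/nginx                       # assumed location of the Nginx logs
HDFS_DIR=/logs/nginx/$(date +%Y%m%d)         # assumed target directory on HDFS
hdfs dfs -mkdir -p "$HDFS_DIR"
# upload yesterday's rotated log file (the rotation naming is an assumption)
hdfs dfs -put "$LOG_DIR/access.log.$(date -d yesterday +%Y%m%d)" "$HDFS_DIR/"

# crontab entry: run every day at 01:00
# 0 1 * * * /home/hadoop/bin/collect_ng_logs.sh

This works at small scale, but it leaves storage formats, compression, retries, and failure handling entirely to the script author.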
This is where Flume comes in. It is an Apache top-level project, and it is what we will study in the rest of this post.
3 Introduction to Flume

Flume NG is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into a centralized data store. Going from Flume OG to Flume NG involved a complete architectural rewrite, and the NG version is not backward compatible with OG. After the redesign, Flume NG is more like a lightweight little tool: very simple, easy to adapt to all kinds of log-collection setups, and it supports failover and load balancing.

3.1 Usage scenarios
flume->HDFS->batch
flume->kafka->streaming
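
The first path (into HDFS for batch processing) is what the rest of this post demonstrates. For the second path, a minimal, hedged sketch of a Kafka sink is shown below; the broker address and topic name are made-up, and the property names assume the Kafka sink bundled with Flume 1.6 (org.apache.flume.sink.kafka.KafkaSink):

# sketch only: drain an existing channel c1 into a Kafka topic
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.brokerList = kafkahost:9092      # assumed broker address
a1.sinks.k2.topic = ng_access_log            # assumed topic name
a1.sinks.k2.batchSize = 100
a1.sinks.k2.requiredAcks = 1
a1.sinks.k2.channel = c1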
3.2 Basic architecture

(Figure: Flume agent architecture, Source -> Channel -> Sink)

3.3 The Event concept

Before going further it is worth introducing the event. Flume's core job is to collect data from a data source (source) and deliver it to a specified destination (sink). To make sure delivery succeeds, the data is buffered (in a channel) before being sent to the sink, and only after the data has actually arrived at the sink does Flume delete its buffered copy.
What flows through the whole pipeline is the event, and the transactional guarantee is made at the event level. So what is an event? An event wraps the data being transported and is Flume's basic unit of data transfer; for a text file it is usually one line of the file. The event is also the basic unit of a transaction. An event flows from source to channel to sink; its body is a byte array, and it can carry headers. An event is the smallest complete unit of data, coming in from an external data source and going out to an external destination.

3.4 Flume's three core components

What makes Flume so useful is its agent design. An agent is a Java process that runs on a log-collection node, i.e. a server that produces logs.

  1. Source: reads data from the source side and writes it to the channel. Common sources: exec, Spooling Directory, Taildir, NetCat.
  2. Channel: buffers the data coming from the source. Common channels: Memory, File.
  3. Sink: takes data from the channel and writes it to the destination. Common sinks: HDFS, Logger, Avro, Kafka.

Source + Channel + Sink = Agent. Data travels from Source to Sink as events, and using Flume is mostly a matter of writing a configuration file that wires the three core components together; they are configurable, pluggable, and composable, which makes Flume very convenient to use.
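
A minimal skeleton of that wiring, with placeholder names (agent1, src1, ch1, sk1 are arbitrary):

# declare the components that make up the agent
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sk1

# each component is then configured under its own prefix
agent1.sources.src1.type = <source type>
agent1.channels.ch1.type = <channel type>
agent1.sinks.sk1.type    = <sink type>

# finally wire them together: a source may feed several channels,
# but a sink drains exactly one channel
agent1.sources.src1.channels = ch1
agent1.sinks.sk1.channel     = ch1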

3.5 File Channel vs. Memory Channel

The File Channel is a durable channel: it persists every event to disk. Even if the JVM dies, the operating system crashes or reboots, or an event has not yet been successfully handed off to the next agent, no data is lost. The Memory Channel is volatile because it keeps all events in memory: if the Java process dies, any events still in memory are lost. Memory is also limited by the size of RAM, whereas the File Channel's advantage is exactly here: as long as there is disk space, it can keep buffering event data on disk.
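
For comparison, minimal configurations for the two channel types might look like this (the capacity numbers and directories are illustrative, not recommendations):

# memory channel: fast, but events still in memory are lost if the JVM dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000              # max events held in the channel
a1.channels.c1.transactionCapacity = 1000    # max events per transaction

# file channel: events are persisted to disk and survive a crash or restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /home/hadoop/flume/checkpoint   # assumed path
a1.channels.c2.dataDirs = /home/hadoop/flume/data              # assumed path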

3.6 Common Flume topologies
  • Fan-in (consolidation). Notes:
  1. Having the sinks of many agents feed a single downstream agent reduces the pressure of many writers hitting HDFS at the same time.
  2. When agents are chained, the upstream agent's Sink and the downstream agent's Source must both use Avro (see the sketch after this list).
  • Fan-out
  • (Figure: fan-out topology)

  • Doesn't this fan-out pattern look a lot like an offline data processing pipeline?
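
A minimal sketch of the Avro hand-off between two chained agents (host name and port are made-up; each agent would of course also need its channel and remaining components):

# upstream agent: an Avro sink ships events to the downstream collector
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-host    # assumed host of the downstream agent
agent1.sinks.k1.port = 4545                  # assumed port
agent1.sinks.k1.channel = c1

# downstream (collector) agent: an Avro source receives events from the upstream agents
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1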
4 Installation

The author downloaded the CDH version (download link).

4.1 Configure FLUME_HOME
export FLUME_HOME=/opt/software/flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
4.2 Edit the configuration file
cp flume-env.sh.template flume-env.sh
# in flume-env.sh:
export JAVA_HOME=/usr/java/jdk1.8.0_45
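
With the environment variables loaded, you can verify the installation; the command should report the installed version (1.6.0-cdh5.7.0 in this setup):

flume-ng version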
5 How to use Flume
5.1 (Source: NetCat) (Sink: logger) (Channel: memory)

NetCat Source: listens on a given network port; anything an application writes to that port is picked up by this source.

Property Name   Default   Description
channels          –
type              –       The component type name, needs to be netcat
bind              –       Host name or IP address the netcat source binds to and listens on; clients send their log data here
port              –       Port the netcat source binds to and listens on; clients send their log data to this port
5.1.1 Configuration file

Official docs link: the official site describes every property in detail, so look things up there when you need to; the available options can differ between versions, so always read the documentation for the version you are actually running.

vi hello.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory


# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
5.1.2 Start command

Help command: flume-ng help

  • Start command
./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/hello.conf -Dflume.root.logger=INFO,console

--name          name of the agent
--conf          Flume's conf directory
--conf-file     path to the agent configuration file
-Dflume.root.logger=INFO,console   print the Flume log to the console
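
If you want the agent to keep running after you close the terminal, a common (plain shell) pattern is to start it in the background, for example:

nohup ./flume-ng agent --name a1 --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/hello.conf \
  -Dflume.root.logger=INFO,console > flume-a1.log 2>&1 &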
5.1.3 Test
[hadoop@hadoop ~]$ telnet localhost 44444
Trying ::1...
Connected to localhost.
Escape character is '^]'.
hello
OK
wold
OK

Note: if the telnet command is not available, install the telnet client and server yourself.
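
For example, on CentOS/RHEL (an assumption about the operating system) the following usually works; with netcat installed, nc localhost 44444 does the same job as a client:

# install the telnet client (and the server, if you also need telnetd)
sudo yum install -y telnet telnet-server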

Meanwhile the agent console logs the bound server socket and each event it receives:

ServerSocketChannelImpl[/0:0:0:0:0:0:0:0:44444]
2018-04-21 17:52:28,531 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
2018-04-21 18:00:42,815 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 6C 64 0D 
You can see that "hello" and "wold" (as typed) were received.
5.2 (Source: NetCat) (Sink: HDFS) (Channel: file)

This time we write the log data to HDFS.


5.2.1 Configuration file
vi test1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 10 seconds; disable rolling by size and by event count
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# timestamp-based file prefix; useLocalTimeStamp makes the %Y-%m-%d... escapes work
# without requiring a timestamp header on every event
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events on disk (file channel)
a1.channels.c1.type = file

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
5.2.2 Start command
./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/test1.conf   -Dflume.root.logger=DEBUG,console

2018-04-21 18:55:57,906 (lifecycleSupervisor-1-3) [DEBUG - org.apache.flume.source.NetcatSource.start(NetcatSource.java:190)] Source started
2018-04-21 18:55:57,907 (Thread-2) [DEBUG - org.apache.flume.source.NetcatSource$AcceptHandler.run(NetcatSource.java:270)] Starting accept handler
5.2.3 Result
[hadoop@hadoop ~]$ telnet localhost 44444
Trying ::1...
Connected to localhost.
Escape character is '^]'.
hello world
OK

hdfs dfs -text /flume/2018-04-21-19-05-47.1524366347156
hello world
5.3 (Source: Spooling Directory) (Sink: HDFS) (Channel: file)

Spooling Directory Source: watches a specified directory; whenever an application drops a new file into that directory, the source picks it up, parses its contents, and writes them to the channel. Once a file has been fully ingested, it is either renamed to mark it as completed or deleted.
The official docs describe this source as quite reliable: even if Flume restarts, no data is lost. To guarantee that, only immutable, uniquely named files may be placed in the directory. In practice we usually let log4j generate the log file names, so names almost never collide, and once a log file has been written it is normally never modified again, which makes this source a good fit for offline data processing.

  • Spooling Directory Source properties (from the Flume docs):
Property Name     Default       Description
channels            –
type                –           The component type name, needs to be spooldir.
spoolDir            –           The directory the Spooling Directory Source watches
fileSuffix       .COMPLETED     Suffix appended to a file once its contents have been written to the channel
deletePolicy      never         When to delete a file after its contents are in the channel: never or immediate
fileHeader        false         Whether to add a header storing the absolute path filename.
ignorePattern      ^$           Regular expression specifying which files to ignore (skip)
interceptors        –           Interceptors that set event headers in flight; the timestamp interceptor is commonly used

Two caveats for the Spooling Directory Source:

1 If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
  In other words, a file copied into the spool directory must not be opened and edited afterwards.
2 If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
  In other words, never copy two files with the same name into this directory.
A safe pattern is to finish writing the file elsewhere and then move it into the spool directory, as sketched below.
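
A hedged sketch of that pattern (the staging directory is made-up; the spool directory matches the configuration below):

# finish writing the file somewhere outside the watched directory first ...
cp /tmp/app.log.2018-04-21 /home/hadoop/staging/

# ... then move it into the spool directory in a single step, with a name
# that is unique per file, so the source never sees a half-written file
mv /home/hadoop/staging/app.log.2018-04-21 /home/hadoop/inputfile/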
5.3.1 Configuration file
vi test2.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/inputfile
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events on disk (file channel)
a1.channels.c1.type = file

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
5.3.2 Start command
./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/test2.conf  -Dflume.root.logger=INFO,console

2018-04-21 19:52:14,083 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.file.FileChannel.start(FileChannel.java:301)] Queue Size after replay: 0 [channel=c1]
2018-04-21 19:52:14,186 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2018-04-21 19:52:14,188 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2018-04-21 19:52:14,188 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2018-04-21 19:52:14,190 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source r1
2018-04-21 19:52:14,190 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.source.SpoolDirectorySource.start(SpoolDirectorySource.java:78)] SpoolDirectorySource source starting with directory: /home/hadoop/inputfile
2018-04-21 19:52:14,200 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
2018-04-21 19:52:14,201 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: k1 started
2018-04-21 19:52:14,240 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: r1: Successfully registered new MBean.
2018-04-21 19:52:14,241 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: r1 started


# drop a test file into the watched directory
cp input.txt ../inputfile/
5.3.3 Result
[hadoop@hadoop conf]$ hdfs dfs -text /flume/2018-04-21-19-53-24.1524369204246 
hello java
[hadoop@hadoop conf]$ hdfs dfs -text /flume/2018-04-21-19-53-26.1524369206318
hello hadoop
hello hive
hello sqoop
hello hdfs
hello spark

The agent log also shows whether each input file was ingested successfully.
5.4 (Source: Exec) (Sink: HDFS) (Channel: memory)

Exec Source: runs a specified command and uses the command's output as its data source. The most common command is tail -F <file>: whenever the application appends new lines to the log file, the source picks up the latest content.

5.4.1 Configuration file
vi test3.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
a1.sinks.k1.hdfs.useLocalTimeStamp = true


# Use a channel which buffers events in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Create an external table in Hive over the hdfs://hadoop:9000/flume directory so the captured log content is easy to query:

create external table t1(infor  string)
row format delimited
fields terminated by '\t'
location '/flume/';
5.4.2 Start command
./flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/test3.conf  -Dflume.root.logger=INFO,console

echo hadoop >data.log
5.4.3 Result
On HDFS:
[hadoop@hadoop ~]$ hdfs dfs -text /flume/2018-04-21-20-13-38.1524370418338 
hadoop

In Hive:
hive> select * from t1;
OK
hello world
hello java
hello hadoop
hello hive
hello sqoop
hello hdfs
hello spark
hadoop

Summary of the Exec source: the Exec Source and the Spooling Directory Source are the two most common ways to collect logs. The Exec source can collect logs in (near) real time, which the Spooling Directory Source cannot quite match. However, even though the Exec source is real-time, if Flume is not running or the command fails, it collects nothing during that window and those log lines are lost, so it cannot guarantee the completeness of the collected logs.
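
One common mitigation is the Taildir Source mentioned in section 3.4: it records how far it has read in each file in a position file, so after a restart it resumes where it left off. A minimal, hedged sketch (this assumes a Flume build that includes the Taildir Source, which appeared in Apache Flume 1.7; the paths are illustrative):

a1.sources.r1.type = TAILDIR
# where the source remembers its read position for each tailed file
a1.sources.r1.positionFile = /home/hadoop/flume/taildir_position.json
# one or more groups of files to tail
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/hadoop/data/data.log
a1.sources.r1.channels = c1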