Introduction
Apache Flume is a distributed, highly available framework for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into a centralized store.
Apache Flume is not limited to log aggregation: since data sources can be customized, it can also transport large volumes of event data such as network traffic, social-media data, email, and more.
Flume currently has two major versions, 0.9.x and 1.x. The 0.9.x line is the legacy version, known as Flume OG (original generation); the 1.x line is the rewritten version, known as Flume NG (next generation).
System requirements
- JDK 1.7 or later
- Sufficient memory
- Sufficient disk space
- Read/write permissions on the directories Flume uses
Data flow model
A Flume event (a message or record) is defined as a unit of data flow. It carries a byte payload and an optional set of string attributes. A Flume agent is a JVM process through which events flow from an external source to the next hop (which may be another agent or a final destination).
- Note: an event is the basic unit of data transfer in Flume; one record corresponds to one event. It consists of a header and a body, and the body holds the payload as a byte array.
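For instance, the logger sink used in the hands-on example below prints an event in exactly this shape, a (possibly empty) header map followed by the body bytes:
Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }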
Flume download
Download: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1.tar.gz
Documentation: http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html
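A typical way to fetch and unpack the release (the working directory is your choice):
$ wget http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1.tar.gz
$ tar -xzf flume-ng-1.6.0-cdh5.15.1.tar.gz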
Flume core components (the building blocks of an agent):
- source: collects data from an external data source. The full list of built-in sources is at http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#flume-sources
Commonly used sources: avro, exec, spooling directory, taildir, kafka, netcat, http, custom.
- channel: sits between the source and the sink, buffering events in transit (decoupling ingest from delivery). See http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#flume-channels
Commonly used channels: memory, file, kafka, custom.
- sink: drains events from the channel and passes them to the next stage. See http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#flume-sinks
Commonly used sinks: hdfs, logger, avro, kafka, custom.
Defining a data flow (defining an agent)
- Overview: a simple data flow requires the three core components, source, channel, and sink, with the channel connecting the source to the sink. An agent therefore declares a list of its sources, sinks, and channels, and the wiring between them. Note: a source can write to multiple channels, but a sink reads from exactly one channel (a fan-out sketch follows the single-channel example below).
# Every agent must have a name.
# <Agent> = your agent name
# <Source> = your source name
# <Channel1> = name of the first channel
# <Channel2> = name of the second channel
# <Sink> = name of the sink
# Declare the agent's lists of sources, sinks, and channels
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>
# Bind the source to its channels
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...
# Bind the sink to its (single) channel
<Agent>.sinks.<Sink>.channel = <Channel1>
- Example
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
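As noted above, a single source can fan out to several channels while each sink drains exactly one. A minimal sketch of that layout (the second channel and the logger sink are illustrative additions; with no selector configured, Flume's default replicating selector copies every event to all listed channels):
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1 logger-sink-1
agent_foo.channels = mem-channel-1 mem-channel-2
# The source replicates each event to both channels
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1 mem-channel-2
# Each sink reads from exactly one channel
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
agent_foo.sinks.logger-sink-1.channel = mem-channel-2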
Hands-on
Listening for data on local port 44444
- Documentation:
http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#a-simple-example
- Create a configuration file named example-conf.properties in the conf directory under the Flume root directory:
# Name the agent a1, its source r1, its sink k1, and its channel c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# To listen on a port, use a netcat source (http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.15.1/FlumeUserGuide.html#netcat-tcp-source),
# bound here to localhost on port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Use a logger sink, i.e. write events to the log output
a1.sinks.k1.type = logger
# Use a memory channel: events from the source are buffered in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Startup command
$ bin/flume-ng agent --conf conf --conf-file conf/example-conf.properties --name a1 -Dflume.root.logger=INFO,console
Argument reference:
- --conf-file: location of the configuration file
- --name: name of the agent
- Test: connect with telnet and type some text; the agent console will print each line as an event
$ telnet localhost 44444
# Console 1: input
hello
# Console 2: output
2019-09-11 21:46:39,812 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }
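The body is shown both as raw bytes and as a readable preview: 68 65 6C 6C 6F 0D is "hello" followed by the carriage return (0D) that telnet appends.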
Monitoring appended file content and writing it to HDFS
- Create a configuration file named example-conf1.properties in the conf directory under the Flume root directory:
# Name the agent tailDir-hdfs
tailDir-hdfs.sources = tailDir-source
tailDir-hdfs.sinks = tailDir-sink
tailDir-hdfs.channels = tailDir-channel
tailDir-hdfs.sources.tailDir-source.type = TAILDIR
tailDir-hdfs.sources.tailDir-source.filegroups =example2
tailDir-hdfs.sources.tailDir-source.filegroups.example2 =/home/hadoop/data/flume/example.log
# Use an hdfs sink: events are written to HDFS
tailDir-hdfs.sinks.tailDir-sink.type = hdfs
tailDir-hdfs.sinks.tailDir-sink.hdfs.path = /flume/events/%y-%m-%d
tailDir-hdfs.sinks.tailDir-sink.hdfs.filePrefix = events-
tailDir-hdfs.sinks.tailDir-sink.hdfs.round = true
tailDir-hdfs.sinks.tailDir-sink.hdfs.roundValue = 1
tailDir-hdfs.sinks.tailDir-sink.hdfs.roundUnit = second
# Required here: hdfs.path contains time escapes (%y-%m-%d) but the events carry
# no timestamp header, so commenting this out fails with the NullPointerException
# shown under "Possible errors" below
tailDir-hdfs.sinks.tailDir-sink.hdfs.useLocalTimeStamp = true
# Use a memory channel: events from the source are buffered in memory
tailDir-hdfs.channels.tailDir-channel.type = memory
tailDir-hdfs.channels.tailDir-channel.capacity = 1000
tailDir-hdfs.channels.tailDir-channel.transactionCapacity = 100
# Wire the source and the sink to the channel
tailDir-hdfs.sources.tailDir-source.channels = tailDir-channel
tailDir-hdfs.sinks.tailDir-sink.channel = tailDir-channel
- Startup command
$ bin/flume-ng agent --conf conf --conf-file conf/example-conf1.properties --name tailDir-hdfs -Dflume.root.logger=INFO,console
- Test: append a line to the watched file; the agent console will log the HDFS writes
# Console 1: input
$ echo 111 >> /home/hadoop/data/flume/example.log
# Console 2: output
2019-09-12 02:09:50,169 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSSequenceFile.configure(HDFSSequenceFile.java:63)] writeFormat = Writable, UseRawLocalFileSystem = false
2019-09-12 02:09:50,195 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:251)] Creating /flume/event/19-09-12/0209/50/events-.1568268590170.tmp
2019-09-12 02:10:14,851 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:393)] Closing /flume/event/19-09-12/0209/43/events-.1568268583163.tmp
2019-09-12 02:10:14,911 (hdfs-tailDir-sink-call-runner-7) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:655)] Renaming /flume/event/19-09-12/0209/43/events-.1568268583163.tmp to /flume/event/19-09-12/0209/43/events-.1568268583163
2019-09-12 02:10:14,919 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:382)] Writer callback called.
2019-09-12 02:10:20,250 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:393)] Closing /flume/event/19-09-12/0209/50/events-.1568268590170.tmp
2019-09-12 02:10:20,273 (hdfs-tailDir-sink-call-runner-9) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:655)] Renaming /flume/event/19-09-12/0209/50/events-.1568268590170.tmp to /flume/event/19-09-12/0209/50/events-.1568268590170
2019-09-12 02:10:20,282 (hdfs-tailDir-sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:382)] Writer callback called.
- Possible errors
2019-09-12 01:59:30,612 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
at org.apache.flume.formatter.output.BucketPath.replaceShorthand(BucketPath.java:251)
at org.apache.flume.formatter.output.BucketPath.escapeString(BucketPath.java:460)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:368)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
2019-09-12 01:59:30,613 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:451)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: Expected timestamp in the Flume event headers, but it was null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
at org.apache.flume.formatter.output.BucketPath.replaceShorthand(BucketPath.java:251)
at org.apache.flume.formatter.output.BucketPath.escapeString(BucketPath.java:460)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:368)
... 3 more
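This NullPointerException appears when hdfs.path contains time escape sequences (%y-%m-%d here) but the events reaching the sink carry no timestamp header. Either set hdfs.useLocalTimeStamp = true on the sink, as in the configuration above, or stamp each event at the source with Flume's built-in timestamp interceptor. A minimal sketch of the interceptor alternative (the interceptor name ts is arbitrary; the type alias timestamp is built into Flume):
# Add the current time to each event's headers as it enters the source
tailDir-hdfs.sources.tailDir-source.interceptors = ts
tailDir-hdfs.sources.tailDir-source.interceptors.ts.type = timestamp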
- The generated files can then be inspected in the HDFS filesystem, for example:
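A couple of standard HDFS commands for this (the file name below is illustrative; take the real one from the ls output). Note that the hdfs sink writes SequenceFiles by default, so use -text rather than -cat to read them:
# List everything the sink produced under the configured hdfs.path
$ hdfs dfs -ls -R /flume/events
# Print a finished file
$ hdfs dfs -text /flume/events/19-09-12/events-.1568268590170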