Flume and HDFS: collecting log files with Flume

Reposted

detailtoo 2024-02-18 20:34:52



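This article walks through two hands-on Flume log collection cases. Case 1: watch a directory for changes and collect newly added files into HDFS. Case 2: watch a single file for changes and store the appended content on HDFS.

Installing, deploying, and testing Flume is covered in the previous article (8).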

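Case 1: Watch a directory for changes and collect newly added files into HDFS

Flume watches a directory on the server (the log directory). Whenever a new file appears in that directory, the data in the file is delivered to a directory on HDFS.

A collection plan has to settle three parts: the data source, the data sink, and the channel between them.

Data source: the spooling directory source (spooldir), as described in the official documentation

Data sink: the HDFS sink (hdfs)

Collection configuration file

The final configuration file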

# example.conf: A single-node Flume configuration

# Name the components on this agent
# Define the agent name a1 and its components: sources, sinks, and channels

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Define the data source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/apache/flumeSpool
a1.sources.r1.fileHeader = true

# Describe the sink
# Define the data destination (sink)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true


# Use a channel which buffers events in memory
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Launch command
# flume-ng agent --conf conf --conf-file conf/spoolingdir-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
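
To see which directory an event ends up in, here is a small sketch of the rounding that hdfs.round = true, hdfs.roundValue = 10 and hdfs.roundUnit = minute apply before the %y-%m-%d/%H%M/%S escapes in hdfs.path are filled in. It assumes GNU date and works in UTC to stay deterministic (with useLocalTimeStamp the agent uses its own local time zone); the helper name bucket_path is made up for illustration.

```shell
# Hypothetical helper: print the HDFS directory for an event timestamp,
# rounding down to the previous 10-minute boundary like the sink settings do.
# Assumes GNU date (-d "@epoch") and UTC.
bucket_path() {
  epoch="$1"                          # event time in seconds since the epoch
  rounded=$(( epoch - epoch % 600 ))  # 600 s = roundValue 10 * roundUnit minute
  date -u -d "@$rounded" '+/flume/events/%y-%m-%d/%H%M/%S'
}

bucket_path 1700000000   # 2023-11-14 22:13:20 UTC -> /flume/events/23-11-14/2210/00
```

Because the seconds are rounded away as well, the trailing %S directory is always 00 with these settings.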

Preparation before starting

Create the spooling directory (/var/log/apache/flumeSpool in the configuration above)
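
A minimal sketch of this step, assuming the path from a1.sources.r1.spoolDir. The helper name make_spool_dir is hypothetical, and the account that runs Flume presumably needs read and write access to the directory.

```shell
# Hypothetical helper: create the spooling directory and confirm it exists.
make_spool_dir() {
  dir="$1"
  mkdir -p "$dir" && [ -d "$dir" ] && echo "spool directory ready: $dir"
}

# Path taken from a1.sources.r1.spoolDir; creating it under /var/log may need root.
# make_spool_dir /var/log/apache/flumeSpool
```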

Start the DFS cluster (for example with Hadoop's start-dfs.sh script)

Start Flume

When the command is run through the bin directory, the paths need the conf prefix:

flume-ng agent --conf conf --conf-file conf/spoolingdir-hdfs.conf --name a1 -Dflume.root.logger=INFO,console

If you run the command from inside the conf directory instead, drop the conf prefix from the configuration file path in the command above.
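
Since the relative conf paths depend on the directory the command is launched from, one option is to build the command from absolute paths. The sketch below only prints the command rather than running it. The helper name build_agent_cmd and the /opt/flume install path are assumptions for illustration; the flags are the ones used above.

```shell
# Hypothetical helper: print a flume-ng command that works from any directory.
build_agent_cmd() {
  flume_home="$1"   # Flume installation directory (assumed layout: bin/ and conf/)
  conf_name="$2"    # configuration file name inside conf/
  agent="$3"        # agent name, e.g. a1
  printf '%s agent --conf %s --conf-file %s --name %s -Dflume.root.logger=INFO,console\n' \
    "$flume_home/bin/flume-ng" "$flume_home/conf" "$flume_home/conf/$conf_name" "$agent"
}

# /opt/flume is a placeholder install path.
build_agent_cmd /opt/flume spoolingdir-hdfs.conf a1
```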

Test

Flume watches the /var/log/apache/flumeSpool/ directory for new files. Upload a file into that directory and Flume will send its contents to the HDFS cluster.
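
The spooling directory source expects files to be complete when they appear and to keep unique names; once a file is processed, Flume renames it with a .COMPLETED suffix by default. A cautious way to drop a test file in, sketched under those assumptions, is to copy it to a staging directory first and then move it into place. The helper name drop_into_spool and the staging directory name are made up, and the staging directory is assumed to sit on the same filesystem as the spool directory.

```shell
# Hypothetical helper: place a finished file into the spool directory in one move.
drop_into_spool() {
  src="$1"; spool="$2"
  name=$(basename "$src")
  # Refuse names the source has already seen (processed files get .COMPLETED).
  if [ -e "$spool/$name" ] || [ -e "$spool/$name.COMPLETED" ]; then
    echo "name already used in spool directory: $name" >&2
    return 1
  fi
  staging="$(dirname "$spool")/.flume-staging.$$"   # sibling directory, same filesystem assumed
  mkdir -p "$staging" && cp "$src" "$staging/$name" &&
    mv "$staging/$name" "$spool/$name" && rmdir "$staging"
}

# Example call (the source file name is a placeholder):
# drop_into_spool ./access.log /var/log/apache/flumeSpool
```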

An error occurs

It is caused by mismatched guava versions between Flume and Hadoop.

The fix is to take a copy of the guava jar from Hadoop and use it to replace the one in Flume.
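
The replacement could look roughly like the sketch below. It assumes the usual layouts, with Flume's jars under lib/ and Hadoop's under share/hadoop/common/lib/; the helper name swap_guava is hypothetical, and the old jar is kept as a backup rather than deleted.

```shell
# Hypothetical helper: park Flume's bundled guava jar and copy Hadoop's in its place.
swap_guava() {
  flume_home="$1"; hadoop_home="$2"
  for jar in "$flume_home"/lib/guava-*.jar; do
    # Rename instead of deleting so the change can be undone.
    [ -e "$jar" ] && mv "$jar" "$jar.bak"
  done
  cp "$hadoop_home"/share/hadoop/common/lib/guava-*.jar "$flume_home/lib/"
}

# Example call, assuming both environment variables are set:
# swap_guava "$FLUME_HOME" "$HADOOP_HOME"
```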


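Restart Flume, upload a file to the log directory, and check the result

The file contents look garbled when viewed, which means the collection plan needs improving.

Revise the collection plan, restart Flume, and upload a file again to test the result.

The collection configuration shown above is meant to be the final, revised version,

so in the end the collected file should be readable.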
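
The article does not spell out which change removed the garbling. A likely cause is that the HDFS sink writes the SequenceFile container format by default, and the Case 2 configuration sets hdfs.fileType = DataStream for plain text, a line the Case 1 listing does not contain. Assuming that is the fix, the sketch below appends the setting to a configuration file if it is missing; the helper name use_plain_text_output is made up.

```shell
# Hypothetical helper: make sink k1 of agent a1 write plain text instead of the
# default SequenceFile format, unless the file already sets hdfs.fileType.
use_plain_text_output() {
  conf_file="$1"
  if grep -q '^a1\.sinks\.k1\.hdfs\.fileType' "$conf_file"; then
    echo "hdfs.fileType already set in $conf_file"
  else
    printf '%s\n' 'a1.sinks.k1.hdfs.fileType = DataStream' >> "$conf_file"
  fi
}

# Example call, using the file name from the launch command:
# use_plain_text_output conf/spoolingdir-hdfs.conf
```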


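Case 2: Watch a single file for changes and store the appended content on HDFS

Collection plan

Data source: the exec source, which runs tail -F on the monitored file

The data sink does not need to change

The final configuration file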

# example.conf: A single-node Flume configuration

# Name the components on this agent
# Define the agent name a1 and its components: sources, sinks, and channels

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Define the data source

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/test.log

# Describe the sink
# Define the data destination (sink)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
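# Whether to round the timestamp down (controls how often a new directory is created)
a1.sinks.k1.hdfs.round = true
# Round down to 10-minute boundaries, so a new directory is created every ten minutes
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Editor tip: in column-edit mode, hold Alt to select multiple columns

# Roll the file after this time interval (seconds)
a1.sinks.k1.hdfs.rollInterval = 3
# Roll the file after this size (bytes)
a1.sinks.k1.hdfs.rollSize = 20
# Roll the file after this many events; whichever of the three is met first triggers the roll
a1.sinks.k1.hdfs.rollCount = 5
# Batch size
a1.sinks.k1.hdfs.batchSize = 1
# File type: DataStream writes an uncompressed plain text file
a1.sinks.k1.hdfs.fileType = DataStream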


# Use a channel which buffers events in memory
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Launch command
# flume-ng agent --conf conf --conf-file conf/exec-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
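
The three roll thresholds in this configuration are deliberately tiny. As a rough check, the sketch below measures one log line against rollSize = 20 bytes. The sample line is made up in the format date usually prints, and the helper name line_bytes is hypothetical. As far as I understand the sink, a line this long already reaches the size threshold on its own, so files tend to roll after nearly every event, before rollCount = 5 or rollInterval = 3 come into play.

```shell
# Hypothetical helper: byte length of one log line, without the trailing newline.
line_bytes() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

sample='Sun Feb 18 20:34:52 CST 2024'   # made-up line in the usual `date` format
if [ "$(line_bytes "$sample")" -ge 20 ]; then
  echo "one line already reaches rollSize = 20 bytes"
fi
```

Values this small are convenient for watching files roll during a demo, but they produce a large number of very small HDFS files.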

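Testing the collection

Simulated scenario:

First write a shell script that keeps appending the current date to the monitored file /var/log/test.log, imitating a server log file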
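
The original script is not reproduced in the text, so the following is only a sketch of what such a generator might look like. It is bounded by a line count rather than running forever, which makes it easier to stop; the helper name emit_dates and its parameters are assumptions.

```shell
# Hypothetical helper: append the current date to a log file a fixed number of times.
emit_dates() {
  log_file="$1"; count="$2"; pause="${3:-1}"   # pause between lines, in seconds
  i=0
  while [ "$i" -lt "$count" ]; do
    date >> "$log_file"
    i=$(( i + 1 ))
    if [ "$i" -lt "$count" ]; then
      sleep "$pause"
    fi
  done
}

# Example call: 600 lines, one per second, into the monitored file.
# emit_dates /var/log/test.log 600 1
```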

Then open a second session (clone the current one) and watch the newly appended content

Start Flume

Check the result on HDFS

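Note: be sure to stop (kill) this Flume agent when you are done.

The script has been generating log lines in a loop the whole time.

Each file is listed with a 128 MB block size.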

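That completes the two Flume cases. Flume is a very important link in big data projects. The next article (10) stays with Flume and covers its reliability guarantees: load balancing and failover.

There is still a bug in part of the Flume interceptor material; I will cover it once I have fixed it.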
