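Log Collection Architecture Comparison: Log Collection Systems

Reposted

mob64ca1412b28c 2023-08-12 21:00:44

Tags: log collection architecture comparison, hdfs, hadoop, configuration files. Category: architecture, back-end development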

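1. Foreword

In a complete offline big data processing system, HDFS, MapReduce, and Hive form the core of the analysis system, but supporting systems for data collection, result export, and task scheduling are just as indispensable.
The Hadoop ecosystem offers convenient open-source frameworks for each of these supporting tools, as shown in the figure: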

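2. Introduction to Flume

1. Overview

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.
Flume can collect source data in many forms, including files, directories, socket packets, and Kafka, and can deliver (sink) the collected data to many external storage systems such as HDFS, HBase, Hive, and Kafka.
Typical collection requirements can be met with simple Flume configuration.
Flume also offers good custom extension capabilities for special scenarios, so it suits most day-to-day data collection scenarios.

2. How It Works

The core role in a Flume distributed system is the agent; a Flume collection system is formed by connecting agents together.
Each agent acts as a data courier and contains three components:

Source: the collection component, which connects to the data source and obtains data
Sink: the delivery component, which passes data to the next-level agent or to the final storage system
Channel: the transport component, which carries data from the source to the sink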


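3. Flume Collection System Topologies

1. Simple topology

A single agent collects the data

2. Complex topologies

Two agents chained together
Multiple levels of agents chained together
Multiple channels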

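Installing and Deploying Flume

Step 1: Download, extract, and edit the configuration file

Installing Flume is simple: extracting the archive is all it takes.
Upload the installation package to the node where the data source lives.
Here we install on the third machine, hadoop03.

cd /bigdata/soft
tar -xzvf apache-flume-1.9.0-bin.tar.gz -C /bigdata/install/
# The later steps use /bigdata/install/flume-1.9.0, so rename the extracted
# directory to match (assumed; the original omits this step)
mv /bigdata/install/apache-flume-1.9.0-bin /bigdata/install/flume-1.9.0
cd /bigdata/install/flume-1.9.0/conf/
cp flume-env.sh.template flume-env.sh
vim flume-env.sh

Change the following setting

export JAVA_HOME=/kkb/install/jdk1.8.0_141

Step 2: Resolve the jar conflict

Both apache-flume-1.9.0-bin and hadoop-3.1.4 ship a guava jar, but in different versions, which causes a conflict.
To resolve it, replace the older guava jar in Flume with the newer one from Hadoop.

cd /bigdata/install/flume-1.9.0/lib
rm -f guava-11.0.2.jar
cp /bigdata/install/hadoop-3.1.4/share/hadoop/common/lib/guava-27.0-jre.jar .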
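Flume Hands-On Case -- Collecting Data from a Network Port to the Console

Requirement: write a collection configuration for network data that reads from a socket port and prints the collected data to the console.
Create a new configuration file (the collection plan) in Flume's conf directory.

cd /bigdata/install/flume-1.9.0/conf
vim netcat-logger.conf

with the following content

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source component: r1
a1.sources.r1.type = netcat
# Host name or IP address of the current node
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 44444

# Describe and configure the sink component: k1
a1.sinks.k1.type = logger

# Describe and configure the channel component; this one buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events the channel takes from the source or hands to the sink per transaction
a1.channels.c1.transactionCapacity = 100

# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Official documentation for the component types used here:
NetCat TCP Source
Logger Sink
Memory Channel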

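Step 3: Start the agent with the configuration file

Specify the collection configuration file and start the Flume agent on the corresponding node.
We start with the simplest possible example to check that the environment works.
Start the agent to collect data:

cd /bigdata/install/flume-1.9.0
bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

-c conf: the directory holding Flume's own configuration files
-f conf/netcat-logger.conf: the collection configuration we wrote
-n a1: the name of this agent
-Dflume.root.logger=INFO,console: print log messages at INFO level to the console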
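
A common stumbling block with these flags is passing a name to -n that does not match the prefix used inside the configuration file; in that case the agent typically starts without any components. The sketch below is one possible pre-flight check; the function name agent_defined is hypothetical and the check is only a simple text match, not a full parse of the file.

```shell
#!/bin/bash
# agent_defined CONF_FILE AGENT_NAME
# Succeed if CONF_FILE declares sources, sinks, or channels for AGENT_NAME,
# i.e. contains a line such as "a1.sources = r1".
agent_defined() {
  local conf="$1" agent="$2"
  grep -Eq "^[[:space:]]*${agent}\.(sources|sinks|channels)[[:space:]]*=" "$conf"
}
```

It could be used as a guard before launching, for example `agent_defined conf/netcat-logger.conf a1 && bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console`.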

Step 4: Install telnet for testing

Install the telnet client on hadoop02 to simulate sending data

sudo yum -y install telnet
telnet hadoop03 44444  # use telnet to simulate sending data

The result is shown in the figure below

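Flume Hands-On Case -- Collecting a Directory into HDFS

Requirement analysis

Requirement: new files keep appearing in a particular directory on a server, and every new file must be collected into HDFS.
Architecture diagram:

[Figure omitted]

Based on the requirement, first define the following three elements

Data source component (source), which monitors a file directory: spooldir

Characteristics of spooldir:
1. It watches a directory and collects the contents of any new file that appears in it
2. Once a file has been fully collected, the agent automatically appends a suffix to its name: .COMPLETED
3. This source is reliable and does not lose data, even if Flume is restarted or killed
Note:
The watched directory must not contain files with the same name, and a file must not be modified once it is placed in the spooldir
① If data is written to a file after it has been placed in the spooldir, Flume logs an error and stops processing
② If a file with a previously used name appears in the spooldir, Flume also logs an error and stops processing

Delivery component (sink), which writes to the HDFS file system: hdfs sink
Channel component (channel): either a file channel or a memory channel can be used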
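
The two spooldir restrictions (no reused names, no writes after the file arrives) can be respected by finishing the file somewhere else and then moving it into the watched directory under a unique name. The following is a minimal sketch under stated assumptions: the function name spool_put and the sibling staging directory are hypothetical, and the staging directory is assumed to be on the same file system as the spooldir so that mv is a rename rather than a copy.

```shell
#!/bin/bash
# spool_put SRC SPOOL_DIR [STAGING_DIR]
# Copy SRC into a staging directory under a unique name, then move the
# finished copy into SPOOL_DIR. Prints the final path on success.
spool_put() {
  local src="$1" spool_dir="$2" staging_dir="${3:-$2.staging}"
  local base staged
  base="$(basename "$src")"
  mkdir -p "$staging_dir" "$spool_dir" || return 1
  # mktemp creates an empty file with a unique random suffix, e.g. a.txt.k3Xq9B
  staged="$(mktemp "$staging_dir/$base.XXXXXX")" || return 1
  cp "$src" "$staged" || return 1
  # The file is complete before it becomes visible to the spooldir source
  mv "$staged" "$spool_dir/" || return 1
  printf '%s\n' "$spool_dir/$(basename "$staged")"
}
```

With this helper, running `spool_put a.txt /bigdata/install/mydata/flume/dirfile` twice places two differently named files in the directory instead of triggering the duplicate-name error.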

Writing the Flume configuration file

Write the configuration file:

cd /bigdata/install/flume-1.9.0/conf/
mkdir -p /bigdata/install/mydata/flume/dirfile
vim spooldir.conf

with the following content

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Note: do not drop a file with an already-used name into the monitored directory
a1.sources.r1.type = spooldir
# Directory to monitor
a1.sources.r1.spoolDir = /bigdata/install/mydata/flume/dirfile
# Whether to add a header storing the absolute path filename
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = hdfs
# Collected data is written to this path
a1.sinks.k1.hdfs.path = hdfs://hadoop01:8020/spooldir/files/%y-%m-%d/%H%M/
# Prefix for the names of the files created on HDFS
a1.sinks.k1.hdfs.filePrefix = events-
# Round the timestamp down
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of 10 minutes; e.g. minute 55 becomes 50, and 38 -> 30
a1.sinks.k1.hdfs.roundValue = 10
# Unit used for rounding
a1.sinks.k1.hdfs.roundUnit = minute
# Roll to a new file every 3 seconds; default 30 (0 = never roll based on time interval)
a1.sinks.k1.hdfs.rollInterval = 3
# Roll to a new file every x bytes; default 1024 (0 = never roll based on file size)
a1.sinks.k1.hdfs.rollSize = 20
# Roll to a new file every x events; default 10 (0 = never roll based on number of events)
a1.sinks.k1.hdfs.rollCount = 5
# Number of events written before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1
# Use the local time instead of a timestamp from the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type; the default is SequenceFile. DataStream writes plain text; CompressedStream writes compressed data
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events taken from the source or handed to the sink per transaction
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
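
The effect of hdfs.round, hdfs.roundValue, and hdfs.roundUnit on the %H%M part of the path can be reproduced with a little integer arithmetic. The sketch below only illustrates the rounding rule described in the comments above; the function names round_down_minute and current_bucket are hypothetical and nothing here is called by Flume.

```shell
#!/bin/bash
# round_down_minute MINUTE STEP
# Round MINUTE down to a multiple of STEP and print it as two digits.
round_down_minute() {
  local minute="${1#0}" step="$2"   # drop one leading zero so "08" is not read as octal
  minute="${minute:-0}"             # "0" becomes empty after stripping; restore it
  printf '%02d\n' $(( minute / step * step ))
}

# current_bucket [STEP]
# Print the directory suffix that %y-%m-%d/%H%M would yield right now
# when rounding down to STEP minutes (default 10).
current_bucket() {
  local step="${1:-10}" day hour minute
  read -r day hour minute <<EOF
$(date '+%y-%m-%d %H %M')
EOF
  printf '%s/%s%s\n' "$day" "$hour" "$(round_down_minute "$minute" "$step")"
}
```

Calling `round_down_minute 55 10` prints 50 and `round_down_minute 38 10` prints 30, matching the examples in the configuration comments; events collected at 14:38 would therefore land in a directory ending in 1430.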

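Official documentation for the components:
Spooling Directory Source
HDFS Sink
Memory Channel

Channel parameters:

capacity: the maximum number of events the channel can store
transactionCapacity: the maximum number of events taken from the source or handed to the sink per transaction
keep-alive: the timeout, in seconds, for adding an event to the channel or removing one from it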
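
Because a memory channel's transactionCapacity must not exceed its capacity, and because a properties file treats anything after the equals sign (including a trailing comment) as part of the value, a quick sanity check of a configuration can catch both mistakes before the agent starts. This is a minimal sketch; get_prop and check_channel_capacity are hypothetical helper names, the key lookup is a simple text match, and the fallback of 100 for both settings is the memory channel default as documented in the Flume user guide.

```shell
#!/bin/bash
# get_prop FILE KEY
# Print the value assigned to KEY in a properties file (last definition wins).
get_prop() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" |
    tail -n 1 | sed 's/[[:space:]]*$//'
}

# check_channel_capacity FILE CHANNEL_PREFIX
# CHANNEL_PREFIX is e.g. "a1.channels.c1".
# Returns 0 if the values are sane, 1 if transactionCapacity > capacity,
# 2 if either value is not a plain integer.
check_channel_capacity() {
  local conf="$1" prefix="$2" cap tx
  cap="$(get_prop "$conf" "$prefix.capacity")"
  tx="$(get_prop "$conf" "$prefix.transactionCapacity")"
  cap="${cap:-100}"
  tx="${tx:-100}"
  case "$cap$tx" in
    *[!0-9]*)
      echo "non-numeric value: capacity='$cap' transactionCapacity='$tx'" >&2
      return 2 ;;
  esac
  if [ "$tx" -gt "$cap" ]; then
    echo "transactionCapacity ($tx) exceeds capacity ($cap)" >&2
    return 1
  fi
}
```

For the configuration above, `check_channel_capacity conf/spooldir.conf a1.channels.c1` succeeds silently, since 100 does not exceed 1000.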

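Start Flume

cd /bigdata/install/flume-1.9.0
bin/flume-ng agent -c ./conf -f ./conf/spooldir.conf -n a1 -Dflume.root.logger=INFO,console

Upload files to the monitored directory

Copy different files into the directory below; note that file names must not repeat

mkdir -p /home/hadoop/datas
cd /home/hadoop/datas
vim a.txt

# Add the following content
ab cd ef
english math
hadoop alibaba

Then run:

cp a.txt /bigdata/install/mydata/flume/dirfile

Then watch the Flume console output and the files created in the HDFS web UI
Check the spooldir directory
Copy a file with the same name into /bigdata/install/mydata/flume/dirfile again and observe what happens:

cp a.txt /bigdata/install/mydata/flume/dirfile

The Flume console reports an error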

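Flume Hands-On Case -- Collecting a File into HDFS

Requirement analysis:

Requirement: a business system writes logs with log4j, for example, and the log keeps growing; the data appended to the log file must be collected into HDFS in real time

Based on the requirement, first define the following three elements

Source, which monitors updates to the file content: exec 'tail -f file'
Sink, which writes to the HDFS file system: hdfs sink
Channel between the source and the sink: either a file channel or a memory channel can be used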

Writing the Flume configuration file

Write the configuration file on hadoop03

cd /bigdata/install/flume-1.9.0/conf
vim tail-file.conf

Configuration file content

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure the exec source source1, which runs tail -f
# (tail -F would keep following the file name across log rotation)
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /bigdata/install/mydata/flume/taillogs/access_log

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hadoop01:8020/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
# Maximum number of open files; once 5000 is exceeded, the oldest file is closed
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
# Timeout, in seconds, for adding an event to the channel or removing one from it
agent1.channels.channel1.keep-alive = 120
# Setting the capacity too high has little visible effect
# (kept on its own line: a properties file treats a trailing "##..." as part of the value)
agent1.channels.channel1.capacity = 5000
agent1.channels.channel1.transactionCapacity = 4500

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

Official documentation for the components:
HDFS Sink
Memory Channel

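Start Flume

cd /bigdata/install/flume-1.9.0
bin/flume-ng agent -c conf -f conf/tail-file.conf -n agent1 -Dflume.root.logger=INFO,console

Write a shell script that periodically appends to the file

mkdir -p /home/hadoop/shells/
cd /home/hadoop/shells/
vim tail-file.sh

with the following content

#!/bin/bash
# Append the current date to the log file every 0.5 seconds
while true
do
  date >> /bigdata/install/mydata/flume/taillogs/access_log
  sleep 0.5
done

Create the directory

mkdir -p /bigdata/install/mydata/flume/taillogs/

Start the script

chmod u+x tail-file.sh
sh /home/hadoop/shells/tail-file.sh

If the agent was started before access_log existed, tail -f will most likely have exited already; in that case restart the agent once the script is writing to the file.
To verify the result, check the HDFS web UI and the console, as in the screenshots below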

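Flume Hands-On Case -- Reading a File from HDFS into a Local Directory

Requirement analysis

We read a file from a particular directory on HDFS into a particular local directory
Based on the requirement, first define the following three elements

Data source component (source), which follows a file on HDFS: exec 'hdfs dfs -tail -f'
Delivery component (sink): file roll sink
Channel component (channel): either a file channel or a memory channel can be used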

Writing the Flume configuration file

Write the configuration file:

cd /bigdata/install/flume-1.9.0/conf/
vim hdfs2local.conf

with the following content

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source: an exec source that follows a file on HDFS
a1.sources.r1.type = exec
a1.sources.r1.command = hdfs dfs -tail -f /hdfs2flume/test/a.txt

# Sink configuration: write events to rolling files in a local directory
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /bigdata/install/mydata/flume/hdfs2local
# Roll to a new local file every 3600 seconds
a1.sinks.k1.sink.rollInterval = 3600
a1.sinks.k1.sink.pathManager.prefix = event-
a1.sinks.k1.sink.serializer = TEXT
a1.sinks.k1.sink.batchSize = 100

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events taken from the source or handed to the sink per transaction
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

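Prepare the file on HDFS

vi a.txt

# Enter the following content, save, and upload the file to HDFS
1  zhangsan  21
2  lisi  22
3  wangwu  23
4  zhaoliu  24
5  guangyunchang  25
6  gaojianli  27

# Create the target directory first in case it does not exist yet
hdfs dfs -mkdir -p /hdfs2flume/test
hdfs dfs -put ./a.txt /hdfs2flume/test/a.txt

mkdir -p /bigdata/install/mydata/flume/hdfs2local

Start Flume

cd /bigdata/install/flume-1.9.0
bin/flume-ng agent -c ./conf -f ./conf/hdfs2local.conf -n a1 -Dflume.root.logger=INFO,console

Append content to a.txt on HDFS and check the local directory, as shown in the figure below

[Figure omitted]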
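
The verification step asks for content to be appended to the file on HDFS. One way to do that, sketched here under assumptions, is hdfs dfs -appendToFile reading from standard input. The function name append_line_to_hdfs and the HDFS_CMD override are hypothetical (the override only exists so the command can be previewed without a cluster), and the sample row in the usage note is made up.

```shell
#!/bin/bash
# append_line_to_hdfs LINE HDFS_PATH
# Append a single line of text to an existing file on HDFS.
# Set HDFS_CMD=echo to preview the command instead of running it.
append_line_to_hdfs() {
  local line="$1" dst="$2"
  # "-" tells appendToFile to read the data from standard input
  printf '%s\n' "$line" | "${HDFS_CMD:-hdfs}" dfs -appendToFile - "$dst"
}
```

For example, `append_line_to_hdfs '7  jingke  28' /hdfs2flume/test/a.txt` adds one more row; after a short delay the new line should appear in the newest event- file under /bigdata/install/mydata/flume/hdfs2local.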

