This section covers three common requirements:

Collect data by listening on a network port

Collect data by monitoring a file

Monitor a file and forward its data to another machine

Flume installation prerequisites

    Java Runtime Environment - Java 1.7 or later

    Memory - Sufficient memory for configurations used by sources, channels or sinks

    Disk Space - Sufficient disk space for configurations used by channels or sinks

    Directory Permissions - Read/Write permissions for directories used by agent

Install the JDK

Download

Extract to ~/app

Add Java to the system environment variables in ~/.bash_profile:

export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144

export PATH=$JAVA_HOME/bin:$PATH

Run source ~/.bash_profile so the configuration takes effect

Verify: java -version

Install Flume

Download

Extract to ~/app

Add Flume to the system environment variables in ~/.bash_profile:

export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin

export PATH=$FLUME_HOME/bin:$PATH

Run source ~/.bash_profile so the configuration takes effect

Configure flume-env.sh: export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144

Verify: flume-ng version

example.conf: A single-node Flume configuration

The key to using Flume is writing configuration files:

A) Configure the source

B) Configure the channel

C) Configure the sink

D) Wire the three components together

Requirement 1: collect data from a specified network port and print it to the console

a1: the agent name

r1: the source name

k1: the sink name

c1: the channel name

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = hadoop000

a1.sources.r1.port = 44444

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1
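The memory channel also accepts optional capacity tuning; a minimal sketch (the values below are illustrative, not part of the original example):

# Optional: max events buffered in the channel, and max events per transaction

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100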

Start the agent:

flume-ng agent \

--name a1 \

--conf $FLUME_HOME/conf \

--conf-file $FLUME_HOME/conf/example.conf \

-Dflume.root.logger=INFO,console

Test with telnet: telnet hadoop000 44444

Typing hello produces an event like:

Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }

(the body bytes are the ASCII codes for "hello" followed by a carriage return)

An Event is the basic unit of data transfer in Flume

Event = optional header + byte array

Requirement 2: monitor a file, collect newly appended data in real time, and print it to the console

Agent selection: exec source + memory channel + logger sink

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /home/hadoop/data/data.log

a1.sources.r1.shell = /bin/sh -c

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start the agent:

flume-ng agent \

--name a1  \

--conf $FLUME_HOME/conf  \

--conf-file $FLUME_HOME/conf/exec-memory-logger.conf \

-Dflume.root.logger=INFO,console

Test: open a second terminal and run echo hello >> /home/hadoop/data/data.log (the file the exec source is tailing), then check whether the first terminal picks it up

Note: for an offline pipeline, store the collected data in HDFS; for a real-time pipeline, write it to Kafka (this requires setting the sink type and server)

For the above, you only need to change a1.sinks.k1.type (plus the sink-specific properties), as sketched below
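For example, a minimal sketch of swapping in an HDFS sink or a Kafka sink (the HDFS path, broker list, and topic name are illustrative assumptions):

# HDFS sink: land the collected events in HDFS for offline processing

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = hdfs://hadoop000:8020/flume/events/%Y%m%d

a1.sinks.k1.hdfs.fileType = DataStream

a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Kafka sink (Flume 1.6 property names): publish events to a Kafka topic for real-time processing

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink

a1.sinks.k1.brokerList = hadoop000:9092

a1.sinks.k1.topic = flume-topic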

Requirement 3: collect logs on server A and ship them to server B in real time

Technology selection:

Machine A: exec source + memory channel + avro sink

Machine B: avro source + memory channel + logger sink

Architecture diagram: [image omitted] the exec-memory-avro agent on machine A sends events over Avro to the avro-memory-logger agent on machine B

Flume configuration:

On the first machine:

Configuration file: exec-memory-avro.conf

exec-memory-avro.sources = exec-source

exec-memory-avro.sinks = avro-sink

exec-memory-avro.channels = memory-channel

exec-memory-avro.sources.exec-source.type = exec

exec-memory-avro.sources.exec-source.command = tail -F /home/hadoop/data/data.log

exec-memory-avro.sources.exec-source.shell = /bin/sh -c

exec-memory-avro.sinks.avro-sink.type = avro

exec-memory-avro.sinks.avro-sink.hostname = hadoop000

exec-memory-avro.sinks.avro-sink.port = 44444

exec-memory-avro.channels.memory-channel.type = memory

exec-memory-avro.sources.exec-source.channels = memory-channel

exec-memory-avro.sinks.avro-sink.channel = memory-channel

On the second machine (the avro sink's hostname and port on machine A must match the avro source's bind and port here):

Configuration file: avro-memory-logger.conf

avro-memory-logger.sources = avro-source

avro-memory-logger.sinks = logger-sink

avro-memory-logger.channels = memory-channel

avro-memory-logger.sources.avro-source.type = avro

avro-memory-logger.sources.avro-source.bind = hadoop000

avro-memory-logger.sources.avro-source.port = 44444

avro-memory-logger.sinks.logger-sink.type = logger

avro-memory-logger.channels.memory-channel.type = memory

avro-memory-logger.sources.avro-source.channels = memory-channel

avro-memory-logger.sinks.logger-sink.channel = memory-channel

Start avro-memory-logger first (the avro sink needs a listening avro source to connect to):

flume-ng agent \

--name avro-memory-logger  \

--conf $FLUME_HOME/conf  \

--conf-file $FLUME_HOME/conf/avro-memory-logger.conf \

-Dflume.root.logger=INFO,console

Then start exec-memory-avro:

flume-ng agent \

--name exec-memory-avro  \

--conf $FLUME_HOME/conf  \

--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \

-Dflume.root.logger=INFO,console

Implementation flow for requirement 3:

1) Machine A monitors a file; when users visit the main site, user-behavior logs are appended to access.log

2) The avro sink sends newly produced log entries to the hostname and port that the corresponding avro source listens on

3) The agent with the avro source prints the logs to the console (or sends them to Kafka)
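To verify the whole pipeline, append a line to the monitored file on machine A and watch the console of the avro-memory-logger agent on machine B:

echo hello >> /home/hadoop/data/data.log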