This section covers three common requirements:

Collect data by listening on a network port

Collect data by monitoring a file

Monitor a file and forward its data to another machine

Flume installation prerequisites

    Java Runtime Environment - Java 1.7 or later

    Memory - Sufficient memory for configurations used by sources, channels or sinks

    Disk Space - Sufficient disk space for configurations used by channels or sinks

    Directory Permissions - Read/Write permissions for directories used by agent

Install the JDK

Download

Extract to ~/app

Add Java to the system environment variables in ~/.bash_profile:

export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144

export PATH=$JAVA_HOME/bin:$PATH

Run source ~/.bash_profile so the configuration takes effect

Verify: java -version

Install Flume

Download

Extract to ~/app

Add Flume to the system environment variables in ~/.bash_profile:

export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin

export PATH=$FLUME_HOME/bin:$PATH

Run source ~/.bash_profile so the configuration takes effect

Configure flume-env.sh: export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144

Verify: flume-ng version

example.conf: A single-node Flume configuration

The key to using Flume is writing configuration files:

A) Configure the source

B) Configure the channel

C) Configure the sink

D) Wire the three components together

Requirement 1: collect data from a specified network port and print it to the console

a1: the agent name

r1: the source name

k1: the sink name

c1: the channel name

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = hadoop000

a1.sources.r1.port = 44444

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1
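The memory channel also accepts optional capacity tuning; a minimal sketch (the values below are illustrative, not part of the original example):

# Optional: max events buffered in the channel, and max events per transaction

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100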

Start the agent:

flume-ng agent \

--name a1 \

--conf $FLUME_HOME/conf \

--conf-file $FLUME_HOME/conf/example.conf \

-Dflume.root.logger=INFO,console

Test with telnet: telnet hadoop000 44444

Typing hello produces an event like:

Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }

(the body bytes are the ASCII codes for "hello" followed by a carriage return)

An Event is the basic unit of data transfer in Flume

Event = optional header + byte array

Requirement 2: monitor a file, collect newly appended data in real time, and print it to the console

Agent selection: exec source + memory channel + logger sink

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /home/hadoop/data/data.log

a1.sources.r1.shell = /bin/sh -c

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start the agent:

flume-ng agent \

--name a1  \

--conf $FLUME_HOME/conf  \

--conf-file $FLUME_HOME/conf/exec-memory-logger.conf \

-Dflume.root.logger=INFO,console

Test: open a second terminal and run echo hello >> /home/hadoop/data/data.log (the file the exec source is tailing), then check whether the first terminal picks it up

Note: for an offline pipeline, store the collected data in HDFS; for a real-time pipeline, write it to Kafka (this requires setting the sink type and server)

For the above, you only need to change a1.sinks.k1.type (plus the sink-specific properties), as sketched below
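For example, a minimal sketch of swapping in an HDFS sink or a Kafka sink (the HDFS path, broker list, and topic name are illustrative assumptions):

# HDFS sink: land the collected events in HDFS for offline processing

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = hdfs://hadoop000:8020/flume/events/%Y%m%d

a1.sinks.k1.hdfs.fileType = DataStream

a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Kafka sink (Flume 1.6 property names): publish events to a Kafka topic for real-time processing

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink

a1.sinks.k1.brokerList = hadoop000:9092

a1.sinks.k1.topic = flume-topic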

Requirement 3: collect logs on server A and ship them to server B in real time

Technology selection:

Machine A: exec source + memory channel + avro sink

Machine B: avro source + memory channel + logger sink

Architecture diagram: [image omitted] the exec-memory-avro agent on machine A sends events over Avro to the avro-memory-logger agent on machine B

Flume configuration:

On the first machine:

Configuration file: exec-memory-avro.conf

exec-memory-avro.sources = exec-source

exec-memory-avro.sinks = avro-sink

exec-memory-avro.channels = memory-channel

exec-memory-avro.sources.exec-source.type = exec

exec-memory-avro.sources.exec-source.command = tail -F /home/hadoop/data/data.log

exec-memory-avro.sources.exec-source.shell = /bin/sh -c

exec-memory-avro.sinks.avro-sink.type = avro

exec-memory-avro.sinks.avro-sink.hostname = hadoop000

exec-memory-avro.sinks.avro-sink.port = 44444

exec-memory-avro.channels.memory-channel.type = memory

exec-memory-avro.sources.exec-source.channels = memory-channel

exec-memory-avro.sinks.avro-sink.channel = memory-channel

On the second machine (the avro sink's hostname and port on machine A must match the avro source's bind and port here):

Configuration file: avro-memory-logger.conf

avro-memory-logger.sources = avro-source

avro-memory-logger.sinks = logger-sink

avro-memory-logger.channels = memory-channel

avro-memory-logger.sources.avro-source.type = avro

avro-memory-logger.sources.avro-source.bind = hadoop000

avro-memory-logger.sources.avro-source.port = 44444

avro-memory-logger.sinks.logger-sink.type = logger

avro-memory-logger.channels.memory-channel.type = memory

avro-memory-logger.sources.avro-source.channels = memory-channel

avro-memory-logger.sinks.logger-sink.channel = memory-channel

Start avro-memory-logger first (the avro sink needs a listening avro source to connect to):

flume-ng agent \

--name avro-memory-logger  \

--conf $FLUME_HOME/conf  \

--conf-file $FLUME_HOME/conf/avro-memory-logger.conf \

-Dflume.root.logger=INFO,console

Then start exec-memory-avro:

flume-ng agent \

--name exec-memory-avro  \

--conf $FLUME_HOME/conf  \

--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \

-Dflume.root.logger=INFO,console

Implementation flow for requirement 3:

1) Machine A monitors a file; when users visit the main site, user-behavior logs are appended to access.log

2) The avro sink sends newly produced log entries to the hostname and port that the corresponding avro source listens on

3) The agent with the avro source prints the logs to the console (or sends them to Kafka)
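To verify the whole pipeline, append a line to the monitored file on machine A and watch the console of the avro-memory-logger agent on machine B:

echo hello >> /home/hadoop/data/data.log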