Flume telnet to log: flume-agent

Reposted

墨舞青云 2024-08-28 21:15:13


Flume Agent Internals


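Key components (search the official documentation for the terms in parentheses):
1) ChannelSelector (search for "flume channel selector")
A ChannelSelector decides which Channel an Event is sent to. There are two types, Replicating and Multiplexing; Replicating is the default.
A Replicating selector sends the same Event to all of the source's Channels.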

Examples:

a1.sources = r1
a1.channels = c1 c2 c3
# This line can be omitted because replicating is the default.
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

A Multiplexing selector routes different Events to different Channels according to configured rules.

Examples:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
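# An Event's header is a map of key-value string pairs; the selector looks up the key "state".
a1.sources.r1.selector.header = state
# The header value CZ routes the Event to c1.
a1.sources.r1.selector.mapping.CZ = c1
# The header value US routes the Event to both c2 and c3.
a1.sources.r1.selector.mapping.US = c2 c3
# Any value that matches no mapping routes the Event to the default channel c4.
a1.sources.r1.selector.default = c4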
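
The routing rule in this example can be sketched in plain Java. This is only an illustration of the lookup, not Flume's own selector class; the names MultiplexingSketch and route are made up for the sketch, and the table simply mirrors the CZ, US, and default settings shown above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multiplexing selector; not a Flume class.
class MultiplexingSketch {
    // Mirrors selector.mapping.CZ = c1 and selector.mapping.US = c2 c3.
    static final Map<String, List<String>> MAPPING = new LinkedHashMap<>();
    // Mirrors selector.default = c4.
    static final List<String> DEFAULT_CHANNELS = List.of("c4");

    static {
        MAPPING.put("CZ", List.of("c1"));
        MAPPING.put("US", List.of("c2", "c3"));
    }

    // Returns the channels for an event, based on its "state" header.
    static List<String> route(Map<String, String> headers) {
        String state = headers.get("state");
        // A missing or unmapped value falls through to the default channel.
        return MAPPING.getOrDefault(state, DEFAULT_CHANNELS);
    }

    public static void main(String[] args) {
        System.out.println(route(Map.of("state", "CZ"))); // [c1]
        System.out.println(route(Map.of("state", "US"))); // [c2, c3]
        System.out.println(route(Map.of("state", "DE"))); // [c4]
        System.out.println(route(Map.of()));              // [c4]
    }
}
```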

2) SinkProcessor (search for "Flume Sink Processors")
There are three SinkProcessor types: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor: handles a single Sink.

LoadBalancingSinkProcessor works on a Sink Group and balances the load across its sinks, so that no single sink is overloaded. (This is the one used most often.)

Examples:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Optional, default false: back off from failed sinks. Used together with the optional processor.selector.maxTimeOut (in milliseconds).
a1.sinkgroups.g1.processor.backoff = true
# Optional, default round_robin. The alternatives are random or the FQCN of a user-defined selector.
a1.sinkgroups.g1.processor.selector = random

FailoverSinkProcessor also works on a Sink Group and provides failover: one sink is active and the others stand by.

Examples:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Upper limit, in milliseconds, on the backoff period for a failed sink.
a1.sinkgroups.g1.processor.maxpenalty = 10000

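The processor.selector.maxTimeOut parameter: once a sink is known to have failed, the processor backs off for a period and does not try that sink again until the period ends. The backoff period grows exponentially with the number of failed attempts, and this parameter caps it, so that a sink that has come back up does not have to wait too long before data is sent to it again. In practice this parameter is usually set.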
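
The capped exponential growth described above can be sketched as follows. This is a minimal illustration, not Flume's code: the 1000 ms base delay, the doubling rule, and the names BackoffSketch and backoffMillis are assumptions chosen for the example, and 30000 ms is used only as a sample cap.

```java
// Hypothetical sketch of a capped exponential backoff; not a Flume class.
class BackoffSketch {
    // Assumed base delay for illustration only.
    static final long BASE_DELAY_MS = 1000L;

    // Doubles the delay for every consecutive failure, up to maxTimeOutMs.
    static long backoffMillis(int consecutiveFailures, long maxTimeOutMs) {
        // Clamp the shift so the left shift cannot overflow a long.
        int shift = Math.min(Math.max(consecutiveFailures, 0), 30);
        long delay = BASE_DELAY_MS << shift;
        return Math.min(delay, maxTimeOutMs);
    }

    public static void main(String[] args) {
        // Prints 1000, 2000, 4000, 8000, 16000, 30000, 30000 (one per line).
        for (int failures = 0; failures <= 6; failures++) {
            System.out.println(backoffMillis(failures, 30000L));
        }
    }
}
```

Without the cap, the sixth consecutive failure in this sketch would already mean a 32 second wait, and the wait would keep doubling from there.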

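Why use multiple Interceptors?

A single Interceptor is more efficient but less flexible: when a flow needs only one specific interception step, or only part of what a large interceptor does, a monolithic interceptor is awkward to work with. Splitting the functions into separate interceptors lets each flow apply only the ones it needs, which is more convenient and more flexible.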

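Flume Topologies

Simple chaining (rarely used; a basic architecture)

This mode connects several Flume agents in sequence, from the first source through to the storage system that the final sink writes to. Chaining too many agents is not recommended: a long chain slows the transfer, and if any agent in the chain goes down, the whole pipeline is affected.

Replication and multiplexing

To send one copy of the data to several storage systems, use multiple agents together with a Replicating ChannelSelector.

Flume supports sending an event flow to one or more destinations. In this mode the same data can be replicated into several channels, or different data can be distributed to different channels, and each sink can deliver to a different destination.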

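Load balancing and failover

Flume can group several sinks logically into one sink group. Combined with different SinkProcessors, a sink group provides load balancing or failover.

For example, if each sink feeds a downstream agent, each of those agents has its own channel as a buffer, which greatly relieves the problem of a slow sink causing the channel to fill up.

It also guards against a single failed sink stopping the data flow. (In practice it is used more often for load balancing.)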

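Aggregation

This is the most common mode and a very practical one. Web applications are typically spread over hundreds of servers, and sometimes thousands or tens of thousands, and the logs they produce are troublesome to handle. This Flume layout addresses the problem well: each server runs one Flume agent that collects its logs and forwards them to one or more central collector agents, which then upload the data to HDFS, Hive, HBase, and so on for log analysis.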

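Flume Enterprise Development Cases

Replication

1. Requirements

Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. At the same time Flume-1 passes the same content to Flume-3, which writes it to the local file system.

2. Analysis

Agents pass data to each other through an Avro Sink and an Avro Source.

Note that a single Memory Channel feeding two sinks does not work here: each event taken from a channel is delivered to only one sink, and both downstream agents must receive the complete data.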
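
A rough sketch of why one shared channel is not enough. It models a channel as a plain queue; the class and method names are hypothetical, and the strict alternation between the two sinks is a simplifying assumption (in a real agent the split depends on timing).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of channels as queues; not Flume code.
class SharedChannelSketch {
    // Two sinks drain ONE channel: every event is delivered to only one of them.
    static List<List<String>> drainShared(List<String> events) {
        Deque<String> channel = new ArrayDeque<>(events);
        List<String> sinkA = new ArrayList<>();
        List<String> sinkB = new ArrayList<>();
        boolean turnA = true;
        while (!channel.isEmpty()) {
            // poll() removes the event, so the other sink never sees it.
            (turnA ? sinkA : sinkB).add(channel.poll());
            turnA = !turnA;
        }
        return List.of(sinkA, sinkB);
    }

    // A replicating selector copies every event into EACH channel,
    // so each sink drains its own complete copy.
    static List<List<String>> drainReplicated(List<String> events) {
        Deque<String> c1 = new ArrayDeque<>(events);
        Deque<String> c2 = new ArrayDeque<>(events);
        return List.of(new ArrayList<>(c1), new ArrayList<>(c2));
    }

    public static void main(String[] args) {
        List<String> events = List.of("a", "b", "c");
        System.out.println(drainShared(events));     // [[a, c], [b]]
        System.out.println(drainReplicated(events)); // [[a, b, c], [a, b, c]]
    }
}
```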

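3. Implementation steps

Step 1: create the three configuration files.

In the job directory, create a group1 directory (to keep this case separate from the later ones):

# One source that reads the log file, two channels, and two sinks that feed flume2 and flume3
touch flume1.conf
    
# A source that receives the upstream Flume's output, and a sink that writes to HDFS
touch flume2.conf
    
# A source that receives the upstream Flume's output, and a sink that writes to a local directory
touch flume3.conf

vim flume1.conf

Configure one source that reads the log file, two channels, and two Avro sinks that feed flume-flume-hdfs and flume-flume-dir.

Edit the configuration file and add the following content (for reference, search the official documentation for "Avro Source"):

flume1.conf: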
    
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Replicate the data flow to all channels.
# replicating is the default, so this parameter can also be omitted.
a1.sources.r1.selector.type = replicating
    
# Source    
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# This hive.log was created by hand for the test, because Hive's real log file is large.
a1.sources.r1.filegroups.f1 = /opt/module/data/hive.log
a1.sources.r1.positionFile = /opt/module/flume/position/position1.json
    
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
    
# Describe the sink
# The avro sink is the sender of the data.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
    
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142    
    
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
# A sink reads from exactly one channel, so the property name is channel (singular).
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

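Configure flume2.conf

Configure a source that receives the upstream Flume's output, and a sink that writes to HDFS.

Edit the configuration file and add the following content: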

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
#Source
# The avro source is the service that receives the data.
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
    
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

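# Describe the sink
a2.sinks.k1.type = hdfs
# The HDFS target directory does not need to exist; it is created automatically.
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/group1/%Y%m%d/%H
# Prefix for the uploaded files
a2.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down, so that directories roll by time
a2.sinks.k1.hdfs.round = true
# How many time units pass before a new directory is created
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Use the local timestamp instead of one from the event header
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS; kept no larger than the channel's transactionCapacity (100 here)
a2.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain output, and compressed types are also supported
a2.sinks.k1.hdfs.fileType = DataStream
# Interval, in seconds, after which a new file is started
a2.sinks.k1.hdfs.rollInterval = 30
# Roll each file at about 128 MB (the value is in bytes)
a2.sinks.k1.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of Events
a2.sinks.k1.hdfs.rollCount = 0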
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1    
a2.sinks.k1.channel = c1
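
The hdfs.path in this configuration contains the escape sequences %Y%m%d/%H. A minimal sketch of how such a directory name follows from a timestamp, assuming the timestamp is rounded down to the hour as the round, roundValue, and roundUnit settings request. This is not Flume's implementation: HdfsPathSketch and bucketPath are hypothetical names, and the time zone is passed in explicitly so the result is reproducible.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of time-based directory naming; not a Flume class.
class HdfsPathSketch {
    // yyyyMMdd/HH corresponds to the %Y%m%d/%H escapes in hdfs.path.
    static final DateTimeFormatter DIR = DateTimeFormatter.ofPattern("yyyyMMdd/HH");

    static String bucketPath(long epochMillis, ZoneId zone) {
        // Round the event time down to the start of its hour.
        ZonedDateTime rounded = Instant.ofEpochMilli(epochMillis)
                .atZone(zone)
                .truncatedTo(ChronoUnit.HOURS);
        return "/group1/" + DIR.format(rounded);
    }

    public static void main(String[] args) {
        // With useLocalTimeStamp = true the agent's own clock supplies the time.
        System.out.println(bucketPath(System.currentTimeMillis(), ZoneId.systemDefault()));
    }
}
```

All events that arrive within the same hour therefore land in the same directory, and a new directory appears when the hour changes.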

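Configure flume3.conf

Configure a source that receives the upstream Flume's output, and a sink that writes to a local directory (search the official documentation for "File Roll Sink").

Edit the configuration file and add the following content: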

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = file_roll
# The local directory must already exist; create it before starting the agent.
a3.sinks.k1.sink.directory = /opt/module/datas/group1 
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

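Create the hive.log file in /opt/module/data: touch hive.log

Open several console windows, because each flume-ng agent blocks the console it runs in.

Note: start the downstream agents first and the upstream agent last. The Avro Source acts as the server and the Avro Sink as the client, so the server side must be up first.

As an experiment, start a1 first: it reports errors because connections to ports 4141 and 4142 are refused.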

bin/flume-ng agent -c conf/ -f job/group1/flume3.conf -n a3

bin/flume-ng agent -c conf/ -f job/group1/flume2.conf -n a2

bin/flume-ng agent -c conf/ -f job/group1/flume1.conf -n a1

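Start Hadoop. (HDFS must be running before flume2's HDFS sink can write.)

Test:

Append two lines to hive.log:

echo hello >> hive.log
echo chenxu >> hive.log

Check the local output in datas/group1 and, on HDFS, the logs- files under the group1 directory.

Confirm that the files contain the appended lines.

Note: the local directory keeps rolling new files (the File Roll Sink rolls on a timer, every 30 seconds by default), whereas on HDFS no new file is rolled as long as no new Event arrives.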

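Load balancing and failover

1. Requirements

Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3, and a FailoverSinkProcessor provides failover.

FailoverSinkProcessor: search the official site for "Failover Sink Processor".

Define the group first, then set the priorities.

Examples:

# Define the sink group
a1.sinkgroups = g1
# The sinks that belong to the group (the sink names must be declared in a1.sinks)
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# Priorities; the sink with the larger value is preferred
a1.sinkgroups.g1.processor.priority.k1 = 5
# The per-sink key works like the per-group keys (f1, f2) of the Taildir Source
a1.sinkgroups.g1.processor.priority.k2 = 10
# Backoff limit: a failed sink is kept out of consideration for at most 10 seconds before it is retried
a1.sinkgroups.g1.processor.maxpenalty = 10000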
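
The selection rule behind these priorities can be sketched in a few lines of Java. This is an illustration under simple assumptions, not Flume's FailoverSinkProcessor: FailoverSketch and activeSink are hypothetical names, and the set of failed sinks is passed in by hand instead of being tracked with backoff timers.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of priority-based failover; not a Flume class.
class FailoverSketch {
    // Picks the live sink with the largest priority value, if any sink is live.
    static Optional<String> activeSink(Map<String, Integer> priorities, Set<String> failed) {
        return priorities.entrySet().stream()
                .filter(e -> !failed.contains(e.getKey()))
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = Map.of("k1", 5, "k2", 10);
        System.out.println(activeSink(priorities, Set.of()));           // Optional[k2]
        System.out.println(activeSink(priorities, Set.of("k2")));       // Optional[k1]
        System.out.println(activeSink(priorities, Set.of("k1", "k2"))); // Optional.empty
    }
}
```

With the values in the example, k2 (priority 10) is used first, and k1 (priority 5) takes over only while k2 is marked as failed.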

2. Analysis

3. Implementation steps

(1) Preparation

Create a group2 directory under /opt/module/flume/job, then create the three files in it:

touch flume1.conf

touch flume2.conf

touch flume3.conf

vim flume1.conf

Edit the configuration file and add the following content:

flume1.conf:
    
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
a1.sinkgroups = g1    

# Source    
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
# Listen on port 44444.
a1.sources.r1.port = 44444
    
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
      
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
    
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142    

#Sink Group
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
    
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
# A sink reads from exactly one channel, so the property name is channel (singular).
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Configure flume2.conf

Edit the configuration file and add the following content:

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
#Source
# The avro source is the service that receives the data.
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
    
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Describe the sink
a2.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1    
a2.sinks.k1.channel = c1

Configure flume3.conf

Edit the configuration file and add the following content:

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

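4. Open several console windows

Note: start the downstream agents first and the upstream agent last. The Avro Source acts as the server and the Avro Sink as the client, so the server side must be up first.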

bin/flume-ng agent -c conf/ -f job/group2/flume3.conf -n a3 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf/ -f job/group2/flume2.conf -n a2 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf/ -f job/group2/flume1.conf -n a1

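Use netcat to send content to port 44444 on the local machine:

nc localhost 44444

Watch the console logs of Flume2 and Flume3.

All of the content appears in a single console. With the priorities above, k2 (priority 10, port 4142, Flume3) outranks k1 (priority 5, port 4141, Flume2), so that console should be Flume3's.

Kill the agent that is currently receiving the events and watch the other agent's console take over.

Note: use jps -ml to find the Flume processes.

For the load-balancing case, only the sink group section of flume1.conf needs to change.

Search the official site for "Load balancing Sink Processor".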

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
    
# Replace the failover lines with these three lines.
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

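The other steps are the same. The only difference is that the log output no longer goes to a single console; it is distributed randomly across the two.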
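
The case above uses the random selector, whose output cannot be predicted. For comparison, the round_robin alternative can be sketched as below. The names are hypothetical and the sketch leaves out backoff handling for failed sinks; it only shows the rotation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of round-robin sink selection; not a Flume class.
class RoundRobinSketch {
    private final List<String> sinks;
    private int next = 0;

    RoundRobinSketch(List<String> sinks) {
        this.sinks = new ArrayList<>(sinks);
    }

    // Returns the sinks in turn, wrapping around at the end of the list.
    String nextSink() {
        String sink = sinks.get(next);
        next = (next + 1) % sinks.size();
        return sink;
    }

    public static void main(String[] args) {
        RoundRobinSketch selector = new RoundRobinSketch(List.of("k1", "k2"));
        // Prints k1, k2, k1, k2 (one per line).
        for (int i = 0; i < 4; i++) {
            System.out.println(selector.nextSink());
        }
    }
}
```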

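Aggregation

1. Requirements

Flume-1 on hadoop102 monitors the file /opt/module/data/group.log.

Flume-2 on hadoop103 monitors the data stream on a port.

Flume-1 and Flume-2 send their data to Flume-3 on hadoop104, and Flume-3 prints the final data to the console.

Flume-3 is the server side: Flume-1 and Flume-2 act as clients that connect to it, and Flume-3 reads the data on its own host. (This client/server arrangement shows up in the configuration files below.)

2. Analysis

3. Implementation steps

(1) Preparation

On hadoop102, create a group3 directory under /opt/module/flume/job.

Distribute Flume to the other nodes.

touch flume1.conf

touch flume2.conf

touch flume3.conf

The three files belong to hadoop102, hadoop103, and hadoop104 respectively.

vim flume1.conf

On hadoop102, create the group.log file in /opt/module/data/.

Edit the configuration file and add the following content: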

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 
a1.channels = c1   
   
# Source    
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# group.log was created by hand for this test.
a1.sources.r1.filegroups.f1 = /opt/module/data/group.log
a1.sources.r1.positionFile = /opt/module/flume/position/position2.json
    
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
      
# Describe the sink
a1.sinks.k1.type = avro
# Send to hadoop104
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141 

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
# A sink reads from exactly one channel, so the property name is channel (singular).
a1.sinks.k1.channel = c1

Configure flume2.conf

Edit the configuration file and add the following content:

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
#Netcat Source    
a2.sources.r1.type = netcat
a2.sources.r1.bind = localhost
# Listen on port 44444.
a2.sources.r1.port = 44444 
    
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
# Send to hadoop104
a2.sinks.k1.port = 4141 
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1    
a2.sinks.k1.channel = c1

Configure flume3.conf

Edit the configuration file and add the following content:

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe the source
a3.sources.r1.type = avro
# Host that receives the data
a3.sources.r1.bind = hadoop104
# Port that receives the data
a3.sources.r1.port = 4141

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

4. Open several console windows

Run the configuration files, one on each machine. Start the agent on hadoop104 (the server side) first:

(hadoop104) bin/flume-ng agent -c conf/ -f job/group3/flume3.conf -n a3 -Dflume.root.logger=INFO,console

(hadoop103, monitors the port) bin/flume-ng agent -c conf/ -f job/group3/flume2.conf -n a2

(hadoop102, monitors the file) bin/flume-ng agent -c conf/ -f job/group3/flume1.conf -n a1

On hadoop102, append content to group.log: echo hello >> group.log

On hadoop103, send content to port 44444: run nc localhost 44444 and type some lines.

Check the console on hadoop104 for the output.

Custom Interceptor

1. Custom Interceptor

1) Requirements

Flume collects local server logs, and logs of different types must be sent to different analysis systems.

2) Analysis

In real development one server may produce many types of logs, and different types may need to go to different analysis systems. This calls for the Multiplexing structure from the Flume topologies. Multiplexing routes each event to a Channel according to the value stored under a particular key in the event's Header, so we need a custom Interceptor that sets a different value under that key for each type of event.

In this case, port data stands in for logs, and whether a line contains "hello" stands in for the log type (lines such as "chenxu" represent the other type). The custom interceptor tells the two kinds of lines apart and sends them to different analysis systems (Channels).


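In the Hadoop-Study-01 project, create a sub-project named Flume-Study-01.

Create the package com.chenxu.Interceptor.

Create the class TypeInterceptor and implement the four methods of the Interceptor interface.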

package com.chenxu.Interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

    // Holds the events of the current batch. It is a field because
    // intercept(List<Event>) is called repeatedly, and it is created in
    // initialize() so that nothing is allocated until the interceptor is used.
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<>();
    }

    @Override
    public Event intercept(Event event) {

        // Get the event's headers (a key-value map).
        Map<String, String> headers = event.getHeaders();

        // Get the event's body (a byte array) as text.
        String body = new String(event.getBody(), StandardCharsets.UTF_8);

        // Tag the event by its content; the channel selector routes on this header.
        if (body.contains("hello")) {
            headers.put("type", "op"); // selector.mapping.op = c1
        } else {
            headers.put("type", "np"); // selector.mapping.np = c2
        }

        return event;
    }

    // Batch version: delegates to intercept(Event) and collects the results.
    @Override
    public List<Event> intercept(List<Event> events) {

        // 1. Clear the results of the previous batch.
        addHeaderEvents.clear();

        // 2. Add the header to every event in the batch.
        for (Event event : events) {
            addHeaderEvents.add(intercept(event));
        }

        return addHeaderEvents;
    }

    @Override
    public void close() {

    }

    // The nested class does not have to be called Builder, but the name after
    // the $ in the configuration must match it.
    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

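For the interceptor settings, see the official documentation: search for "Flume Interceptors" (any page with an interceptor configuration example also works). A sample from there: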

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
# A custom Interceptor is referenced the same way, by its class name followed by $Builder; Builder is a static nested class inside the interceptor class.
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1
    
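Give each interceptor a name and specify its type; custom interceptors are configured in much the same way.
A custom interceptor must implement the org.apache.flume.interceptor.Interceptor interface.
Its intercept method takes an Event and returns an Event.

Package the finished TypeInterceptor as a jar and copy it into /opt/module/flume/lib/ on the cluster.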

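(3) Edit the Flume configuration files

Recap of the requirement: Flume collects local server logs, and logs of different types must be sent to different analysis systems.

To make the two outputs easy to tell apart, the console output is placed on hadoop103 and hadoop104.

In the job directory on each of the three servers, create a group4 directory, and create flume1.conf, flume2.conf, and flume3.conf respectively.

Configure flume1.conf: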

# Name the components on this agent
a1.sources = r1
a1.channels = c1 c2 
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
    
#Interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.chenxu.Interceptor.TypeInterceptor$Builder    
    
#Channel Selector
# A Multiplexing Channel Selector is required here.
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.op = c1
a1.sources.r1.selector.mapping.np = c2 
# No a1.sources.r1.selector.default is needed here, because the interceptor always sets type to op or np.
    

# Use a channel which buffers events in memory   
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4142

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4142
    
# Bind the source and sink to the channel   
a1.sources.r1.channels = c1 c2 
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Configure flume2.conf

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
# Describe the source
a2.sources.r1.type = avro
# Host that receives the data
a2.sources.r1.bind = hadoop103
# Port that receives the data
a2.sources.r1.port = 4142

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
    
# Describe the sink
a2.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Configure flume3.conf

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe the source
a3.sources.r1.type = avro
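# Host that receives the data
a3.sources.r1.bind = hadoop104
# Port that receives the data. The two agents run on different servers, so they can use the same port number.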
a3.sources.r1.port = 4142

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

As before, start the downstream agents on hadoop103 and hadoop104 first:

(hadoop104) bin/flume-ng agent -c conf/ -f job/group4/flume3.conf -n a3 -Dflume.root.logger=INFO,console

(hadoop103) bin/flume-ng agent -c conf/ -f job/group4/flume2.conf -n a2 -Dflume.root.logger=INFO,console

(hadoop102) bin/flume-ng agent -c conf/ -f job/group4/flume1.conf -n a1

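When hadoop103 and hadoop104 print connection logs, the connections are established and testing can begin.

On hadoop102:

nc localhost 44444

Enter some data:

hello world

chenxu qifei

world

Lines containing "hello" are printed on the hadoop103 console; all other lines are printed on the hadoop104 console.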

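Custom Source

Source recap:

A Source is the component that receives data into a Flume Agent. Sources can handle log data of many types and formats, including avro (used between Flume agents), thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

Flume also provides an interface for custom sources: https://flume.apache.org/FlumeDeveloperGuide.html#source

According to the official guide, a custom MySource extends the AbstractSource class and implements the Configurable and PollableSource interfaces. The methods to implement are:

getBackOffSleepIncrement() // not used for now
getMaxBackOffSleepInterval() // not used for now
configure(Context context) // initializes from the context (reads the configuration file)
process() // fetches data, wraps it in an event, and writes it to the channel; this method is called in a loop

Typical use: reading data from MySQL or from other file systems.

Requirements:

Use Flume to receive data, add a prefix to each record, and print the result to the console. The prefix is set in the Flume configuration file.

Here we generate some data ourselves, read it in a loop, and print it to the console. Replacing the generated data with rows fetched through JDBC would later give a MySQL source, and substituting other data gives other functions.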


// Sample from the official Flume Developer Guide:
public class MySource extends AbstractSource implements Configurable, PollableSource {
  private String myProp;

  @Override
  public void configure(Context context) {
    // Read the setting: the first argument is the key, the second is the default value.
    String myProp = context.getString("myProp", "defaultValue");

    // Process the myProp value (e.g. validation, convert to another type, ...)

    // Store myProp for later retrieval by process() method
    this.myProp = myProp;
  }

  // Similar to initialize
  @Override
  public void start() {
    // Initialize the connection to the external client
  }

  // Similar to close
  @Override
  public void stop () {
    // Disconnect from external client and do any additional cleanup
    // (e.g. releasing resources or nulling-out field values) ..
  }

  // This method is called in a loop
  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    try {
      // This try clause includes whatever Channel/Event operations you want to do

      // Receive new data
      // Fetch the data; getSomeData() is a placeholder and the main part to replace.
      Event e = getSomeData();

      // Store the Event into this Source's associated Channel(s)
      getChannelProcessor().processEvent(e);

      status = Status.READY;
    } catch (Throwable t) {
      // Log exception, handle individual exceptions as needed

      // On an exception, back off
      status = Status.BACKOFF;

      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error)t;
      }
    }
    return status;
  }
}

// Custom source
package com.chenxu.Source;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    /* Requirements:
   1. Receive data (generated in a for loop)
   2. Wrap it in an Event
   3. Pass the Event to the Channel
     */

    private String prefix;
    private String subfix;

    // Read the parameters.
    @Override
    public void configure(Context context) {
        // The key can be any string, as long as the property suffix in the conf file matches it (compare k1.type, a1.channels).
        // Choose a name that makes the meaning of the parameter obvious.
        prefix = context.getString("prefix");
        // If the configuration file does not set subfix, the default value "chenxu" is used.
        subfix = context.getString("subfix","chenxu");
    }

    @Override
    public Status process() throws EventDeliveryException {

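        // process() can use the values read in configure().

        Status status = null;
        // In IntelliJ IDEA, Ctrl + Alt + T wraps the selected code (for example in try/catch).
        try {
            // 1. Receive data (generated here)
            for (int i = 0; i < 5; i++) {

                // 2. Build the event. Event is an interface; SimpleEvent and JSONEvent implement it.
                SimpleEvent event = new SimpleEvent();

                // Create the header map (optional; it is not attached to the event in this example).
                HashMap<String, String> headerMap = new HashMap<>();

                // 3. Set the body: a prefix without a default value and a suffix with a default value.
                event.setBody((prefix + "--" + i + "--" + subfix).getBytes());

                // Write the event to the channel.
                getChannelProcessor().processEvent(event);
                // processEvent first runs the interceptors, then checks for null, runs the selector, opens a transaction, puts the event, and commits.
                // Each event gets its own transaction.

                // The transaction succeeded.
                status = Status.READY;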

            }
        } catch (Throwable t) {
            // Log exception, handle individual exceptions as needed

            // On an exception, back off
            status = Status.BACKOFF;

            // re-throw all Errors
            if (t instanceof Error) {
                throw (Error)t;
            }
        }

        // Run once every 2 seconds.
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return status;
    }

    // Not used in this example.
    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    // Not used in this example.
    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }


}

Create mysource.conf in the job directory and add the following content:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.chenxu.Source.MySource
a1.sources.r1.prefix = feiji
#a1.sources.r1.subfix= xiaxian

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel    
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

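Start the agent with this configuration file:

bin/flume-ng agent -c conf/ -f job/mysource.conf -n a1 -Dflume.root.logger=INFO,console

Check that the messages are printed automatically.

Note: if a message is too long, it is not printed in full. To raise the limit, set the Logger Sink parameter maxBytesToLog, whose default value is 16.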
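
The effect of that limit on the bodies produced by MySource can be sketched as below. This is an illustration only: the real Logger Sink prints a hex dump next to the text, and the names LoggerTruncationSketch and visibleBody are hypothetical. The suffix xiaxian is the value that appears commented out in mysource.conf.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of cutting a body at maxBytesToLog; not the Logger Sink itself.
class LoggerTruncationSketch {
    // Returns the part of the body that fits within maxBytesToLog bytes.
    static String visibleBody(String body, int maxBytesToLog) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        byte[] shown = Arrays.copyOf(bytes, Math.min(bytes.length, maxBytesToLog));
        return new String(shown, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 16 bytes: fits exactly within the default limit.
        System.out.println(visibleBody("feiji--0--chenxu", 16));  // feiji--0--chenxu
        // 17 bytes: the last byte is cut off.
        System.out.println(visibleBody("feiji--0--xiaxian", 16)); // feiji--0--xiaxia
    }
}
```

With the prefix feiji and the default suffix chenxu the body is exactly 16 bytes long, so it is printed in full; a longer suffix or a two-digit counter would be cut.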

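Custom Sink

Sink recap:

A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system or sends them to another Flume Agent (avro).
A Sink is fully transactional. Before removing a batch of data from the Channel, the Sink starts a transaction with the Channel. Once the batch has been written successfully to the storage system or to the next Flume Agent, the Sink commits the transaction, and the Channel then deletes the events from its internal buffer.
Sink destinations include hdfs, logger (prints to the console), avro (used between Flume agents), thrift, ipc, file, null, HBase, solr, and custom sinks.
A Channel is the buffer between a Source and a Sink, so it lets the two run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Requirements:

Use Flume to receive data, add a prefix and a suffix to each record at the Sink, and print the result to the console. The prefix and suffix are set in the Flume job configuration file.

Note: in a custom Source, the ChannelProcessor takes care of the transaction steps (the checks, the put, and the commit). A custom Sink has no such helper, so the overridden process() method must handle the transaction itself.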
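
The take, commit, and rollback cycle that a sink has to drive can be modeled without Flume. The sketch below is a simplified, single-threaded stand-in: the class SinkTransactionSketch and its methods are hypothetical, and a real Channel hands out separate Transaction objects instead of tracking the taken events itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical single-threaded model of a channel with one open transaction.
class SinkTransactionSketch {
    final Deque<String> queue = new ArrayDeque<>();
    private final List<String> taken = new ArrayList<>();

    // Removes the next event but remembers it until commit or rollback.
    String take() {
        String event = queue.poll();
        if (event != null) {
            taken.add(event);
        }
        return event;
    }

    // The events were delivered, so they can be forgotten.
    void commit() {
        taken.clear();
    }

    // Delivery failed, so the events go back to the head of the queue in order.
    void rollback() {
        for (int i = taken.size() - 1; i >= 0; i--) {
            queue.addFirst(taken.get(i));
        }
        taken.clear();
    }

    // Returns true when an event was delivered, false when the sink should back off.
    boolean process(Consumer<String> destination) {
        try {
            String event = take();
            if (event == null) {
                commit();
                return false;
            }
            destination.accept(event);
            commit();
            return true;
        } catch (RuntimeException ex) {
            rollback();
            return false;
        }
    }
}
```

If the destination throws, the event is still in the queue afterwards and is offered again on the next call, which is the guarantee the rollback step exists to provide.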

The code is in the com.chenxu.MySink package.

public class MySink extends AbstractSink implements Configurable {
  private String myProp;

  @Override
  public void configure(Context context) {
    String myProp = context.getString("myProp", "defaultValue");

    // Process the myProp value (e.g. validation)

    // Store myProp for later retrieval by process() method
    this.myProp = myProp;
  }

  // The following two methods do not have to be overridden.
  @Override
  public void start() {
    // Initialize the connection to the external repository (e.g. HDFS) that
    // this Sink will forward Events to ..
  }

  @Override
  public void stop () {
    // Disconnect from the external respository and do any
    // additional cleanup (e.g. releasing resources or nulling-out
    // field values) ..
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    // Start transaction
    Channel ch = getChannel();
    Transaction txn = ch.getTransaction();
    txn.begin();
    try {
      // This try clause includes whatever Channel operations you want to do

      // Unlike a custom Source, a Sink can only take its data from the channel.
      Event event = ch.take();

      // Send the Event to the external repository.
      // storeSomeData(e);

      txn.commit();
      status = Status.READY;
    } catch (Throwable t) {
      txn.rollback();

      // Log exception, handle individual exceptions as needed

      status = Status.BACKOFF;

      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error)t;
      }
    } finally {
      // Always close the transaction, whether it was committed or rolled back.
      txn.close();
    }
    return status;
  }
}


// Custom sink
package com.chenxu.MySink;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    // Logger used to print the output
    private static final Logger logger = LoggerFactory.getLogger(MySink.class);

    // Configuration parameters
    private String prefix;
    private String subfix;
    @Override
    public void configure(Context context) {

        prefix = context.getString("prefix");
        subfix = context.getString("subfix","chenxu");

    }

    /*
    1. Get the Channel
    2. Get a transaction and the data from the Channel
    3. Send the data
     */

    @Override
    public Status process() throws EventDeliveryException {
        Status status = null;

        // Start transaction
        Channel channel = getChannel();
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {
            // Unlike a custom Source, a custom Sink can only take data from the channel.
            Event event = channel.take();

            // The business logic goes here; it is the core of a custom sink.
            if (event != null) {
                // Build the output from the event body.
                String body = prefix + new String(event.getBody()) + subfix;
                // Print it.
                logger.info(body);
                status = Status.READY;
            } else {
                // The channel is empty: there is nothing to send, so back off.
                status = Status.BACKOFF;
            }

            transaction.commit();
        } catch (Throwable t) {
            // Roll back on any exception.
            transaction.rollback();

            // Log the exception here and handle individual exceptions as needed.

            status = Status.BACKOFF;

            // re-throw all Errors
            if (t instanceof Error) {
                throw (Error)t;
            }
        }finally {
            transaction.close();

        }
        return status;
    }
}

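Test:

1. Package
Package the finished code as a jar and put it in Flume's lib directory (/opt/module/flume/lib).

2. Configuration file

Create mysink.conf in the job directory and add the following content: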

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
    
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
    
# Describe the sink
a1.sinks.k1.type = com.chenxu.MySink.MySink
a1.sinks.k1.prefix = feiji
# The key must match the name read in configure(), which is subfix.
a1.sinks.k1.subfix = jiangluo
    
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
    
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

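Start the test: bin/flume-ng agent -c conf/ -f job/mysink.conf -n a1 -Dflume.root.logger=INFO,console (INFO can also be changed to ERROR, which prints only messages at ERROR level or more severe; this sink's own output is logged at INFO level and would then no longer be shown.)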
