Flume Overview and Configuration
Official site: http://flume.apache.org/
What is Flume
Flume is a distributed data-collection framework.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
collecting — the data source (source)
aggregating — the buffer (channel)
moving — the delivery component (sink)
Learning Flume is essentially learning the combinations of source, channel, and sink.
Flume is a framework, and the framework itself defines no fixed combination of source, channel, and sink. To make it use a particular combination, we must describe that combination in a configuration file.
In other words, learning Flume is learning how to configure source/channel/sink combinations.
Once data stored in a channel has been taken by a sink, it is removed from the channel.
The channel is passive: it only stores data.
Event and agent
A Flume event is defined as a unit of data flow with a byte payload and an optional set of string attributes (headers).
Flume event = payload (the data) + attributes (headers)
A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
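The event abstraction above can be sketched in plain Java. This is an illustrative stand-in, not the real org.apache.flume.Event interface; the class name and fields here are assumptions for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of a Flume event: a byte payload plus optional string headers.
public class SimpleEvent {
    private final byte[] body;                 // the payload (the data)
    private final Map<String, String> headers; // optional string attributes

    public SimpleEvent(byte[] body, Map<String, String> headers) {
        this.body = body;
        this.headers = headers;
    }

    public byte[] getBody() { return body; }
    public Map<String, String> getHeaders() { return headers; }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("hostname", "hadoop101");
        SimpleEvent e = new SimpleEvent("hello world".getBytes(), headers);
        System.out.println(new String(e.getBody()) + " " + e.getHeaders());
    }
}
```

Interceptors (covered later) work by reading and rewriting exactly these two parts of an event.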
Flume Sources
NetCat Source
Collect data from the network and log it to the console
- Write the configuration file in the /home/hadoop/apps/flume/conf directory
vim flume-net-log.conf
# A flume agent needs a source, a channel, and a sink
# One agent can have multiple sources, channels, and sinks,
# so each source, channel, and sink needs a name
# Define the source, channel, and sink
# a1 is the agent's name
# r1 is the source's name
# c1 is the channel's name
# k1 is the sink's name
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source type
# The netcat source acts as the server, listening on the host/port below; the netcat tool must be installed to send data to it
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
# Configure the sink
a1.sinks.k1.type = logger
# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start flume with the configuration file above
flume-ng agent --conf ./ --conf-file ./flume-net-log.conf --name a1 -Dflume.root.logger=INFO,console
## Shortened form of the command
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-net-log.conf -Dflume.root.logger=INFO,console
Fix for a broken yum repository
wget -O /etc/yum.repos.d/CentOS-Base.repo http://file.kangle.odata.cc/repo/Centos6.repo
wget -O /etc/yum.repos.d/epel.repo http://file.kangle.odata.cc/repo/epel6.repo
yum makecache
Installing NetCat
- Extract the netcat package netcat-0.7.1.tar.gz (it extracts into the current directory and still needs to be compiled and installed)
[hadoop@hadoop101 installPkg]$ tar -zxvf netcat-0.7.1.tar.gz
- Configure the install path
[hadoop@hadoop101 netcat-0.7.1]$ ./configure --prefix=/home/hadoop/apps/netcat/
- Compile and install (the src directory contains C code, so gcc must be installed first)
[hadoop@hadoop101 src]$ make && make install
- Configure environment variables
[hadoop@hadoop101 bin]$ sudo vim /etc/profile
## netcat environment variables
export NETCAT_HOME=/home/hadoop/apps/netcat
export PATH=$PATH:$NETCAT_HOME/bin
[hadoop@hadoop101 bin]$ . /etc/profile
- Start netcat as a socket client (hadoop101 and port 6666 come from the flume-net-log.conf configuration file created above)
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello world
OK
- Console output
2020-12-17 14:26:35,639 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/192.168.152.81:6666]
2020-12-17 14:27:13,644 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)]
Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 hello world }
Exec Source
Monitor a file and log its data to the console
The exec source runs a given command (here tail -F) and turns each output line into an event.
vim flume-exec-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/access.log
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = logger
# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Write data to access.log and the agent will print the corresponding events on the console
[hadoop@hadoop101 data]$ echo java >> access.log
2020-12-17 15:10:54,796 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SOURCE, name: r1 started
2020-12-17 15:11:24,807 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)]
Event: { headers:{} body: 6A 61 76 61 java }
Spooling Directory Source
- Unlike the Exec source, this source is reliable and will not lose data, even if Flume is restarted or killed.
- Files placed in the directory must be immutable and uniquely named.
- After a file has been fully ingested, the source renames it (a completion suffix is appended).
flume-spool-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/data/flumeSpool
a1.sources.r1.fileHeader = true
a1.sources.r1.basenameHeader = true
## Ignore files ending in .tmp
a1.sources.r1.ignorePattern = ^.*\\.tmp$
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = logger
# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
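The ignorePattern above is an ordinary Java regular expression. A quick standalone check of what it matches (file names here are hypothetical examples):

```java
import java.util.regex.Pattern;

public class IgnorePatternCheck {
    public static void main(String[] args) {
        // Same pattern as in the spooldir config: any name ending in .tmp is skipped.
        Pattern ignore = Pattern.compile("^.*\\.tmp$");
        System.out.println(ignore.matcher("access.log.tmp").matches()); // true  -> file is ignored
        System.out.println(ignore.matcher("access.log").matches());     // false -> file is collected
    }
}
```

Note that in the properties file the backslash is doubled (`\\.tmp`) because the properties loader consumes one level of escaping.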
Taildir Source – key topic
A data source added in Flume 1.7.
Watches the specified files and, as soon as new lines are appended to any of them, tails them in near real time. If new lines are still being written, the source retries reading them until the write completes.
This source is reliable and will not lose data even across rotation of the tailed files (e.g., a file is renamed and replaced while data keeps being written).
It periodically writes the last read position of each file, in JSON format, to a given position file. If Flume stops or goes down for any reason, it can resume tailing from the positions recorded in the existing position file.
This source does not rename, delete, or otherwise modify the files it tails. It currently does not support tailing binary files; it reads text files line by line.
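With the two file groups in the configuration below, the position file would look roughly like this (a JSON array with one entry per tailed file; the inode and pos values here are illustrative):

```json
[{"inode": 2496275, "pos": 11, "file": "/home/hadoop/data/word.txt"},
 {"inode": 2496276, "pos": 0,  "file": "/home/hadoop/data/wc.txt"}]
```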
flume-taildir-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/apps/flume/conf/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/hadoop/data/word.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /home/hadoop/data/wc.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.fileHeader = true
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = logger
# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Flume Channels
Memory Channel
Events are stored in an in-memory queue. This allows high throughput, but data is lost if the Flume process fails.
File Channel
https://blogs.apache.org/flume/entry/apache_flume_filechannel
MemoryChannel provides high throughput but loses data in the event of a crash or power loss, so a durable channel is needed.
FileChannel's goal is to provide a reliable, high-throughput channel: it guarantees that once a transaction is committed, no data is lost due to a subsequent crash or power loss.
taildir-file-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/apps/flume/conf/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/hadoop/data/word.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /home/hadoop/data/wc.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.fileHeader = true
# Configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/apps/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/apps/flume/data
# Configure the sink
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Flume Sinks
HDFS Sink
This sink writes events to the Hadoop Distributed File System (HDFS).
It currently supports creating text and sequence files, with optional compression for both file types.
Files can be rolled (the current file is closed and a new one created) periodically based on elapsed time, data size, or number of events.
net-mem-hdfs.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
# Directory rolling configuration
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling configuration
# Roll by time (seconds); 0 disables this trigger
a1.sinks.k1.hdfs.rollInterval = 30
# Roll by size (bytes); 0 disables this trigger
a1.sinks.k1.hdfs.rollSize = 1024
# Roll by event count; 0 disables this trigger
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Configure the file type
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
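The three roll* settings above are independent triggers: whichever threshold is reached first closes the current file and opens a new one. A minimal sketch of that decision (a simplified model, not the sink's actual code):

```java
public class RollDecision {
    // Values mirror the config above: rollInterval=30 s, rollSize=1024 bytes, rollCount=10 events.
    static final long ROLL_INTERVAL_SEC = 30;
    static final long ROLL_SIZE_BYTES = 1024;
    static final long ROLL_COUNT = 10;

    // A value of 0 disables the corresponding trigger; the first trigger reached rolls the file.
    static boolean shouldRoll(long elapsedSec, long bytesWritten, long eventCount) {
        if (ROLL_INTERVAL_SEC > 0 && elapsedSec >= ROLL_INTERVAL_SEC) return true;
        if (ROLL_SIZE_BYTES > 0 && bytesWritten >= ROLL_SIZE_BYTES) return true;
        return ROLL_COUNT > 0 && eventCount >= ROLL_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(shouldRoll(5, 100, 3));   // false: no threshold reached yet
        System.out.println(shouldRoll(5, 2048, 3));  // true: the size threshold fired
    }
}
```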
File Roll Sink
Stores events on the local file system.
net-mem-file.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data/flume
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
AsyncHBaseSink
This sink writes data to HBase using an asynchronous model.
net-mem-hbase.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Chaining flume agents
One source feeding multiple channels and sinks is called fan-out;
multiple sources feeding one channel and one sink is called fan-in;
but a single flow cannot have multiple sources feeding multiple channels and multiple sinks at the same time.
multi-agent flow
The first flume agent runs on hadoop101
The second flume agent runs on hadoop102
Note: start the flume agent on hadoop102 first (the avro sink needs a listening avro source to connect to)
net-mem-avro.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4545
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
avro-mem-log.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 4545
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Multiplexing the flow (fan-out: one source replicated to multiple channels)
net-channels-sinks.conf
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = k1 k2 k3
# Configure the source type
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channels
a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.channels.c3.type = memory
# Configure the sinks
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
# Directory rolling configuration
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling configuration
# Roll by time (seconds); 0 disables this trigger
a1.sinks.k1.hdfs.rollInterval = 30
# Roll by size (bytes); 0 disables this trigger
a1.sinks.k1.hdfs.rollSize = 1024
# Roll by event count; 0 disables this trigger
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Configure the file type
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k2.type = logger
a1.sinks.k3.type = avro
a1.sinks.k3.hostname = hadoop102
a1.sinks.k3.port = 4545
# Which channels the source writes to
# Which channel each sink takes data from
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
Interceptors
Timestamp Interceptor
This interceptor inserts into the event headers the time, in milliseconds, at which it processed the event.
net-timestamp.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source type
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = logger
# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Host Interceptor
net-timestamp-host.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i2.useIP = false
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Static Interceptor
Adds custom, static header information to every event.
net-static.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = author
a1.sources.r1.interceptors.i1.value = lee
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Custom Interceptor
- Write a class implementing Interceptor that converts file data into JSON (modeled on the source code of the HostInterceptor class)
Add the dependencies first
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.7.0</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.72</version>
</dependency>
package com.bigdata.demo;

import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.List;

public class LogInterceptor implements Interceptor {

    private String colName;
    private String separator;
    private HashMap<String, Object> map;

    private LogInterceptor(String colName, String separator) {
        this.colName = colName;
        this.separator = separator;
    }

    @Override
    public void initialize() {
        map = new HashMap<>();
    }

    @Override
    public Event intercept(Event event) {
        map.clear();
        byte[] body = event.getBody();
        try {
            String data = new String(body, "UTF-8");
            String[] datas = data.split(separator);
            String[] fields = colName.split(",");
            // Drop events whose field count does not match the configured columns
            if (fields.length != datas.length) {
                return null;
            }
            for (int i = 0; i < datas.length; i++) {
                map.put(fields[i], datas[i]);
            }
            // Convert the map to a JSON string
            String json = JSONObject.toJSONString(map);
            // Set the JSON string as the new event body
            event.setBody(json.getBytes("UTF-8"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = Lists.newArrayList();
        for (Event event : events) {
            Event outEvent = intercept(event);
            if (outEvent != null) {
                out.add(outEvent);
            }
        }
        return out;
    }

    @Override
    public void close() {
        // no-op
    }

    public static class Builder implements Interceptor.Builder {
        private String colName;
        private String separator;

        @Override
        public Interceptor build() {
            return new LogInterceptor(colName, separator);
        }

        @Override
        public void configure(Context context) {
            colName = context.getString("colName", "");
            separator = context.getString("separator", " ");
        }
    }
}
- Package the code into a jar and add it to flume's lib directory
- Write the configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the interceptor
a1.sources.r1.interceptors = i1
# The fully qualified name refers to the compiled class file, not the java file; an inner class is referenced with a $ sign
a1.sources.r1.interceptors.i1.type = com.bigdata.demo.LogInterceptor$Builder
a1.sources.r1.interceptors.i1.colName = id,name,age
a1.sources.r1.interceptors.i1.separator = ,
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
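With colName = id,name,age and separator ",", the interceptor turns a line such as 1,tom,20 into a JSON object. The core mapping can be reproduced without the fastjson dependency; this is a simplified standalone sketch that builds the JSON string by hand (the sample data is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineToJson {
    // Same idea as LogInterceptor.intercept: split the line, zip with column names, emit JSON.
    static String toJson(String line, String colName, String separator) {
        String[] values = line.split(separator);
        String[] fields = colName.split(",");
        if (fields.length != values.length) return null; // drop malformed lines
        Map<String, String> map = new LinkedHashMap<>(); // preserve column order
        for (int i = 0; i < fields.length; i++) map.put(fields[i], values[i]);
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (sb.length() > 1) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"").append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("1,tom,20", "id,name,age", ","));
        // {"id":"1","name":"tom","age":"20"}
    }
}
```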
Custom HBase Serializer
- Implement AsyncHbaseEventSerializer to write data into an HBase table (modeled on the source code of the SimpleAsyncHbaseEventSerializer class)
Add the dependency first
<dependency>
    <groupId>org.apache.flume.flume-ng-sinks</groupId>
    <artifactId>flume-ng-hbase-sink</artifactId>
    <version>1.7.0</version>
</dependency>
package com.bigdata.demo;

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

import java.util.ArrayList;
import java.util.List;

public class LogHbaseEventSerializer implements AsyncHbaseEventSerializer {

    private byte[] table;
    private byte[] cf;
    private byte[] payload;
    private byte[] incrementColumn;
    private byte[] incrementRow;
    private String separator;
    private String pCol;

    @Override
    public void initialize(byte[] table, byte[] cf) {
        this.table = table;
        this.cf = cf;
    }

    @Override
    public List<PutRequest> getActions() {
        List<PutRequest> actions = new ArrayList<PutRequest>();
        if (pCol != null) {
            byte[] rowKey;
            try {
                // Use the user id from the collected data as the rowkey
                String data = new String(payload);
                String[] strings = data.split(separator);
                String[] fields = pCol.split(",");
                if (strings.length != fields.length) {
                    return actions;
                }
                String id = strings[0];
                rowKey = id.getBytes("UTF-8");
                // One PutRequest per column: same rowkey, column name from pCol, value from the data
                for (int i = 0; i < strings.length; i++) {
                    PutRequest putRequest = new PutRequest(table, rowKey, cf,
                            fields[i].getBytes("UTF-8"), strings[i].getBytes("UTF-8"));
                    actions.add(putRequest);
                }
            } catch (Exception e) {
                throw new FlumeException("Could not get row key!", e);
            }
        }
        return actions;
    }

    @Override
    public List<AtomicIncrementRequest> getIncrements() {
        List<AtomicIncrementRequest> actions = new ArrayList<AtomicIncrementRequest>();
        if (incrementColumn != null) {
            AtomicIncrementRequest inc = new AtomicIncrementRequest(table,
                    incrementRow, cf, incrementColumn);
            actions.add(inc);
        }
        return actions;
    }

    @Override
    public void cleanUp() {
        // no-op
    }

    @Override
    public void configure(Context context) {
        // HBase column names
        pCol = context.getString("colName", "pCol");
        // Separator of the data collected by flume
        separator = context.getString("separator", ",");
        String iCol = context.getString("incrementColumn", "iCol");
        if (iCol != null && !iCol.isEmpty()) {
            incrementColumn = iCol.getBytes(Charsets.UTF_8);
        }
        incrementRow = context.getString("incrementRow", "incRow").getBytes(Charsets.UTF_8);
    }

    @Override
    public void setEvent(Event event) {
        this.payload = event.getBody();
    }

    @Override
    public void configure(ComponentConfiguration conf) {
        // no-op
    }
}
- Package the code into a jar and add it to flume's lib directory
- Write the configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = myhbase
a1.sinks.k1.columnFamily = c
a1.sinks.k1.serializer = com.bigdata.demo.LogHbaseEventSerializer
a1.sinks.k1.serializer.colName = id,name,age
a1.sinks.k1.serializer.separator = ,
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
How the Agent Works Internally
Flume Failover and Load Balancing
Using sink groups
Failover
A sink group is bound to a single channel, and only one sink in the group takes data at a time. If that sink fails, another sink in the group takes over.
Each sink has an associated priority; the larger the number, the higher the priority.
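Priority-based selection boils down to: among the sinks that have not failed, pick the one with the largest priority. A simplified model of the failover processor (not the real Flume class; priorities match the configuration that follows):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FailoverPick {
    // Among sinks not currently marked as failed, return the one with the highest priority.
    static String pick(Map<String, Integer> priorities, Set<String> failed) {
        String best = null;
        for (Map.Entry<String, Integer> e : priorities.entrySet()) {
            if (failed.contains(e.getKey())) continue;
            if (best == null || e.getValue() > priorities.get(best)) best = e.getKey();
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("k1", 50);
        p.put("k2", 10);
        System.out.println(pick(p, Set.of()));     // k1: highest priority, still healthy
        System.out.println(pick(p, Set.of("k1"))); // k2: k1 has failed, so traffic fails over
    }
}
```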
failover.conf:
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Configure the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4545
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /home/hadoop/data/flume
# Failover configuration
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 50
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Load balancing
The load-balancing sink processor provides the ability to balance the flow over multiple sinks.
It maintains an indexed list of active sinks over which the load must be distributed.
The implementation supports distributing the load with either a round_robin or a random selection mechanism.
The selection mechanism defaults to round_robin but can be overridden via configuration.
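The round_robin mechanism simply cycles through the sink list, one event batch per sink in turn. A minimal sketch (an illustrative model, not Flume's actual selector class):

```java
public class RoundRobinSelector {
    private final String[] sinks;
    private int next = 0;

    RoundRobinSelector(String... sinks) { this.sinks = sinks; }

    // Each call returns the next sink in the list, wrapping back to the start.
    String select() {
        String s = sinks[next];
        next = (next + 1) % sinks.length;
        return s;
    }

    public static void main(String[] args) {
        RoundRobinSelector sel = new RoundRobinSelector("k1", "k2");
        System.out.println(sel.select()); // k1
        System.out.println(sel.select()); // k2
        System.out.println(sel.select()); // k1 (wrapped around)
    }
}
```

With backoff enabled (as in the configuration below), a failing sink is temporarily skipped instead of being retried on every rotation.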
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666
# Configure the channel
a1.channels.c1.type = memory
# Load-balancing configuration
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
# Configure the sinks
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data/flume01
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /home/hadoop/data/flume
# Bind the channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1