Flume Introduction and Configuration

Official website: http://flume.apache.org/

What is Flume

Flume is a distributed data-collection framework.

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Collecting: handled by the source (the data source)
Aggregating: buffered in the channel (the store)
Moving: handled by the sink

Learning Flume essentially means learning how sources, channels, and sinks are combined.

Flume is a framework, and the framework itself defines no particular source-channel-sink combination; to make it use one, we have to describe that combination to the framework in a configuration file.

So learning Flume really means learning how to configure source-channel-sink combinations.

Once data stored in a channel has been taken by a sink, it is gone from the channel.

The channel is passive: it is only responsible for storing data.

Event and agent

A Flume event is defined as a unit of data flow with a byte payload and an optional set of string attributes (headers).

Flume event = payload (data) + headers (attributes)

A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

Flume Sources

NetCat Source

Collect network data and log it to the console
  1. Create the configuration file in the /home/hadoop/apps/flume/conf directory
    vim flume-net-log.conf
# Flume requires a source, a channel, and a sink to be configured
# A single Flume agent can contain multiple sources, channels, and sinks
# so each source, channel, and sink needs a name
# Define the source, channel, and sink
# a1 is the name of the agent
# r1 is the name of the source
# c1 is the name of the channel
# k1 is the name of the sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure what kind of data source this is
# The netcat source acts as the server side; the netcat client must be installed to send data to it
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  2. Start Flume with the configuration file above
flume-ng agent --conf ./ --conf-file ./flume-net-log.conf --name a1 -Dflume.root.logger=INFO,console

## Simplified command
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-net-log.conf -Dflume.root.logger=INFO,console

Fix for broken yum repositories

wget -O /etc/yum.repos.d/CentOS-Base.repo http://file.kangle.odata.cc/repo/Centos6.repo
wget -O /etc/yum.repos.d/epel.repo http://file.kangle.odata.cc/repo/epel6.repo
yum makecache
Install NetCat
  1. Extract the netcat package netcat-0.7.1.tar.gz (it extracts into the current directory and still needs to be compiled and installed)
[hadoop@hadoop101 installPkg]$ tar -zxvf netcat-0.7.1.tar.gz
  2. Configure the installation path
[hadoop@hadoop101 netcat-0.7.1]$ ./configure --prefix=/home/hadoop/apps/netcat/
  3. Compile and install (the src directory contains C code, so gcc must be installed before compiling)
[hadoop@hadoop101 src]$ make && make install
  4. Configure the environment variables
[hadoop@hadoop101 bin]$ sudo vim /etc/profile
## netcat environment variables
export NETCAT_HOME=/home/hadoop/apps/netcat
export PATH=$PATH:$NETCAT_HOME/bin

[hadoop@hadoop101 bin]$ . /etc/profile
  5. Start netcat as a socket client (hadoop101 and port 6666 come from the flume-net-log.conf configuration file created above)
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello world
OK
  6. Console output
2020-12-17 14:26:35,639 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/192.168.152.81:6666]
2020-12-17 14:27:13,644 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world }

Exec Source

Monitor file data and log it to the console

The exec data source
vim flume-exec-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/access.log

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Append data to access.log and the Flume console will print the corresponding events

[hadoop@hadoop101 data]$ echo java >> access.log
2020-12-17 15:10:54,796 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SOURCE, name: r1 started
2020-12-17 15:11:24,807 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
Event: { headers:{} body: 6A 61 76 61                                     java }

Spooling Directory Source

  • Unlike the Exec source, this source is reliable and will not lose data, even if Flume is restarted or killed.
  • Files placed in the spool directory must be immutable and uniquely named.
  • Once a file has been fully ingested it is renamed (by default with a .COMPLETED suffix); see the usage sketch after the configuration below.

flume-spool-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/data/flumeSpool
a1.sources.r1.fileHeader = true
a1.sources.r1.basenameHeader = true

## Ignore files ending in .tmp
a1.sources.r1.ignorePattern = ^.*\\.tmp$

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
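
A minimal usage sketch (word.txt is just an example file to copy in; the flumeSpool path comes from the configuration above): the spool directory must exist before the agent starts, and once Flume has fully consumed a file it renames it, by default with a .COMPLETED suffix.
# the spool directory must exist before the agent starts
[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flumeSpool
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-spool-log.conf -Dflume.root.logger=INFO,console

# in another terminal, drop a file into the spool directory
[hadoop@hadoop101 ~]$ cp word.txt /home/hadoop/data/flumeSpool/
# after the file has been fully ingested it is renamed (default suffix .COMPLETED)
[hadoop@hadoop101 ~]$ ls /home/hadoop/data/flumeSpool/
word.txt.COMPLETED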

Taildir Source – important

A data source added in Flume 1.7.

It watches the specified files and, once it detects new lines appended to each file, tails them in near real time. If new lines are still being written, the source keeps retrying until the write is complete.

This source is reliable and will not lose data even when the tailed files are rotated (i.e., a log file is renamed and writing continues under the original file name).

It periodically writes the last read position of each file, in JSON format, to a given position file. If Flume stops or goes down for some reason, it can resume tailing from the positions recorded in the existing position file.

This source does not rename, delete, or otherwise modify the files it tails. It currently does not support tailing binary files; it reads text files line by line.

flume-taildir-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/apps/flume/conf/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/hadoop/data/word.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /home/hadoop/data/wc.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.fileHeader = true

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
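
To see the position tracking in action, append a line to one of the tailed files and then look at the position file (a sketch; word.txt belongs to filegroup f1 above, and the JSON shown is only illustrative of the format):
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f flume-taildir-log.conf -Dflume.root.logger=INFO,console

# in another terminal
[hadoop@hadoop101 ~]$ echo "hello taildir" >> /home/hadoop/data/word.txt
# the last read position of each tailed file is persisted as JSON (values below are illustrative)
[hadoop@hadoop101 ~]$ cat /home/hadoop/apps/flume/conf/taildir_position.json
[{"inode":1054321,"pos":14,"file":"/home/hadoop/data/word.txt"}]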

Flume Channels

Memory Channel

Events are stored in an in-memory queue. This gives high throughput, but if the Flume process fails the data in the channel is lost.

File Channel

https://blogs.apache.org/flume/entry/apache_flume_filechannel

MemoryChannel provides high throughput but loses data in the event of a crash or power loss, so a durable channel is needed.

The goal of FileChannel is to provide a reliable, high-throughput channel. FileChannel guarantees that once a transaction has been committed, no data is lost due to a subsequent crash or power failure.

taildir-file-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/apps/flume/conf/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/hadoop/data/word.txt
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /home/hadoop/data/wc.txt
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.fileHeader = true

# Configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/apps/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/apps/flume/data

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
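
A quick way to confirm the durable channel is in use (a sketch; the paths are the ones configured above): after the agent has processed a few events, Flume's checkpoint and data log files should appear in the two directories, and events committed to them survive an agent restart.
[hadoop@hadoop101 ~]$ ls /home/hadoop/apps/flume/checkpoint
[hadoop@hadoop101 ~]$ ls /home/hadoop/apps/flume/data
# the data directory holds the channel's write-ahead log files (typically named log-<N>)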

Flume Sinks

HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS).

It currently supports creating text and sequence files, and supports compression for both file types.

Files can be rolled periodically (the current file is closed and a new one created) based on elapsed time, data size, or number of events.

net-meme-hdfs.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-

# Directory rolling configuration
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# File rolling configuration
# Roll based on time (seconds); 0 disables
a1.sinks.k1.hdfs.rollInterval = 30

# Roll based on size (bytes); 0 disables
a1.sinks.k1.hdfs.rollSize = 1024
# Roll based on event count; 0 disables
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type configuration
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
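
An end-to-end check (a sketch; it assumes HDFS is running and the Flume user can write to /flume/events): start the agent, push a few lines through netcat, wait for a roll (30 seconds, 1024 bytes, or 10 events per the settings above), then list the output.
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-meme-hdfs.conf -Dflume.root.logger=INFO,console

# in another terminal
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello hdfs sink
OK
# rolled files appear under /flume/events/<yy-mm-dd>/<HHMM>/ (files still being written carry a .tmp suffix)
[hadoop@hadoop101 ~]$ hdfs dfs -ls -R /flume/events
[hadoop@hadoop101 ~]$ hdfs dfs -cat /flume/events/*/*/events-*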

File Roll Sink

Stores events on the local filesystem.

net-mem-file.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data/flume

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
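
A usage sketch: the file_roll sink writes into an existing directory (create it first) and by default starts a new output file roughly every 30 seconds (sink.rollInterval).
[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flume
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-mem-file.conf -Dflume.root.logger=INFO,console

# send some data, then check the rolled files
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello file roll
OK
[hadoop@hadoop101 ~]$ ls -l /home/hadoop/data/flume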

AsyncHBaseSink

This sink writes data to HBase using an asynchronous model.

net-mem-hbase.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
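
The target table and column family must already exist in HBase before the agent starts; the sink does not create them. A sketch using the names from the configuration above:
[hadoop@hadoop101 ~]$ hbase shell
hbase(main):001:0> create 'foo_table', 'bar_cf'
hbase(main):002:0> exit

[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-mem-hbase.conf -Dflume.root.logger=INFO,console
# send a line through nc hadoop101 6666, then scan 'foo_table' in the hbase shell to see the written cell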

Chaining Flume agents

In Flume, one source feeding multiple channels and multiple sinks is called fan-out;

multiple sources feeding a single channel and a single sink is called fan-in;

but a single flow cannot combine multiple sources with multiple channels and multiple sinks at the same time.

multi-agent flow

(Figure: multi-agent flow diagram)

The first Flume agent runs on hadoop101
The second Flume agent runs on hadoop102

Note: start the Flume agent on hadoop102 first

net-mem-avro.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4545

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

avro-mem-log.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 4545

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
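
A startup sketch for the two agents: the avro sink on hadoop101 connects to the avro source on hadoop102, so the downstream agent has to be listening first.
# 1) on hadoop102: start the downstream agent (avro source -> logger)
[hadoop@hadoop102 conf]$ flume-ng agent -n a1 -c ./ -f avro-mem-log.conf -Dflume.root.logger=INFO,console

# 2) on hadoop101: start the upstream agent (netcat source -> avro sink)
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-mem-avro.conf -Dflume.root.logger=INFO,console

# 3) send data to hadoop101; the events are printed by the logger sink on hadoop102
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello avro
OK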

Multiplexing the flow

(Figure: one source fanning out to multiple channels and sinks)

net-channels-sinks.conf

a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = k1 k2 k3

# Configure what kind of data source this is
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channels
a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.channels.c3.type = memory

# Configure the sinks
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = events-
# Directory rolling configuration
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# File rolling configuration
# Roll based on time (seconds); 0 disables
a1.sinks.k1.hdfs.rollInterval = 30
# Roll based on size (bytes); 0 disables
a1.sinks.k1.hdfs.rollSize = 1024
# Roll based on event count; 0 disables
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type configuration
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sinks.k2.type = logger

a1.sinks.k3.type = avro
a1.sinks.k3.hostname = hadoop102
a1.sinks.k3.port = 4545

# Which channels the source writes to
# Which channel each sink takes data from
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
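
No channel selector is configured on r1, so the default replicating selector applies and every event is copied to all three channels. A test sketch (it assumes HDFS is up and the avro agent on hadoop102 from the previous section is already running, since k3 sends to it):
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f net-channels-sinks.conf -Dflume.root.logger=INFO,console

[hadoop@hadoop101 ~]$ nc hadoop101 6666
hello fan out
OK
# the same event should now show up in three places:
#   k1: a rolled file under /flume/events (hdfs dfs -ls -R /flume/events)
#   k2: the logger output on this agent's console
#   k3: the logger output of the agent running on hadoop102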

Interceptors

Timestamp Interceptor

This interceptor inserts into the event headers the time (in milliseconds) at which it processed the event.

net-timestamp.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure what kind of data source this is
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

# Which channel the source writes to
# Which channel the sink takes data from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Host Interceptor

net-timestamp-host.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
a1.sources.r1.interceptors.i2.useIP = false

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
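
With both interceptors enabled, every event logged by k1 should carry a timestamp header and a hostname header (useIP = false, so the host name rather than the IP address is used). A sketch of what to expect; the header values are illustrative:
# start the agent with net-timestamp-host.conf, then in another terminal:
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hi host
OK
# illustrative logger output -- the actual timestamp value will differ
Event: { headers:{timestamp=1608190000000, hostname=hadoop101} body: 68 69 20 68 6F 73 74    hi host }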

Static Interceptor

Adds custom (static) information to the event headers.

net-static.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = author
a1.sources.r1.interceptors.i1.value = lee

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
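
Each event should now carry the static header defined above (author=lee). A sketch of the expected logger output; the body bytes are just the test line:
# start the agent with net-static.conf, then in another terminal:
[hadoop@hadoop101 ~]$ nc hadoop101 6666
hi static
OK
# illustrative logger output
Event: { headers:{author=lee} body: 68 69 20 73 74 61 74 69 63    hi static }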

Custom interceptor

  1. Write a class that implements Interceptor and converts each record into JSON (modeled on the source code of the HostInterceptor class)
    First add the dependencies
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.7.0</version>
</dependency>

<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.72</version>
</dependency>
package com.bigdata.demo;

import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.List;

public class LogInterceptor implements Interceptor {

    private String colName;
    private String separator;
    private HashMap<String,Object> map;

    private LogInterceptor(String colName, String separator){
        this.colName = colName;
        this.separator = separator;
    }
    @Override
    public void initialize() {
        map = new HashMap<>();
    }

    @Override
    public Event intercept(Event event) {
        map.clear();
        byte[] body = event.getBody();
        try {
            String data = new String(body,"UTF-8");
            String[] datas = data.split(separator);
            String[] fields = colName.split(",");
            if(fields.length != datas.length){
                return null;
            }
            for (int i = 0; i < datas.length; i++) {
                map.put(fields[i],datas[i]);
            }

            // convert the map to JSON
            String json = JSONObject.toJSONString(map);
            // set the JSON string as the new event body
            event.setBody(json.getBytes("UTF-8"));

        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = Lists.newArrayList();
        for (Event event : events) {
            Event outEvent = intercept(event);
            if (outEvent != null) {
                out.add(outEvent);
            }
        }
        return out;
    }

    @Override
    public void close() {
        //no-op
    }

    public static class Builder implements Interceptor.Builder {
        private String colName;
        private String separator;

        @Override
        public Interceptor build() {
            return new LogInterceptor(colName,separator);
        }

        @Override
        public void configure(Context context) {
            colName = context.getString("colName","");
            separator = context.getString("separator"," ");
        }
    }
}
  2. Package the code into a jar and add it to Flume's lib directory
  3. Write the configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the interceptor
a1.sources.r1.interceptors = i1
# The fully-qualified name refers to the compiled class, not the .java file; an inner class is referenced with a $
a1.sources.r1.interceptors.i1.type = com.bigdata.demo.LogInterceptor$Builder
a1.sources.r1.interceptors.i1.colName = id,name,age
a1.sources.r1.interceptors.i1.separator = ,

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Configure the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
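
An end-to-end sketch (the jar name is hypothetical and depends on your pom; colName is id,name,age and the separator is a comma, so a record such as 1,tom,20 should come out of the interceptor as a JSON object):
# package the interceptor and put it on Flume's classpath (jar name is illustrative)
mvn clean package
cp target/flume-interceptor-demo.jar /home/hadoop/apps/flume/lib/

# start the agent with the configuration above, then send a test record
[hadoop@hadoop101 ~]$ nc hadoop101 6666
1,tom,20
OK

# the file_roll sink writes the transformed events into /home/hadoop/data
[hadoop@hadoop101 ~]$ ls -lt /home/hadoop/data | head
[hadoop@hadoop101 ~]$ cat /home/hadoop/data/<newest rolled file>
{"age":"20","name":"tom","id":"1"}
# illustrative output; the key order of the HashMap is not guaranteed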

Custom HBase serializer

  1. Implement AsyncHbaseEventSerializer to write the data into an HBase table (modeled on the source code of SimpleAsyncHbaseEventSerializer)
    First add the dependency
<dependency>
<groupId>org.apache.flume.flume-ng-sinks</groupId>
<artifactId>flume-ng-hbase-sink</artifactId>
<version>1.7.0</version>
</dependency>
package com.bigdata.demo;

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

import java.util.ArrayList;
import java.util.List;

public class LogHbaseEventSerializer implements AsyncHbaseEventSerializer {
    private byte[] table;
    private byte[] cf;
    private byte[] payload;
    private byte[] incrementColumn;
    private byte[] incrementRow;
    private String separator;
    private String pCol;

    @Override
    public void initialize(byte[] table, byte[] cf) {
        this.table = table;
        this.cf = cf;
    }

    @Override
    public List<PutRequest> getActions() {
        List<PutRequest> actions = new ArrayList<PutRequest>();
        if (pCol != null) {
            byte[] rowKey;
            try {
                // use the user id in the collected record as the rowkey
                // split the record to get the id
                String data = new String(payload);
                String[] strings = data.split(separator);
                String[] fields = pCol.split(",");

                if(strings.length != fields.length){
                    return actions;
                }

                String id = strings[0];
                rowKey = id.getBytes("UTF-8");

                for (int i = 0; i < strings.length; i++) {
                    PutRequest putRequest =  new PutRequest(table, rowKey, cf,
                            fields[i].getBytes("UTF-8"), strings[i].getBytes("UTF-8"));
                    actions.add(putRequest);
                }

            } catch (Exception e) {
                throw new FlumeException("Could not get row key!", e);
            }
        }
        return actions;
    }

    public List<AtomicIncrementRequest> getIncrements() {
        List<AtomicIncrementRequest> actions = new ArrayList<AtomicIncrementRequest>();
        if (incrementColumn != null) {
            AtomicIncrementRequest inc = new AtomicIncrementRequest(table,
                    incrementRow, cf, incrementColumn);
            actions.add(inc);
        }
        return actions;
    }

    @Override
    public void cleanUp() {
        // TODO Auto-generated method stub

    }

    @Override
    public void configure(Context context) {
        // column names for the HBase table
        pCol = context.getString("colName", "pCol");
        // separator of the data collected by Flume
        separator = context.getString("separator", ",");
        String iCol = context.getString("incrementColumn", "iCol");


        if (iCol != null && !iCol.isEmpty()) {
            incrementColumn = iCol.getBytes(Charsets.UTF_8);
        }
        incrementRow = context.getString("incrementRow", "incRow").getBytes(Charsets.UTF_8);
    }

    @Override
    public void setEvent(Event event) {
        this.payload = event.getBody();
    }

    @Override
    public void configure(ComponentConfiguration conf) {
        // TODO Auto-generated method stub
    }
}
  2. Package the code into a jar and add it to Flume's lib directory
  3. Write the configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sink
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = myhbase
a1.sinks.k1.columnFamily = c
a1.sinks.k1.serializer = com.bigdata.demo.LogHbaseEventSerializer
a1.sinks.k1.serializer.colName = id,name,age
a1.sinks.k1.serializer.separator = ,

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
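
An end-to-end sketch (the jar name is hypothetical; table myhbase and column family c come from the configuration above, and a record 1,tom,20 should be written with rowkey 1 and columns id, name, age):
# package the serializer and put it on Flume's classpath (jar name is illustrative)
mvn clean package
cp target/flume-hbase-serializer-demo.jar /home/hadoop/apps/flume/lib/

# the table must exist before the agent starts
hbase(main):001:0> create 'myhbase', 'c'

# start the agent with the configuration above, then send a test record
[hadoop@hadoop101 ~]$ nc hadoop101 6666
1,tom,20
OK
# scan 'myhbase' in the hbase shell: rowkey 1 should have cells c:id=1, c:name=tom, c:age=20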

Internal workings of the agent

(Figure: internal components of a Flume agent)

Flume failover and load balancing

Both are implemented with sink groups.

Failover

A sink group is bound to a single channel, and only one sink in the group takes data at a time. If that sink fails, another sink in the group takes over.

Each sink has an associated priority; the larger the number, the higher the priority.

failover.conf:

a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Configure the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4545

a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /home/hadoop/data/flume

# Failover configuration
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 50
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
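
A quick failover test sketch: with both agents up, events are taken by k1 (priority 50, the avro sink to hadoop102); stop the agent on hadoop102 and later events should be taken over by k2, the file_roll sink (whose directory must exist).
# start the downstream avro agent on hadoop102 first, then this agent on hadoop101
[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flume
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f failover.conf -Dflume.root.logger=INFO,console

[hadoop@hadoop101 ~]$ nc hadoop101 6666
first event
OK
# "first event" is delivered by k1 (avro) to the agent on hadoop102
# stop the agent on hadoop102 (Ctrl+C), then send another line through the same nc session:
second event
OK
# once the failover kicks in, "second event" is written by k2 into /home/hadoop/data/flume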

Load balancing

The load-balancing sink processor provides the ability to balance the flow over multiple sinks.

It maintains an indexed list of active sinks across which the load must be distributed.

The implementation supports distributing the load using either a round_robin or a random selection mechanism.

The selection mechanism defaults to round_robin, but this can be overridden via configuration.

a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop101
a1.sources.r1.port = 6666

# Configure the channel
a1.channels.c1.type = memory

# Load-balancing configuration
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin

# Configure the sinks
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/hadoop/data/flume01

a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /home/hadoop/data/flume

# Bind the channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
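
Both file_roll directories must exist before the agent starts. With round_robin selection the events sent through netcat should end up split between the two directories. A test sketch (the configuration file name loadbalance.conf is assumed, since the text above does not name it):
[hadoop@hadoop101 ~]$ mkdir -p /home/hadoop/data/flume01 /home/hadoop/data/flume
[hadoop@hadoop101 conf]$ flume-ng agent -n a1 -c ./ -f loadbalance.conf -Dflume.root.logger=INFO,console

# send several lines through netcat, then compare the two output directories
[hadoop@hadoop101 ~]$ nc hadoop101 6666
[hadoop@hadoop101 ~]$ ls -l /home/hadoop/data/flume01 /home/hadoop/data/flume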