1、Introduction to Flume
Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. It supports custom data senders for collecting data, and it can perform simple processing on that data and write it to a variety of (customizable) data receivers.
Design goals:
Reliability: when a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability guarantees, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the data has been delivered successfully; if delivery fails, the data can be resent), store on failure (the strategy also used by Scribe: when the receiver crashes, data is written locally and sending resumes once the receiver recovers), and best effort (no acknowledgment is performed after data is sent to the receiver).
Scalability: Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by a master, which makes the system easy to monitor and maintain; multiple masters are allowed (managed and load-balanced with ZooKeeper), which avoids a single point of failure.
Manageability: all agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. Users can view the status of each data source or data flow on the master, and can configure and dynamically reload each data source. Flume provides both a web interface and shell script commands for managing data flows.
Functional extensibility: users can add their own agents, collectors, or storage as needed. In addition, Flume ships with many built-in components, including various agents (file, syslog, etc.), collectors, and storage backends (file, HDFS, etc.).
A typical real-time system is built from the following components:
Data collection: collects data from each node in real time; Cloudera's Flume is chosen here.
Data ingestion: because the speed of data collection and the speed of data processing are not necessarily in sync, a message middleware is added as a buffer; Apache Kafka is chosen.
Stream processing: performs real-time analysis on the collected data; Apache Storm is chosen.
Data output: persists the analysis results, tentatively to MySQL. A further benefit of this modular design is that if Storm goes down, data collection and ingestion keep running and no data is lost; once Storm comes back up, stream processing can resume.
2、Core concepts of Flume
The core abstractions in Flume are the Event (the unit of data being transferred), the Source (which receives or consumes events), the Channel (which buffers events between source and sink), the Sink (which delivers events to the next agent or to a final store), and the Agent (a JVM process that hosts sources, channels, and sinks).
3、Overall architecture of Flume
Within an agent, events flow from a Source through a Channel to a Sink; a sink can write to a final destination such as HDFS, or forward events to the source of a downstream agent, so agents can be chained into multi-hop topologies.
4、Example applications
4.1 Flume receives data and writes it to a specified file
flume.conf configuration
# Flume agent config
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
# For using a thrift source set the following instead of the above line.
# a1.sources.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sinks.k1.channel = c1
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory =/var/flume_log
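Assuming this configuration is saved as flume.conf, the agent can be started with the standard flume-ng command; the configuration directory and file path below are assumptions, so adjust them to your installation:
# Start the agent named a1 (the name must match the property prefix in flume.conf)
flume-ng agent \
  --conf /etc/flume/conf \
  --conf-file /etc/flume/conf/flume.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console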
Application launcher class App.java
package com.bigdata.flume;
/**
 * @ClassName: App
 * @Description: Launcher class that sends sample events to a remote Flume agent.
 * @author: Jimu
 * @email: maker2win@163.com
 * @date: February 11, 2019, 5:23:10 PM
 *
 * @Copyright: 2019 www.maker-win.net Inc. All rights reserved.
 *
 */
public class App {
    public static void main(String[] args) {
        MyRpcClientFacade client = new MyRpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("node6.sdp.cn", 41414);
        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        String sampleData = "Hello Flume!";
        for (int i = 0; i < 10; i++) {
            client.sendDataToFlume(sampleData);
            System.out.println(sampleData);
        }
        client.cleanUp();
    }
}
Facade class for sending data to the agent: MyRpcClientFacade.java
package com.bigdata.flume;
import java.nio.charset.Charset;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
/**
 * @ClassName: MyRpcClientFacade
 * @Description: Facade that wraps a Flume RpcClient and sends events to a remote agent.
 * @author: Jimu
 * @email: maker2win@163.com
 * @date: January 31, 2019, 4:54:21 PM
 *
 * @Copyright: 2019 www.maker-win.net Inc. All rights reserved.
 *
 */
public class MyRpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getDefaultInstance(hostname, port);
        // Use the following method to create a thrift client (instead of the above line):
        // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }

    public void sendDataToFlume(String data) {
        // Create a Flume Event object that encapsulates the sample data
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
        }
    }

    public void cleanUp() {
        // Close the RPC connection
        client.close();
    }
}
Maven dependency configuration
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.5.2</version>
</dependency>
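The RPC client classes used above (RpcClientFactory, EventBuilder) ship in the Flume NG SDK, which flume-ng-core normally pulls in transitively; if your build does not resolve them, the SDK can be declared explicitly (the version below is assumed to match flume-ng-core):
<!-- Optional: only needed if the SDK is not resolved transitively -->
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-sdk</artifactId>
    <version>1.5.2</version>
</dependency>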
Create the log file directory on the node where the agent runs (the sink above writes to /var/flume_log):
# Create the directory
mkdir flume_log
# Change the owner and group of the directory
chown flume:hadoop ./flume_log
cd flume_log
After starting App.java, check flume_log to confirm that content files have been generated.
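For example, on the agent node (the file names are generated by the file_roll sink, so the exact names will differ):
# List the rolled files produced by the file_roll sink and inspect their contents
ls -l /var/flume_log
cat /var/flume_log/*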
4.2 Flume receives data from Kafka and writes it to HDFS
# Define the sources, channel, and sink
agent.sources = kafkaSource
agent.channels = memoryChannel
agent.sinks = hdfsSink
agent.sources.kafkaSource.channels = memoryChannel
agent.sinks.hdfsSink.channel = memoryChannel
agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSource.zookeeperConnect = node3.sdp.cn:2181,node4.sdp.cn:2181,node5.sdp.cn:2181,node6.sdp.cn:2181
agent.sources.kafkaSource.topic = applog
agent.sources.kafkaSource.groupId = flume
agent.sources.kafkaSource.kafka.consumer.timeout.ms = 100
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity=10000
agent.channels.memoryChannel.transactionCapacity=1000
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://node1.sdp.cn:8020/tmp/applogs/%{logApp}-%Y%m%d
agent.sinks.hdfsSink.hdfs.writeFormat = Text
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 600
agent.sinks.hdfsSink.hdfs.filePrefix=applog
agent.sinks.hdfsSink.hdfs.fileSuffix=.log
agent.sinks.hdfsSink.hdfs.inUsePrefix=_
agent.sinks.hdfsSink.hdfs.inUseSuffix=
agent.sources.kafkaSource.interceptors=i1
agent.sources.kafkaSource.interceptors.i1.type=org.bigdata.flume.LogInterceptor$Builder
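The HDFS path above uses the %{logApp} event header, which is expected to be set by the custom interceptor org.bigdata.flume.LogInterceptor configured on the source. That class is not shown in this article; the following is only a minimal sketch of what such an interceptor might look like, assuming its sole job is to add the logApp header (the header value here is a placeholder; a real implementation would likely derive it from the event body or existing headers):
package org.bigdata.flume;

import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/**
 * Hypothetical sketch of the LogInterceptor referenced in the configuration.
 * It tags every event with a "logApp" header so that the HDFS sink can use
 * %{logApp} in its path.
 */
public class LogInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No initialization needed for this sketch
    }

    @Override
    public Event intercept(Event event) {
        // Placeholder logic: tag every event with a fixed application name.
        // A real interceptor would likely parse the event body to decide the value.
        event.getHeaders().put("logApp", "applog");
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // Nothing to release
    }

    /** Builder referenced as org.bigdata.flume.LogInterceptor$Builder in the configuration */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new LogInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configuration options in this sketch
        }
    }
}
The agent is started the same way as in section 4.1, except that the agent name here is agent, and the interceptor class must be on Flume's classpath (for example via a plugins.d directory or FLUME_CLASSPATH in flume-env.sh).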
After the agent has been running for a while, check the execution results on HDFS and the contents of the generated log files.
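The results can also be checked from the command line; the directory name below assumes the interceptor sets logApp to applog, and the date is only an example:
# List the daily directories created by the HDFS sink and view a rolled log file
hdfs dfs -ls /tmp/applogs/
hdfs dfs -cat /tmp/applogs/applog-20190211/applog.*.log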