1、Introduction to Flume
Flume is a distributed, reliable, and highly available system for aggregating massive amounts of log data. It supports custom data senders for collecting data, and it can perform simple processing on that data and write it to a variety of (customizable) data receivers.
Design goals:
Reliability: when a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability guarantees, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the data has been delivered successfully; if delivery fails, the data can be resent), store on failure (the strategy also used by Scribe: when the receiver crashes, data is written locally and sending resumes once the receiver recovers), and best effort (no acknowledgment is performed after data is sent to the receiver).
Scalability: Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by a master, which makes the system easy to monitor and maintain; multiple masters are allowed (managed and load-balanced with ZooKeeper), which avoids a single point of failure.
Manageability: all agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. Users can view the status of each data source or data flow on the master, and can configure and dynamically reload each data source. Flume provides both a web interface and shell script commands for managing data flows.
Functional extensibility: users can add their own agents, collectors, or storage as needed. In addition, Flume ships with many built-in components, including various agents (file, syslog, etc.), collectors, and storage backends (file, HDFS, etc.).
A typical real-time system is built from the following components:
Data collection: collects data from each node in real time; Cloudera's Flume is chosen here.
Data ingestion: because the speed of data collection and the speed of data processing are not necessarily in sync, a message middleware is added as a buffer; Apache Kafka is chosen.
Stream processing: performs real-time analysis on the collected data; Apache Storm is chosen.
Data output: persists the analysis results, tentatively to MySQL. A further benefit of this modular design is that if Storm goes down, data collection and ingestion keep running and no data is lost; once Storm comes back up, stream processing can resume.
2、Core concepts of Flume
The core abstractions in Flume are the Event (the unit of data being transferred), the Source (which receives or consumes events), the Channel (which buffers events between source and sink), the Sink (which delivers events to the next agent or to a final store), and the Agent (a JVM process that hosts sources, channels, and sinks).
3、Overall architecture of Flume
Within an agent, events flow from a Source through a Channel to a Sink; a sink can write to a final destination such as HDFS, or forward events to the source of a downstream agent, so agents can be chained into multi-hop topologies.
4、Example applications
4.1 Flume receives data and writes it to a specified file
flume.conf configuration
# Flume agent config
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
# For using a thrift source set the following instead of the above line.
# a1.sources.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sinks.k1.channel = c1
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory =/var/flume_log
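Assuming this configuration is saved as flume.conf, the agent can be started with the standard flume-ng command; the configuration directory and file path below are assumptions, so adjust them to your installation:
# Start the agent named a1 (the name must match the property prefix in flume.conf)
flume-ng agent \
  --conf /etc/flume/conf \
  --conf-file /etc/flume/conf/flume.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console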
Application launcher class App.java
package com.bigdata.flume;
/**
 * @ClassName: App
 * @Description: Launcher class that sends sample events to a remote Flume agent.
 * @author: Jimu
 * @email: maker2win@163.com
 * @date: February 11, 2019, 5:23:10 PM
 *
 * @Copyright: 2019 www.maker-win.net Inc. All rights reserved.
 *
 */
public class App {
    public static void main(String[] args) {
        MyRpcClientFacade client = new MyRpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("node6.sdp.cn", 41414);
        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        String sampleData = "Hello Flume!";
        for (int i = 0; i < 10; i++) {
            client.sendDataToFlume(sampleData);
            System.out.println(sampleData);
        }
        client.cleanUp();
    }
}
Facade class for sending data to the agent: MyRpcClientFacade.java
package com.bigdata.flume;
import java.nio.charset.Charset;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
/**
 * @ClassName: MyRpcClientFacade
 * @Description: Facade that wraps a Flume RpcClient and sends events to a remote agent.
 * @author: Jimu
 * @email: maker2win@163.com
 * @date: January 31, 2019, 4:54:21 PM
 *
 * @Copyright: 2019 www.maker-win.net Inc. All rights reserved.
 *
 */
public class MyRpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getDefaultInstance(hostname, port);
        // Use the following method to create a thrift client (instead of the above line):
        // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }

    public void sendDataToFlume(String data) {
        // Create a Flume Event object that encapsulates the sample data
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
        }
    }

    public void cleanUp() {
        // Close the RPC connection
        client.close();
    }
}
Maven dependency configuration
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.5.2</version>
</dependency>
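The RPC client classes used above (RpcClientFactory, EventBuilder) ship in the Flume NG SDK, which flume-ng-core normally pulls in transitively; if your build does not resolve them, the SDK can be declared explicitly (the version below is assumed to match flume-ng-core):
<!-- Optional: only needed if the SDK is not resolved transitively -->
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-sdk</artifactId>
    <version>1.5.2</version>
</dependency>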
Create the log file directory on the node where the agent runs (the sink above writes to /var/flume_log):
# Create the directory
mkdir flume_log
# Change the owner and group of the directory
chown flume:hadoop ./flume_log
cd flume_log
After starting App.java, check flume_log to confirm that content files have been generated.
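For example, on the agent node (the file names are generated by the file_roll sink, so the exact names will differ):
# List the rolled files produced by the file_roll sink and inspect their contents
ls -l /var/flume_log
cat /var/flume_log/*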
4.2 Flume receives data from Kafka and writes it to HDFS
# Define the sources, channel, and sink
agent.sources = kafkaSource
agent.channels = memoryChannel
agent.sinks = hdfsSink
agent.sources.kafkaSource.channels = memoryChannel
agent.sinks.hdfsSink.channel = memoryChannel
agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSource.zookeeperConnect = node3.sdp.cn:2181,node4.sdp.cn:2181,node5.sdp.cn:2181,node6.sdp.cn:2181
agent.sources.kafkaSource.topic = applog
agent.sources.kafkaSource.groupId = flume
agent.sources.kafkaSource.kafka.consumer.timeout.ms = 100
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity=10000
agent.channels.memoryChannel.transactionCapacity=1000
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://node1.sdp.cn:8020/tmp/applogs/%{logApp}-%Y%m%d
agent.sinks.hdfsSink.hdfs.writeFormat = Text
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 600
agent.sinks.hdfsSink.hdfs.filePrefix=applog
agent.sinks.hdfsSink.hdfs.fileSuffix=.log
agent.sinks.hdfsSink.hdfs.inUsePrefix=_
agent.sinks.hdfsSink.hdfs.inUseSuffix=
agent.sources.kafkaSource.interceptors=i1
agent.sources.kafkaSource.interceptors.i1.type=org.bigdata.flume.LogInterceptor$Builder
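The HDFS path above uses the %{logApp} event header, which is expected to be set by the custom interceptor org.bigdata.flume.LogInterceptor configured on the source. That class is not shown in this article; the following is only a minimal sketch of what such an interceptor might look like, assuming its sole job is to add the logApp header (the header value here is a placeholder; a real implementation would likely derive it from the event body or existing headers):
package org.bigdata.flume;

import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/**
 * Hypothetical sketch of the LogInterceptor referenced in the configuration.
 * It tags every event with a "logApp" header so that the HDFS sink can use
 * %{logApp} in its path.
 */
public class LogInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No initialization needed for this sketch
    }

    @Override
    public Event intercept(Event event) {
        // Placeholder logic: tag every event with a fixed application name.
        // A real interceptor would likely parse the event body to decide the value.
        event.getHeaders().put("logApp", "applog");
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // Nothing to release
    }

    /** Builder referenced as org.bigdata.flume.LogInterceptor$Builder in the configuration */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new LogInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configuration options in this sketch
        }
    }
}
The agent is started the same way as in section 4.1, except that the agent name here is agent, and the interceptor class must be on Flume's classpath (for example via a plugins.d directory or FLUME_CLASSPATH in flume-env.sh).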
After the agent has been running for a while, check the execution results on HDFS and the contents of the generated log files.
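The results can also be checked from the command line; the directory name below assumes the interceptor sets logApp to applog, and the date is only an example:
# List the daily directories created by the HDFS sink and view a rolled log file
hdfs dfs -ls /tmp/applogs/
hdfs dfs -cat /tmp/applogs/applog-20190211/applog.*.log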