Does Spark have streaming SQL?

Reposted

时光机3号 2024-09-21 07:24:59


Developing a custom Flume interceptor
1) In IDEA, add a separate module to the spark-log4j project:
Module ---> Maven ---> Next ---> ArtifactId: log-flume ---> Next ---> Module name: log-flume ---> Finish
2) Open the parent pom.xml
Add the Flume version:

<properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
        <flume.version>1.6.0-cdh5.16.2</flume.version>
</properties>

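Add the dependency (presumably inside <dependencyManagement>, since the child module in step 3 declares it without a version):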

<dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>${flume.version}</version>
</dependency>

Right-click an empty area of pom.xml ---> Maven ---> Reload project
3) Open the child module's pom.xml, ...\IdeaProjects\spark-log4j\log-flume\pom.xml
Add the dependency:

<dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
        </dependency>
</dependencies>

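Right-click an empty area of pom.xml ---> Maven ---> Reload project
Check the dependency download status in the Maven panel on the right.
4) Create the package
Go to C:\Users\jieqiong\IdeaProjects\spark-log4j\log-flume\src\main\java
Create the package: com.imooc.bigdata.flume
Create the domain interceptor class: DomainIntercepter.java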

package com.imooc.bigdata.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/*
 *  1. Implement the interface Flume provides: implements Interceptor
 *     (import org.apache.flume.interceptor.Interceptor).
 *  2. Generate the required methods with "Implement methods":
 *     initialize()                 - initialization
 *     intercept(Event event)       - handles a single event
 *     intercept(List<Event> list)  - handles a batch of events
 *     close()                      - releases resources
 *
 *  How is it written?
 *  Flume processes data in units of events, so first declare a collection:
 *  List<Event> events;
 *  and initialize it in initialize(): events = new ArrayList<Event>();
 *
 *  The single-event method is the one that matters, because the batch method
 *  simply calls it for every event.
 *  All the data is carried by the event passed to intercept(Event event).
 *  The interceptor needs the headers: type event.getHeaders(). and choose "var"
 *  to generate Map<String, String> headers = event.getHeaders();
 *
 *  event.getBody() returns byte[], so decode it into a String:
 *  String body = new String(event.getBody(), StandardCharsets.UTF_8);
 *
 *  Single event:
 *  Check the domain: does the body contain imooc (or, say, gifshow)? if (body.contains("imooc"))
 *  Tag it: headers.put("type", "imooc");
 *  Return the event.
 *  The event body is not modified at all; only the headers are touched.
 *
 *  Batch:
 *  First clear the list left over from the previous batch.
 *  Then loop over the input, run each event through the single-event method,
 *  and add the result to events.
 *
 *  Cleanup: set events to null.
 *
 *  The interceptor convention in the official docs also requires a Builder.
 *  "Implement methods" offers two methods: build and configure.
 *  configure is left empty here.
 *  build only needs to return a new DomainIntercepter().
 *
 *  Packaging:
 *  Right side ---> Maven ---> log-flume ---> Lifecycle ---> package
 *  Left side ---> ...\IdeaProjects\spark-log4j\log-flume\target will contain log-flume-1.0.jar;
 *  upload it to the server.
 */
public class DomainIntercepter implements Interceptor {

    // Reused result list for the batch method; created in initialize()
    private List<Event> events;

    @Override
    public void initialize() {
        events = new ArrayList<Event>();
    }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Decode with an explicit charset instead of the platform default
        String body = new String(event.getBody(), StandardCharsets.UTF_8);

        // Tag the event in its headers; the body is left unchanged
        if (body.contains("imooc")) {
            headers.put("type", "imooc");
        } else {
            headers.put("type", "other");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        events.clear();
        for (Event event : list) {
            events.add(intercept(event)); // delegate to the single-event method
        }
        return events;
    }

    @Override
    public void close() {
        events = null;
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new DomainIntercepter();
        }

        @Override
        public void configure(Context context) {
            // no configuration needed
        }
    }
}
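One detail of the batch method is worth seeing in isolation: it returns the same list object on every call and clears it at the start of the next call. The sketch below uses plain strings as stand-ins for Flume events (BatchListSketch and its methods are hypothetical names, not Flume API) and compares that pattern with allocating a fresh list per call. Whether the difference matters depends on how the caller uses the returned list; the sketch only shows the aliasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in sketch (no Flume types): a reused result list versus a per-call list. */
class BatchListSketch {

    // Mirrors the interceptor above: one list instance, cleared and refilled on every call.
    private final List<String> shared = new ArrayList<>();

    List<String> tagWithSharedList(List<String> batch) {
        shared.clear();
        for (String body : batch) {
            shared.add(tag(body));
        }
        return shared; // every caller receives the same list object
    }

    // Alternative: allocate the result per call, so earlier results are never touched.
    List<String> tagWithLocalList(List<String> batch) {
        List<String> out = new ArrayList<>(batch.size());
        for (String body : batch) {
            out.add(tag(body));
        }
        return out;
    }

    // Same rule as intercept(Event): bodies containing "imooc" are tagged imooc, the rest other.
    static String tag(String body) {
        return (body.contains("imooc") ? "imooc" : "other") + ":" + body;
    }

    public static void main(String[] args) {
        BatchListSketch s = new BatchListSketch();

        List<String> first = s.tagWithSharedList(Arrays.asList("imooc.com"));
        s.tagWithSharedList(Arrays.asList("test.com", "pk.com"));
        System.out.println("shared list, first result now: " + first);

        List<String> firstLocal = s.tagWithLocalList(Arrays.asList("imooc.com"));
        s.tagWithLocalList(Arrays.asList("test.com", "pk.com"));
        System.out.println("local list, first result now:  " + firstLocal);
    }
}
```

With the shared list, the first result ends up holding the second batch; with the per-call list it still holds imooc:imooc.com.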

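Agent configuration for the custom Flume interceptor
The requirements analysis is in the earlier post "Big Data Spark Real-Time Processing: Data Collection 1 (Flume)".
1) Copy the sample configuration from the official user guide (flume/FlumeUserGuide.rst at trunk · apache/flume · GitHub)
and adapt it: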

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2) The first-tier agent is configured in flume01.conf

# source ---> interceptor ---> interceptor tags and splits the data ---> each stream goes to its own port

# Two channels and two sinks
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Receive data with a netcat source bound to this host (spark000) on port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = spark000
a1.sources.r1.port = 44444

# Define the interceptor
# Host Interceptor
# Interceptor name and type
# This is the custom interceptor: in IDEA, use Copy Reference on DomainIntercepter
# and append $Builder to it
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.imooc.bigdata.flume.DomainIntercepter$Builder

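# How does data get from the source to the channels?
# ---> through the Multiplexing Channel Selector, which routes on the header set by the interceptor
# The header name and values below match the IDEA code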
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.imooc = c1
a1.sources.r1.selector.mapping.other = c2

# Both channels are of type memory
a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = spark000
a1.sinks.k1.port = 44445

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = spark000
a1.sinks.k2.port = 44446

a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

3) The first agent of the second tier is configured in flume02.conf
It receives from the previous agent, flume01.conf

a2.sources = r1
a2.sinks = k1
a2.channels = c1

a2.sources.r1.type = avro
a2.sources.r1.bind = spark000
a2.sources.r1.port = 44445

# Use a memory channel
a2.channels.c1.type = memory

# Print the results to the console
a2.sinks.k1.type = logger

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

4) The second agent of the second tier is configured in flume03.conf
It receives from the previous agent, flume01.conf

a3.sources = r1
a3.sinks = k1
a3.channels = c1

a3.sources.r1.type = avro
a3.sources.r1.bind = spark000
a3.sources.r1.port = 44446

# Use a memory channel
a3.channels.c1.type = memory

# Print the results to the console
a3.sinks.k1.type = logger

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

Testing the custom Flume interceptor
1) Write the three agent configuration files above

[hadoop@spark000 lib]$ cd /home/hadoop/app/apache-flume-1.6.0-cdh5.16.2-bin/config
[hadoop@spark000 config]$ vi flume01.conf
[hadoop@spark000 config]$ vi flume02.conf
[hadoop@spark000 config]$ vi flume03.conf

2) Upload the locally built jar

[hadoop@spark000 lib]$ pwd
/home/hadoop/app/apache-flume-1.6.0-cdh5.16.2-bin/lib
[hadoop@spark000 lib]$ ls
log-flume-1.0.jar

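3) Start the downstream agents first; either agent 2 or agent 3 can go first. Here agent 3 is started first.
Open three Xshell sessions connected to spark000
Once started, it listens on port 44446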

flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume03.conf \
--name a3 \
-Dflume.root.logger=INFO,console

4) Then start agent 2
It listens on port 44445

flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume02.conf \
--name a2 \
-Dflume.root.logger=INFO,console

5) Finally start agent 1
It listens on port 44444, and the log shows connections to ports 44445 and 44446

flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume01.conf \
--name a1 \
-Dflume.root.logger=INFO,console

6) Connect to port 44444 with telnet
Open a new session and run:

telnet spark000 44444

imooc.com
test.com
gifshow.com
pk.com
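To see where each of the four test lines should end up, the interceptor rule and the selector mapping from flume01.conf can be walked through in plain Java. This is only a sketch of the routing logic under the configuration shown earlier; RoutingSketch and its methods are hypothetical names and none of Flume's own classes are involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java walk-through of the interceptor rule plus the selector mapping (no Flume classes). */
class RoutingSketch {

    // selector.mapping.* from flume01.conf: value of the "type" header -> channel
    private static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("imooc", "c1"); // c1 -> sink k1 -> port 44445 (agent a2)
        MAPPING.put("other", "c2"); // c2 -> sink k2 -> port 44446 (agent a3)
    }

    // Same rule as DomainIntercepter.intercept(Event)
    static String typeHeader(String body) {
        return body.contains("imooc") ? "imooc" : "other";
    }

    // Channel the multiplexing selector would choose for this body
    static String channelFor(String body) {
        return MAPPING.get(typeHeader(body));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("imooc.com", "test.com", "gifshow.com", "pk.com");
        for (String line : lines) {
            System.out.println(line + " -> type=" + typeHeader(line) + " -> " + channelFor(line));
        }
    }
}
```

Under these rules only imooc.com goes to c1 and should appear in the agent 2 console (port 44445); test.com, gifshow.com and pk.com go to c2 and should appear in the agent 3 console (port 44446).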

Using Flume to collect the logs written by the log server
1) Connect the project's data to Flume.
2) Configure access-collect.conf

[hadoop@spark000 config]$ pwd
/home/hadoop/app/apache-flume-1.6.0-cdh5.16.2-bin/config
[hadoop@spark000 config]$ ls
access-collect.conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /home/hadoop/tmp/position/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/hadoop/logs/access.log
a1.sources.r1.headers.f1.headerKey1 = pk
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3) Start the agent

flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/access-collect.conf \
--name a1 \
-Dflume.root.logger=INFO,console

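4) The data from the previous chapter is now on disk.
5) Test again: open the code and emit some more data. The test currently does not pass: C:\Users\jieqiong\IdeaProjects\spark-log4j\log-service\src\main\java\com\imooc\bigdata\log\utils\Test.java
The cause is the same as before: LogGenerator is not recognized.
6) In short, the log server's data lands on disk in /home/hadoop/logs/access.log, and Flume collects it from there.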
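The positionFile setting in access-collect.conf is what lets the TAILDIR source pick up where it stopped. The idea can be sketched in plain Java: read from a saved offset, then save the new offset. This is an illustration of offset checkpointing only, not Flume's TAILDIR code; OffsetResumeSketch, the temp files and the plain-text position file are assumptions made for the example (Flume's real position file is JSON).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Sketch of offset checkpointing; not Flume's TAILDIR implementation. */
class OffsetResumeSketch {

    // Reads the lines after the offset stored in positionFile, then stores the new offset
    // so that the next call (or a restarted process) continues from there.
    static List<String> readNewLines(Path log, Path positionFile) throws IOException {
        long offset = 0L;
        if (Files.exists(positionFile)) {
            String saved = new String(Files.readAllBytes(positionFile), StandardCharsets.UTF_8).trim();
            if (!saved.isEmpty()) {
                offset = Long.parseLong(saved);
            }
        }
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
            raf.seek(offset);
            String line;
            // readLine() maps each byte to one char, which is fine for the ASCII demo data
            while ((line = raf.readLine()) != null) {
                lines.add(line);
            }
            offset = raf.getFilePointer();
        }
        Files.write(positionFile, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
        return lines;
    }

    // Two reads with an append in between, as if the collector had been restarted.
    static List<List<String>> demo() {
        try {
            Path dir = Files.createTempDirectory("offset-sketch");
            Path log = dir.resolve("access.log");
            Path pos = dir.resolve("position.txt");
            List<List<String>> runs = new ArrayList<>();

            Files.write(log, "line1\nline2\n".getBytes(StandardCharsets.UTF_8));
            runs.add(readNewLines(log, pos));

            Files.write(log, "line3\n".getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
            runs.add(readNewLines(log, pos));
            return runs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<List<String>> runs = demo();
        System.out.println("first run:  " + runs.get(0));
        System.out.println("second run: " + runs.get(1));
    }
}
```

The second run returns only line3, because the saved offset already points past line1 and line2.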

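Thoughts on Flume high availability
1) Each log server has one Flume agent (the first tier).
The first tier has as many agents as there are log servers.
2) Two-tier Flume architecture
When an upstream agent talks to a downstream agent, the upstream side sends through an Avro sink and the downstream side receives through an Avro source.
Agents on different machines communicate over RPC, and Avro is the RPC mechanism used here.
In other words, the upstream sink and the downstream source must both be of type avro.
3) First-tier Flume. To avoid losing data:
Source: choose TAILDIR. While processing, it periodically writes the read offset of every tailed file to a JSON position file, so the file records how far each file has been processed. When the next batch is read, the previous offset is taken from that JSON file. Even if Flume dies while data keeps being written to the log, a restarted Flume continues from the recorded offset. This is its main strength.
Channel: choose the File Channel and configure checkpointDir and the data directory (dataDirs). The File Channel persists events to disk, so events buffered in the channel survive an agent crash or restart.
Sinks: use two sinks in a failover group; whether high availability actually holds depends mostly on the sink side. Give each sink a priority; if the higher-priority sink fails, events are switched automatically to the lower-priority sink.
4) Second-tier Flume. Configure it the same way as the first tier.
The extra tier adds an aggregation step: it improves safety, reduces the number of concurrent writers hitting HDFS directly, and reduces the small-files problem (on the agent-to-HDFS leg).
The second tier certainly has fewer agents than the first.
5) After Flume, data goes to HDFS for offline processing or to Kafka for real-time processing.
6) Conversely, if HDFS has a problem and the sinks cannot deliver, the data stays in the channels of the second-tier agents and the first-tier agents are unaffected.
7) In addition, the log servers are decoupled from HDFS on the big-data side: upgrading the back end does not affect the front-end log servers or Flume.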
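The failover behaviour described in point 3 comes down to "send through the live sink with the highest priority". A minimal Java sketch of that selection rule follows. It is an illustration only, not Flume's failover sink processor: FailoverSketch, Sink and the priority values 10 and 5 are hypothetical, and the instant recovery of k1 at the end is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of priority-based failover; the names are hypothetical, not Flume classes. */
class FailoverSketch {

    static final class Sink {
        final String name;
        final int priority;   // larger value = preferred
        boolean alive = true;

        Sink(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Returns the live sink with the largest priority value, or null if none is alive.
    static Sink pick(List<Sink> sinks) {
        Sink best = null;
        for (Sink s : sinks) {
            if (s.alive && (best == null || s.priority > best.priority)) {
                best = s;
            }
        }
        return best;
    }

    // Records which sink is chosen before a failure, during it, and after recovery.
    static List<String> demo() {
        Sink k1 = new Sink("k1", 10);
        Sink k2 = new Sink("k2", 5);
        List<Sink> group = new ArrayList<>();
        group.add(k1);
        group.add(k2);

        List<String> chosen = new ArrayList<>();
        chosen.add(pick(group).name); // both alive: the higher priority wins
        k1.alive = false;
        chosen.add(pick(group).name); // k1 down: fall back to k2
        k1.alive = true;
        chosen.add(pick(group).name); // k1 back: preferred again
        return chosen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The demo chooses k1, then k2 while k1 is marked down, then k1 again once it is back.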