Pipeline overview

Flume acts as the producer, monitoring a txt file whose data is appended by a Python script. The Flume sink is set to a Kafka topic, so whenever new data is appended to the txt file, Flume collects it and delivers it to that topic. A Spark Streaming application is started to consume the topic: it processes the data pulled from Kafka and writes the results into a Hive table.

Environment preparation

1. Start the ZooKeeper cluster
2. Start Kafka (master)
3. Start Flume (master)
4. Start the Hadoop cluster
5. Start MySQL (master)
6. Start Hive (master)

Create a partitioned table in Hive

hive (mydb)> create table order_partition(order_id string,user_id string)
           > partitioned by(dt string);

Flume configuration file

[root@master conf]# vi flume_kafka.conf 
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Monitor /home/boya/flume_exec_test.txt
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/boya/flume_exec_test.txt

#a1.sinks.k1.type = logger
# Configure the Kafka sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka broker address and port
a1.sinks.k1.brokerList=master:9092
# Kafka topic
a1.sinks.k1.topic=mytest
# Serializer class
a1.sinks.k1.serializer.class=kafka.serializer.StringEncoder

# use a channel which buffers events in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

JSON data source

The data written to the txt file is in JSON format, for example:

{"order_id": 2539329, "user_id": 1, "eval_set": "prior", "order_number": 1, "order_dow": 2, "hour": 8, "day": 0.0}
{"order_id": 2398795, "user_id": 1, "eval_set": "prior", "order_number": 2, "order_dow": 3, "hour": 7, "day": 15.0}
{"order_id": 473747, "user_id": 1, "eval_set": "prior", "order_number": 3, "order_dow": 3, "hour": 12, "day": 21.0}
...

Create an Orders class matching the JSON data

This class is used to parse each JSON record into an Orders object; note that the fields of Orders must correspond to the keys of the JSON data.

package com.spark_streaming;

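/**
 * POJO that mirrors the keys of each JSON order record;
 * fastjson uses it to deserialize one JSON line into an Orders object.
 */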
public class Orders {
    public String order_id;
    public String user_id;
    public String eval_set;
    public String order_number;
    public String order_dow;
    public String hour;
    public String day;

    public String getOrder_id() {
        return order_id;
    }

    public void setOrder_id(String order_id) {
        this.order_id = order_id;
    }

    public String getUser_id() {
        return user_id;
    }

    public void setUser_id(String user_id) {
        this.user_id = user_id;
    }

    public String getEval_set() {
        return eval_set;
    }

    public void setEval_set(String eval_set) {
        this.eval_set = eval_set;
    }

    public String getOrder_number() {
        return order_number;
    }

    public void setOrder_number(String order_number) {
        this.order_number = order_number;
    }

    public String getOrder_dow() {
        return order_dow;
    }

    public void setOrder_dow(String order_dow) {
        this.order_dow = order_dow;
    }

    public String getHour() {
        return hour;
    }

    public void setHour(String hour) {
        this.hour = hour;
    }

    public String getDay() {
        return day;
    }

    public void setDay(String day) {
        this.day = day;
    }


}
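
As a quick check of the field mapping, here is a minimal sketch (Scala, in the same package and with the same fastjson dependency as the streaming code below; the object name ParseOrderExample is only for illustration) that parses one of the sample JSON lines above into an Orders object:

package com.spark_streaming

import com.alibaba.fastjson.JSON

object ParseOrderExample {
  def main(args: Array[String]): Unit = {
    val line = """{"order_id": 2539329, "user_id": 1, "eval_set": "prior", "order_number": 1, "order_dow": 2, "hour": 8, "day": 0.0}"""
    // fastjson matches each JSON key to the Orders field of the same name
    // and casts the numeric values into the String fields
    val mess = JSON.parseObject(line, classOf[Orders])
    println(mess.getOrder_id + "," + mess.getUser_id)   // expected output: 2539329,1
  }
}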

Consume Kafka data with Spark Streaming (Receiver approach)

Spark Streaming consumes the Kafka data using the Receiver-based approach and appends the consumed data into the order_partition table in Hive.

The overall flow of the code is as follows:

  1. Set the ZooKeeper address, the Kafka topic, the batch interval, the consumer group name, and numThreads.
  2. Create a DStream that consumes from Kafka. The records it pulls are JSON strings, and the DStream receives them as (header, body) pairs; we only keep the body.
  3. Define a method that converts an RDD into a DataFrame. Inside it, create a SparkSession with Hive support enabled (the equivalent of a HiveContext) and with dynamic partitioning for the Hive table configured. For each String record in the incoming RDD, parse it into an Orders object using the Orders class defined above, then keep only the fields we want to insert into Hive by building a case class order object from them. Finally convert the whole RDD into a DataFrame, which at this point is effectively a two-column table.
  4. For every RDD in the DStream, convert it to a DataFrame, add a "dt" column whose value is the date, and append the result to the Hive table.

package com.spark_streaming

import com.alibaba.fastjson.JSON
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverTest {
  case class order(order_id:String,user_id:String)

  def main(args: Array[String]): Unit = {
    /**
      * group_id : Kafka consumer group
      * topic    : Kafka topic(s) to consume
      * exectime : batch interval of Spark Streaming
      * zkQuorum : ZooKeeper quorum address
      */
    val dt = "20190714"
    val Array(group_id,topic,exectime,zkQuorum) = Array("group_mytest","mytest","2","192.168.230.10:2181")

    val sparkConf = new SparkConf()
        //.setMaster("local[2]")
        .setAppName("Receiver Test")
    val ssc = new StreamingContext(sparkConf,Seconds(exectime.toInt))

    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    // In production there is usually more than one topic to consume
    val topicSet = topic.split(",").toSet
    val numThreads = 1
    val topicMap = topicSet.map((_,numThreads.toInt)).toMap

    // lines receives records in the form (null, e), i.e. (header, body); we only want the body
    val lines = KafkaUtils.createStream(ssc,zkQuorum,group_id,topicMap).map(_._2)
    //lines.map((_,1L)).reduceByKey(_+_).print()


    // Define a method that converts an RDD of JSON strings into a DataFrame
    def rdd2DF(rdd:RDD[String]):DataFrame = {
      val spark = SparkSession.builder().appName("rdd2DF")
           .config("hive.exec.dynamic.partition","true")
           .config("hive.exec.dynamic.partition.mode","nonstrict")
           .enableHiveSupport().getOrCreate()
      import spark.implicits._
      rdd.map{x =>
        // x is a JSON string: parse it into an Orders object, then build the case class order from the fields we need
        val mess = JSON.parseObject(x,classOf[Orders])
        order(mess.order_id,mess.user_id)
        }.toDF()
    }

    // Core DStream logic: convert each RDD in the DStream to a DataFrame,
    // then append the data to the Hive partitioned table as (order_id, user_id, dt)
    lines.foreachRDD { rdd =>
      val df = rdd2DF(rdd)
      df.withColumn("dt", lit(dt.toString))
        .write.mode(SaveMode.Append)
        .insertInto("mydb.order_partition")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}

Submit the Spark Streaming application in yarn-client mode

After the Spark Streaming application is submitted and started, it waits for the Python script to be run manually to produce real-time data. Note that the Hive configuration file and the MySQL connector jar must be included when submitting.

[root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --class com.spark_streaming.ReceiverTest --master yarn-client --files $HIVE_HOME/conf/hive-site.xml --jars /home/boya/mysql-connector-java-5.1.44-bin.jar /home/boya/boya-1.0-SNAPSHOT.jar

Run the Python script to simulate the input log file

When this script runs, it reads data from orders.csv and writes it to flume_exec_test.txt. Flume monitors that txt file (its sink is already configured with the Kafka connection and the mytest topic) and, acting as the producer, ships each newly appended record into the mytest topic in Kafka. On the other side, the running Spark Streaming application acts as the consumer, pulling the data from Kafka, processing it, and finally writing it into the Hive partitioned table.

[root@master boya]# cat flume_data_write.py 
# -*- coding: utf-8 -*-
import random
import time
import pandas as pd
import json

writeFileName="./flume_exec_test.txt"
cols = ["order_id","user_id","eval_set","order_number","order_dow","hour","day"] 
df1 = pd.read_csv('./orders.csv')
df1.columns = cols
df = df1.fillna(0)
with open(writeFileName, 'a+') as wf:
    for idx, row in df.iterrows():
        d = {}
        for col in cols:
            d[col] = row[col]
        js = json.dumps(d)
        wf.write(js + '\n')
        # rand_num = random.random()
        # time.sleep(rand_num)

Because the dataset is quite large, I manually killed the application after it had run for a while, and then checked whether the Hive table contained data.

Check the data in the Hive table

hive (mydb)> select * from order_partition limit 20;
OK
2539329	1	20190714
2398795	1	20190714
473747	1	20190714
2254736	1	20190714
431534	1	20190714
3367565	1	20190714
550135	1	20190714
3108588	1	20190714
2295261	1	20190714
2550362	1	20190714
1187899	1	20190714
2168274	2	20190714
1501582	2	20190714
1901567	2	20190714
738281	2	20190714
1673511	2	20190714
1199898	2	20190714
3194192	2	20190714
788338	2	20190714
1718559	2	20190714
Time taken: 2.595 seconds, Fetched: 20 row(s)
hive (mydb)>

View the Hive table directory structure on HDFS

The Hive table is an internal (managed) partitioned table with dt as the partition column; the program sets dt = "20190714" and dynamic partitioning is enabled. A Spark-side readback sketch is shown after the listing below.

[root@master ~]# hdfs dfs -ls /hive/warehouse/mydb.db/order_partition/dt=20190714
Found 19 items
-rwxr-xr-x   1 root supergroup        986 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000
-rwxr-xr-x   1 root supergroup       1078 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_1
-rwxr-xr-x   1 root supergroup       1071 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_2
-rwxr-xr-x   1 root supergroup       3209 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_3
-rwxr-xr-x   1 root supergroup       1070 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_4
-rwxr-xr-x   1 root supergroup       7485 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_5
-rwxr-xr-x   1 root supergroup       1075 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_6
-rwxr-xr-x   1 root supergroup    1227022 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_7
-rwxr-xr-x   1 root supergroup       2513 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_8
-rwxr-xr-x   1 root supergroup      45638 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_9
-rwxr-xr-x   1 root supergroup       8167 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00001
-rwxr-xr-x   1 root supergroup       3799 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00001_copy_1
-rwxr-xr-x   1 root supergroup      38048 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00001_copy_2
-rwxr-xr-x   1 root supergroup       2343 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00002
-rwxr-xr-x   1 root supergroup      36778 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00002_copy_1
-rwxr-xr-x   1 root supergroup        152 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00002_copy_2
-rwxr-xr-x   1 root supergroup       2331 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00003
-rwxr-xr-x   1 root supergroup       5847 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00004
-rwxr-xr-x   1 root supergroup       5820 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00005
[root@master ~]#
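
Besides the hive CLI and HDFS checks above, the written partition can also be read back through a SparkSession with Hive support enabled. A minimal sketch (assuming the same database and table names; the object name CheckPartition is only for illustration):

package com.spark_streaming

import org.apache.spark.sql.SparkSession

object CheckPartition {
  def main(args: Array[String]): Unit = {
    // Hive support is required so that spark.sql can see mydb.order_partition
    val spark = SparkSession.builder()
      .appName("CheckPartition")
      .enableHiveSupport()
      .getOrCreate()

    // List the table's partitions and read back the one written by the streaming job
    spark.sql("show partitions mydb.order_partition").show(false)
    spark.sql("select * from mydb.order_partition where dt = '20190714' limit 20").show(false)

    spark.stop()
  }
}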