Pipeline Overview
Flume acts as the producer, monitoring a txt file that a Python script keeps appending to. The Flume sink points at a Kafka topic, so whenever new data lands in the txt file, Flume collects it and pushes it to that topic. A Spark Streaming application then consumes the topic, processes the records pulled from Kafka, and writes the results into a Hive table.
Environment Preparation
1. Start the ZooKeeper cluster
2. Start Kafka (master)
3. Start Flume (master)
4. Start the Hadoop cluster
5. Start MySQL (master)
6. Start Hive (master)
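For reference, a rough sketch of the startup commands is shown below; the exact scripts and paths depend on how each component was installed, so treat these as assumptions rather than the exact commands used here (Flume itself is started later with the config file from the next section).
# on each ZooKeeper node
zkServer.sh start
# on master, from the Kafka install directory
bin/kafka-server-start.sh -daemon config/server.properties
# HDFS + YARN
start-dfs.sh
start-yarn.sh
# MySQL (backs the Hive metastore)
systemctl start mysqld
# Hive metastore service
nohup hive --service metastore > metastore.log 2>&1 &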
Create a Hive Partitioned Table
hive (mydb)> create table order_partition(order_id string,user_id string)
> partitioned by(dt string);
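You can confirm the table definition and the dt partition column with a quick describe (output omitted):
hive (mydb)> desc order_partition;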
Flume Configuration File
[root@master conf]# vi flume_kafka.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# Monitor /home/boya/flume_exec_test.txt
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/boya/flume_exec_test.txt
#a1.sinks.k1.type = logger
# Set up the Kafka sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka broker address and port
a1.sinks.k1.brokerList=master:9092
# Kafka topic
a1.sinks.k1.topic=mytest
# Serialization class
a1.sinks.k1.serializer.class=kafka.serializer.StringEncoder
# use a channel which buffers events in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
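With the config in place, the topic has to exist and the agent has to be running. The commands below are a hedged sketch: the topic name matches the config, but the partition/replication settings and the install paths are assumptions, and newer Kafka tooling takes --bootstrap-server master:9092 instead of --zookeeper.
bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 3 --topic mytest
bin/flume-ng agent -n a1 -c conf -f conf/flume_kafka.conf -Dflume.root.logger=INFO,console
# optional sanity check that messages actually reach the topic
bin/kafka-console-consumer.sh --zookeeper master:2181 --topic mytest --from-beginning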
JSON Data Source
The data written to the txt file is in JSON format, for example:
{"order_id": 2539329, "user_id": 1, "eval_set": "prior", "order_number": 1, "order_dow": 2, "hour": 8, "day": 0.0}
{"order_id": 2398795, "user_id": 1, "eval_set": "prior", "order_number": 2, "order_dow": 3, "hour": 7, "day": 15.0}
{"order_id": 473747, "user_id": 1, "eval_set": "prior", "order_number": 3, "order_dow": 3, "hour": 12, "day": 21.0}
...
Create an Orders Class Matching the JSON Data
It is used to parse each JSON record into an Orders object; note that the fields of Orders must correspond to the keys in the JSON data.
package com.spark_streaming;

// POJO matching the keys of the JSON records; fastjson fills these fields when parsing each line.
public class Orders {
    public String order_id;
    public String user_id;
    public String eval_set;
    public String order_number;
    public String order_dow;
    public String hour;
    public String day;

    public String getOrder_id() {
        return order_id;
    }
    public void setOrder_id(String order_id) {
        this.order_id = order_id;
    }
    public String getUser_id() {
        return user_id;
    }
    public void setUser_id(String user_id) {
        this.user_id = user_id;
    }
    public String getEval_set() {
        return eval_set;
    }
    public void setEval_set(String eval_set) {
        this.eval_set = eval_set;
    }
    public String getOrder_number() {
        return order_number;
    }
    public void setOrder_number(String order_number) {
        this.order_number = order_number;
    }
    public String getOrder_dow() {
        return order_dow;
    }
    public void setOrder_dow(String order_dow) {
        this.order_dow = order_dow;
    }
    public String getHour() {
        return hour;
    }
    public void setHour(String hour) {
        this.hour = hour;
    }
    public String getDay() {
        return day;
    }
    public void setDay(String day) {
        this.day = day;
    }
}
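As a quick sanity check of the mapping, the snippet below is a minimal sketch (assuming fastjson is on the classpath; fastjson is lenient enough to coerce the numeric JSON values into the String fields):
import com.alibaba.fastjson.JSON

val line = """{"order_id": 2539329, "user_id": 1, "eval_set": "prior", "order_number": 1, "order_dow": 2, "hour": 8, "day": 0.0}"""
val o = JSON.parseObject(line, classOf[Orders])
println(o.getOrder_id + " " + o.getUser_id)   // 2539329 1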
Consuming Kafka with Spark Streaming (Receiver Approach)
Spark Streaming consumes the Kafka data using the Receiver approach and appends the consumed records into the order_partition table in Hive.
The code roughly works as follows:
- Set the ZooKeeper address, Kafka topic, batch interval, consumer group name, and numThreads.
- Create a DStream that consumes from Kafka. The pulled records are JSON, and the DStream receives (header, body) pairs; we only keep the body.
- Define a method that converts an RDD into a DataFrame. It obtains a SparkSession with Hive support enabled (the equivalent of a HiveContext) and with dynamic partitioning configured for the Hive table. For each String record in the RDD it parses the JSON into an Orders object, keeps only the fields we want to insert into Hive by building a case class order instance from them, and finally converts the whole RDD into a DataFrame, which is now effectively a two-column table.
- For every RDD in the DStream, convert it to a DataFrame, add a "dt" column whose value is the date, and append it into the Hive table.
package com.spark_streaming
import com.alibaba.fastjson.JSON
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
object ReceiverTest {

  case class order(order_id: String, user_id: String)

  def main(args: Array[String]): Unit = {
    /**
      * group_id : consumer group
      * topic    : Kafka topic(s) to consume
      * exectime : batch interval for Spark Streaming
      * zkQuorum : ZooKeeper quorum address
      */
    val dt = "20190714"
    val Array(group_id, topic, exectime, zkQuorum) = Array("group_mytest", "mytest", "2", "192.168.230.10:2181")
    val sparkConf = new SparkConf()
      //.setMaster("local[2]")
      .setAppName("Receiver Test")
    val ssc = new StreamingContext(sparkConf, Seconds(exectime.toInt))
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    // In production there is usually more than one topic to consume
    val topicSet = topic.split(",").toSet
    val numThreads = 1
    val topicMap = topicSet.map((_, numThreads.toInt)).toMap

    // lines receives (null, e), i.e. (header, body) pairs; we only want the body
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group_id, topicMap).map(_._2)
    //lines.map((_, 1L)).reduceByKey(_ + _).print()

    // Convert an RDD of JSON strings into a DataFrame
    def rdd2DF(rdd: RDD[String]): DataFrame = {
      val spark = SparkSession.builder().appName("rdd2DF")
        .config("hive.exec.dynamic.partition", "true")
        .config("hive.exec.dynamic.partition.mode", "nonstrict")
        .enableHiveSupport().getOrCreate()
      import spark.implicits._
      rdd.map { x =>
        // x is a JSON string: parse it into an Orders object, then keep only the fields we need as a case class order
        val mess = JSON.parseObject(x, classOf[Orders])
        order(mess.order_id, mess.user_id)
      }.toDF()
    }

    // Core DStream logic: convert each RDD into a DataFrame,
    // then append it to the Hive partitioned table as (order_id, user_id, dt)
    lines.foreachRDD { rdd =>
      val df = rdd2DF(rdd)
      df.withColumn("dt", lit(dt))
        .write.mode(SaveMode.Append)
        .insertInto("mydb.order_partition")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
Submitting the Spark Streaming Application in yarn-client Mode
After the Spark Streaming application is submitted and started, it waits for the Python script to be run manually to produce live data. Note that the Hive configuration file and the MySQL connector jar must be shipped along with the submission.
[root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --class com.spark_streaming.ReceiverTest --master yarn-client --files $HIVE_HOME/conf/hive-site.xml --jars /home/boya/mysql-connector-java-5.1.44-bin.jar /home/boya/boya-1.0-SNAPSHOT.jar
Run the Python Script to Simulate the Input Log File
The script reads records from orders.csv and writes them into flume_exec_test.txt. Flume monitors that file (its sink is already configured with the Kafka connection and the mytest topic), so it acts as the producer and pushes every newly appended line into the mytest topic. On the other side, the running Spark Streaming application acts as the consumer, pulls the data from Kafka, and finally writes it into the Hive partitioned table.
[root@master boya]# cat flume_data_write.py
# -*- coding: utf-8 -*-
import random
import time
import pandas as pd
import json
writeFileName = "./flume_exec_test.txt"
cols = ["order_id", "user_id", "eval_set", "order_number", "order_dow", "hour", "day"]
# read the source data and replace missing values with 0
df1 = pd.read_csv('./orders.csv')
df1.columns = cols
df = df1.fillna(0)
# append every row to the monitored file as one JSON object per line
with open(writeFileName, 'a+') as wf:
    for idx, row in df.iterrows():
        d = {}
        for col in cols:
            d[col] = row[col]
        js = json.dumps(d)
        wf.write(js + '\n')
        # rand_num = random.random()
        # time.sleep(rand_num)
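The script is then run manually (shown here with the system Python; adjust the interpreter if pandas lives elsewhere):
[root@master boya]# python flume_data_write.py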
Because there is a lot of data, I manually killed the application after letting it run for a while, and then checked whether the Hive table had received data.
Check the Data Received in the Hive Table
hive (mydb)> select * from order_partition limit 20;
OK
2539329 1 20190714
2398795 1 20190714
473747 1 20190714
2254736 1 20190714
431534 1 20190714
3367565 1 20190714
550135 1 20190714
3108588 1 20190714
2295261 1 20190714
2550362 1 20190714
1187899 1 20190714
2168274 2 20190714
1501582 2 20190714
1901567 2 20190714
738281 2 20190714
1673511 2 20190714
1199898 2 20190714
3194192 2 20190714
788338 2 20190714
1718559 2 20190714
Time taken: 2.595 seconds, Fetched: 20 row(s)
hive (mydb)>
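To double-check how much data landed in each partition, a simple aggregation works too (a hedged example; the counts depend on how long the job was allowed to run):
hive (mydb)> select dt, count(*) from order_partition group by dt;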
Check the Hive Table Directory Structure on HDFS
The Hive table is an internal (managed) partitioned table with dt as the partition column; the program sets dt = "20190714" and dynamic partitioning is enabled.
[root@master ~]# hdfs dfs -ls /hive/warehouse/mydb.db/order_partition/dt=20190714
Found 19 items
-rwxr-xr-x 1 root supergroup 986 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000
-rwxr-xr-x 1 root supergroup 1078 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_1
-rwxr-xr-x 1 root supergroup 1071 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_2
-rwxr-xr-x 1 root supergroup 3209 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_3
-rwxr-xr-x 1 root supergroup 1070 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_4
-rwxr-xr-x 1 root supergroup 7485 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_5
-rwxr-xr-x 1 root supergroup 1075 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_6
-rwxr-xr-x 1 root supergroup 1227022 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_7
-rwxr-xr-x 1 root supergroup 2513 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_8
-rwxr-xr-x 1 root supergroup 45638 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00000_copy_9
-rwxr-xr-x 1 root supergroup 8167 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00001
-rwxr-xr-x 1 root supergroup 3799 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00001_copy_1
-rwxr-xr-x 1 root supergroup 38048 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00001_copy_2
-rwxr-xr-x 1 root supergroup 2343 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00002
-rwxr-xr-x 1 root supergroup 36778 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00002_copy_1
-rwxr-xr-x 1 root supergroup 152 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00002_copy_2
-rwxr-xr-x 1 root supergroup 2331 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00003
-rwxr-xr-x 1 root supergroup 5847 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00004
-rwxr-xr-x 1 root supergroup 5820 2019-07-14 15:46 /hive/warehouse/mydb.db/order_partition/dt=20190714/part-00005
[root@master ~]#