Chapter 1: Review of the previous lesson
Chapter 2: Using mapWithState
- 2.1 Writing data out to external storage systems
- 2.2 Design patterns for using foreachRDD
- 2.3 A foreachRDD example
- 2.4 A ConnectionPool example
Chapter 3: Window operations (good to know)
Chapter 4: The transform operation (important)
Chapter 1: Review of the previous lesson
First of all, Spark Streaming is an extension of Spark Core whose purpose is to process real-time data.
1. Spark: batch-first; it handles streaming data as micro-batches.
2. Flink: stream-first; it handles batch data as a special case of streaming.
Spark Streaming no longer receives new features; Structured Streaming is the newer API, and its programming model resembles DataFrames/Datasets.
Data comes in from external sources: pay attention to whether a Receiver is involved (the difference between local[1] and local[2]); once connected, a source becomes an InputDStream.
Whether a source uses a Receiver can be checked by opening its source code.
Source --> DStream (a sequence of RDDs) --> Transformation --> Output.
Applying any transformation to a DStream really means applying the same operator to every RDD in that sequence.
The programming entry point is the StreamingContext.
The number of cores must be greater than the number of Receivers, otherwise the received data cannot be processed downstream.
To accumulate results from some starting point up to the current batch, use updateStateByKey; it requires a checkpoint directory, which produces lots of small files. The fix is to write the state to an external database instead.
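For reference, a minimal updateStateByKey sketch looks roughly like this (the app name is arbitrary, the imports are the same as in the mapWithState example below, and the socket source matches the one used throughout these notes):

// Running word count with updateStateByKey; the checkpoint directory is mandatory
// and is where the small files mentioned above accumulate.
val sparkConf = new SparkConf().setAppName("UpdateStateApp").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(".")

val updateFunc = (newValues: Seq[Int], running: Option[Int]) =>
  Some(newValues.sum + running.getOrElse(0))

ssc.socketTextStream("hadoop002", 8888)
  .flatMap(_.split("\t")).map((_, 1))
  .updateStateByKey(updateFunc)
  .print()

ssc.start()
ssc.awaitTermination()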
Chapter 2: Using mapWithState
updateStateByKey is the older API; in newer versions mapWithState is recommended instead.
Note: it is marked as an experimental operator.
package SparkStreaming02

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TestApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint(".")   // mapWithState also needs a checkpoint directory

    val lines = ssc.socketTextStream("hadoop002", 8888)
    val result = lines.flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)

    // For each key, add this batch's count to the stored state and emit (word, runningSum)
    val mappingFunc = (word: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val state = result.mapWithState(StateSpec.function(mappingFunc))
    state.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Testing: on hadoop002, run nc -lk 8888 and type some words.
- The difference from updateStateByKey shows up in how the jobs are displayed in the Spark UI.
In production, the processed results are always written out to an RDBMS or a NoSQL store.
2.1 Writing data out to external storage systems
- Output operations allow a DStream's data to be pushed out to external systems like a database or a file system. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
Translation: output operations let you push a DStream's data to external storage systems; the available operators are listed below.
The following three operators are rarely used and are not recommended in production:
- saveAsTextFiles(prefix,suffix)
- saveAsObjectFiles(prefix,suffix)
- saveAsHadoopFiles(prefix,suffix)
foreachRDD(func):
Concept:
- The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
Write the processed results to MySQL: start MySQL in the VM, switch to the g6 database, and create the table:
1. Create the wc table in the g6 database:
create table wc(
  word varchar(20) default null,
  cnt int(10)
);
The foreachRDD method description (from the Scaladoc):
1. Apply a function to each RDD in this DStream. This is an output operator, so
'this' DStream will be registered as an output stream and therefore materialized.
- Translation: apply a function to every RDD in this DStream; since it is an output operator, this DStream is registered as an output stream and is therefore materialized (actually computed).
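For reference, DStream exposes two overloads of foreachRDD (signatures paraphrased from the Spark Streaming API; the second one, which also hands you the batch Time, is the one used in the foreachPartition example later in these notes):

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit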
The first attempt throws an error saying the object must be serializable, which leads us to the design patterns for foreachRDD:
2.2 Design patterns for using foreachRDD
- dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.
Common mistakes to avoid:
- Often writing data to an external system requires creating a connection object (e.g. a TCP connection to a remote server) and using it to send data to the remote system. For this purpose, a developer may inadvertently try creating the connection object at the Spark driver, and then try to use it in a Spark worker to save records in the RDDs.
The following code is an example of what not to do:
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record)               // executed at the worker
  }
}
- This is incorrect, as this requires the connection object to be serialized and sent from the driver to the worker. Such connection objects are rarely transferable across machines. This error may manifest as serialization errors (connection object not serializable), initialization errors (connection object needs to be initialized at the workers), etc. The correct solution is to create the connection object at the worker.
- However, this can lead to another common mistake: creating a new connection for every record.
2.3 A foreachRDD example
Implementing foreachRDD in IDEA:
package SparkStreaming02

import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TestApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint(".")

    val lines = ssc.socketTextStream("hadoop002", 8888)
    val result = lines.flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)

    val mappingFunc = (word: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val state = result.mapWithState(StateSpec.function(mappingFunc))
    // state.print()

    state.foreachRDD(rdd => {
      // The connection is created here at the driver but captured by the closure
      // passed to rdd.foreach, which runs on the workers -- this is exactly the
      // mistake described above and fails with a "not serializable" error.
      val connection = getConnection()
      rdd.foreach(kv => {
        val sql = s"insert into wc(word,cnt) values ('${kv._1}','${kv._2}')"
        connection.createStatement().execute(sql)
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }

  def getConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://hadoop002:3306/g6", "root", "960210")
  }
}
A better approach is to use foreachPartition:
Idea: foreachRDD --> foreachPartition --> foreach
package SparkStreaming02

import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object Test extends Logging {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TestApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint(".")

    val lines = ssc.socketTextStream("hadoop002", 8888)
    val result = lines.flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)

    val mappingFunc = (word: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val state = result.mapWithState(StateSpec.function(mappingFunc))
    // state.print()

    state.foreachRDD((rdd, time) => {
      rdd.foreachPartition(partitionOfRecords => {
        // Caution: .size consumes the iterator -- see the note after this listing
        if (partitionOfRecords.size > 0) {
          val connection = getConnection()   // one connection per partition, created at the worker
          logError("--------")
          partitionOfRecords.foreach(kv => {
            val sql = s"insert into wc(word,cnt) values ('${kv._1}','${kv._2}')"
            connection.createStatement().execute(sql)
          })
          connection.close()
        }
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }

  def getConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://hadoop002:3306/g6", "root", "960210")
  }
}
This code still has a problem: the size check. Calling .size on the partition iterator consumes it, so the partitionOfRecords.foreach that follows sees no records and nothing is written to MySQL. A minimal fix is sketched below.
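A sketch of the fix, assuming the same getConnection() helper as above: test the iterator with hasNext, which does not consume it.

rdd.foreachPartition(partitionOfRecords => {
  if (partitionOfRecords.hasNext) {    // hasNext does not consume the iterator
    val connection = getConnection()
    partitionOfRecords.foreach(kv => {
      val sql = s"insert into wc(word,cnt) values ('${kv._1}','${kv._2}')"
      connection.createStatement().execute(sql)
    })
    connection.close()
  }
})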
As for the earlier serialization error, the fix shown in the docs is to move the connection creation to the worker, inside rdd.foreach:
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()  // the connection is moved inside, so it is created at the worker
    connection.send(record)
    connection.close()
  }
}
To print log output at the error level, extend Logging:
object Test extends Logging
By contrast, the docs note that the foreachPartition approach amortizes the connection creation overheads over many records.
Testing:
After modifying the code, start nc -lk 8888 and test again.
With the per-record version, every single record opens its own connection, which performs very poorly.
Exercise: implement it yourself as an upsert on (k, v, time) -- a rough sketch follows.
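A rough sketch of such an upsert statement, built from Scala. It assumes a unique key on word and an extra time column, neither of which exists in the wc table created above (both the column name and the key are hypothetical):

// Hypothetical: requires something like
//   alter table wc add unique key uk_word(word), add column updated_at datetime;
def upsertSql(word: String, cnt: Int): String =
  s"""insert into wc(word, cnt, updated_at) values ('$word', $cnt, now())
     |on duplicate key update cnt = values(cnt), updated_at = now()""".stripMargin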
2.4 A ConnectionPool example:
An even better approach is to use a ConnectionPool:
- Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of multiple batches are pushed to the external system, thus further reducing the overheads.
- Translation: maintain a static pool of connection objects that is reused as the RDDs of multiple batches are pushed to the external system, which further reduces the overhead.
Coding it in IDEA (a sketch of the idea follows):
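The notes do not include the pool code itself, so this is only a minimal sketch of the idea: a private pool of connections that exposes nothing but getConnection and returnConnection (the object and method names here are illustrative, not taken from the original class code).

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

object ConnectionPool {
  // Private pool: callers can only get a connection and return it afterwards.
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else {
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection("jdbc:mysql://hadoop002:3306/g6", "root", "960210")
    }
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}

Used inside the streaming job, one connection is borrowed per partition and returned for reuse by later batches:

state.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    if (partitionOfRecords.hasNext) {
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(kv =>
        connection.createStatement()
          .execute(s"insert into wc(word,cnt) values ('${kv._1}','${kv._2}')"))
      ConnectionPool.returnConnection(connection)
    }
  })
})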
The code can still be improved: keep the pool private and expose only two methods to the outside, get and return.
One connection is used per partition. If the number of partitions is much larger than the pool, callers either have to wait or the pool has to grab more resources and grow.
The one and only correct route for writing from Spark Streaming to a database:
- foreachRDD ==> foreachPartition ==> foreach
How do you write to HBase, MongoDB, or Redis?
- Learning never ends; the point is to master this as a general pattern (a generic skeleton is sketched below).
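A generic skeleton of that pattern, where openClient, writeRecord, and closeClient are hypothetical placeholders for whichever store's client you use (HBase, MongoDB, Redis, ...):

// openClient / writeRecord / closeClient are hypothetical placeholders, not real APIs.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    if (partition.hasNext) {
      val client = openClient()                             // or borrow from a pool
      partition.foreach(record => writeRecord(client, record))
      closeClient(client)                                   // or return it to the pool
    }
  }
}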
Chapter 3: Window operations (good to know)
Concept:
- Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The following figure illustrates this sliding window.
With batches time1 time2 time3 time4 time5, one batch per second, the first window covers time1 through time3 and the second window covers time3 through time5.
Two parameters (see the sketch below):
1. Window length: the duration of the window.
2. Sliding interval: the interval at which the window operation is performed.
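For the example above (1-second batches, a window of 3 batches sliding by 2), a minimal sketch using reduceByKeyAndWindow; here pairs is assumed to be a DStream[(String, Int)] of (word, 1) tuples, as in the word-count examples earlier:

// Count words over the last 3 seconds of data, every 2 seconds.
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(3), Seconds(2))
windowedWordCounts.print()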
Chapter 4: The transform operation (important)
transform(func):
- Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
Suppose we have a DStream plus a plain text dataset, and we need to join the DStream against that text data.
How do you mix a DStream with an ordinary RDD?
val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(....)  // RDD containing spam information

val cleanDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter(....)  // join data stream with spam information to do data cleaning
  .......
}
Application: blacklists ==> double writes. For example, we are processing a batch of logs and a new business line goes live on 20% of traffic; how do we keep that data apart from the existing data?
From the hadoop002 console, run nc -lk 8888 and type some input: ruoze jepson 17er
From then on, no matter how often 17er is entered, it is filtered out of the printed output.
package SparkStreaming02

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TestApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // TODO: business logic
    val lines = ssc.socketTextStream("hadoop002", 8888)

    val blacks = List("17er")
    val blackRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))

    /* Console input such as laoer,3,2 --> (name, age, gender)
     * ==>
     * (laoer, <laoer,3,2>)
     */
    val result = lines.map(x => (x.split(",")(0), x))
      .transform(rdd => {
        rdd.leftOuterJoin(blackRDD)
          .filter(x => x._2._2.getOrElse(false) != true)  // drop records whose name is on the blacklist
          .map(x => x._2._1)                              // laoer,3,2
      })
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
This is not an ideal approach, though; go to the Spark UI and inspect the DAG visualization:
What you need to master here is how to take the RDDs inside a DStream and operate on them together with a normal RDD.