Table of Contents
- 1. What is an RDD
- 2. Initialization
- 3. Partitions
  - mapPartitionsWithIndex [key point]
  - Viewing partitions [key point]
  - Default partitioning rule of makeRDD
- 4. Common RDD operators
- 5. RDD serialization
  - Serialization example code
  - The Kryo serialization framework
- 6. RDD persistence
- 7. RDD lineage
  - Narrow dependency
  - Wide dependency
  - How Jobs and Stages are divided
  - Code example
1. What is an RDD
An RDD (Resilient Distributed Dataset) is Spark's basic data abstraction: an immutable, partitioned collection of elements that can be processed in parallel across the cluster.
2. Initialization
Dependencies
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>3.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.0.2</version>
</dependency>
Import packages and create the Spark objects
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// Create a SparkConf object and set the configuration
val conf = new SparkConf().setAppName("application name").setMaster("local")
// Create a SparkContext; Spark accesses the cluster through this object
val sc = new SparkContext(conf)
| setMaster value | Description | Cores |
| --------------- | ----------- | ----- |
| local | local mode | a single core |
| local[4] | local mode | 4 cores |
| local[*] | local mode | as many cores as the local machine has |
| yarn | let YARN manage the resources | |
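A minimal sketch of switching between these settings (the app name is arbitrary; in practice the master is usually supplied via spark-submit --master rather than hard-coded):

import org.apache.spark.{SparkConf, SparkContext}

// Use all local cores; replace "local[*]" with "local", "local[4]" or "yarn" as needed
val conf = new SparkConf().setAppName("master-demo").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sc.defaultParallelism) // with local[*], typically the number of local cores
sc.stop()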
3. Partitions
mapPartitionsWithIndex [key point]
val conf = new SparkConf().setAppName("app").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd0: RDD[Int] = sc.makeRDD(Range(0, 4), numSlices = 3)
val rdd1: RDD[(String, Int)] = rdd0.mapPartitionsWithIndex(
  (index, items) => items.map(("partition" + index, _))
)
rdd1.collect.foreach(println)
(partition0,0)
(partition1,1)
(partition2,2)
(partition2,3)
Viewing partitions [key point]
val rdd = sc.makeRDD(Range(0, 4), numSlices = 5)
// Number of partitions; in the source this is partitions.length
println("Number of partitions: " + rdd.getNumPartitions)
Number of partitions: 5
// View the elements of each partition
rdd.mapPartitionsWithIndex(
  (pId, iter) => {
    println("Partition " + pId + " elements: " + iter.toList)
    iter
  }
).collect
Partition 0 elements: List()
Partition 2 elements: List(1)
Partition 3 elements: List(2)
Partition 4 elements: List(3)
Partition 1 elements: List(0)
// Repartitioning: repartition, coalesce
rdd.coalesce(3).mapPartitionsWithIndex(
  (pId, iter) => {
    println("Partition " + pId + " elements: " + iter.toList)
    iter
  }
).collect
Partition 0 elements: List()
Partition 2 elements: List(2, 3)
Partition 1 elements: List(0, 1)
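A minimal sketch of the difference between the two, based on the standard RDD API: coalesce does not shuffle by default and can therefore only reduce the number of partitions, while repartition always shuffles (it is coalesce with shuffle = true) and can also increase it.

// coalesce without a shuffle can only merge existing partitions
val fewer = rdd.coalesce(2)                 // 5 -> 2 partitions, no shuffle
// increasing the partition count requires a shuffle
val more1 = rdd.repartition(8)              // 5 -> 8 partitions, with shuffle
val more2 = rdd.coalesce(8, shuffle = true) // equivalent to repartition(8)
println(s"${fewer.getNumPartitions} ${more1.getNumPartitions} ${more2.getNumPartitions}") // 2 8 8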
Default partitioning rule of makeRDD
The source is the positions function inside org.apache.spark.rdd.ParallelCollectionRDD.slice:
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
}
Translating this into the following Python code makes the logic easy to see:
def position(num_slices):
    for i in range(num_slices):
        start = i / num_slices
        end = (i + 1) / num_slices
        yield start, end

def slice_up(ls, num_slices):
    l = len(ls)
    for start, end in position(num_slices):
        partition = ls[int(start * l): int(end * l)]
        print(start, end, partition)

slice_up([1, 2, 3, 4, 5], 4)
0.0 0.25 [1]
0.25 0.5 [2]
0.5 0.75 [3]
0.75 1.0 [4, 5]
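Applying the same rule to the earlier example, makeRDD(Range(0, 4), numSlices = 3) has length = 4 and numSlices = 3, so positions yields (0,1), (1,2), (2,4), i.e. the slices [0], [1], [2, 3], exactly the partitions printed by mapPartitionsWithIndex above. A quick way to confirm this is glom, a standard RDD method that collects each partition into an array:

sc.makeRDD(Range(0, 4), numSlices = 3).glom().collect.foreach(a => println(a.toList))
// List(0)
// List(1)
// List(2, 3)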
4. Common RDD operators
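A minimal sketch of a few frequently used transformations (map, filter, flatMap, reduceByKey) and actions (collect, count); the sample data is made up for illustration:

val words = sc.makeRDD(Seq("a b", "b c", "c c"))
// Transformations are lazy: they only describe the computation
val counts = words
  .flatMap(_.split(" "))    // split each line into words
  .map((_, 1))              // pair each word with 1
  .reduceByKey(_ + _)       // sum the 1s per word (causes a shuffle)
  .filter(_._2 > 1)         // keep words that appear more than once
// Actions trigger the actual computation
println(counts.collect.toList) // e.g. List((b,2), (c,3))
println(words.count())         // 3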
5. RDD serialization
- Serialization: the process of converting an object's state into a form that can be stored or transmitted
- Serialization is required for cross-process communication
- Before a job is submitted (see runJob in the source), the line val cleanedFunc = clean(func) checks whether the function can be serialized; this is called the closure check
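A minimal sketch of what the closure check catches, assuming an existing SparkContext sc as in the earlier examples (the Search class and its field are made up for illustration): the lambda passed to filter reads the keyword field, so it captures the enclosing Search instance, and the check requires that instance to be serializable.

// If `extends Serializable` is removed, the closure check fails with
// "org.apache.spark.SparkException: Task not serializable"
class Search(val keyword: String) extends Serializable {
  def run(sc: SparkContext): Array[String] =
    sc.makeRDD(Seq("spark core", "spark sql", "hadoop")).filter(_.contains(keyword)).collect()
}

println(new Search("spark").run(sc).toList) // List(spark core, spark sql)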
Serialization example code
Example below; the key word is Serializable
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD

object Hello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val r: RDD[Hero] = sc.makeRDD(Seq(new Hero("剑圣"), new Hero("先知")))
    r.foreach(print) // Prints: Hero(先知)Hero(剑圣)
  }
}

class Hero(var name: String) extends Serializable {
  override def toString: String = s"Hero($name)"
}
The Kryo serialization framework
Spark also supports another serialization mechanism, Kryo. It is much faster, but it cannot be used for everything.
import org.apache.spark.{SparkConf, SparkContext}

object Hello {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf()
      .setAppName("a1")
      .setMaster("local[*]")
      // Replace the default serializer
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the custom classes that should be serialized with Kryo
      .registerKryoClasses(Array(classOf[MyFilter]))
    val sc = new SparkContext(conf)
  }
}

case class MyFilter(name: String) {}
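A minimal usage sketch under the configuration above: Kryo is applied when data is shuffled or stored in serialized form, for example when caching with a serialized storage level.

import org.apache.spark.storage.StorageLevel

val r = sc.makeRDD(Seq(MyFilter("a"), MyFilter("b")))
  .persist(StorageLevel.MEMORY_ONLY_SER) // the serialized cache goes through Kryo
println(r.map(_.name).collect.toList)    // List(a, b)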
6. RDD persistence
An RDD can cache intermediate results with the cache method (which calls persist under the hood); by default the data is kept in memory.
import org.apache.spark.{SparkContext, SparkConf}

object Hello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("a1").setMaster("local")
    val sc = new SparkContext(conf)
    val r = sc.makeRDD(Seq("a", "b"), 2).map((_, System.currentTimeMillis()))
    // View the original lineage
    println(r.toDebugString)
    // Cache the data (the cache operation is added to the lineage)
    r.cache()
    // Print, and note the timestamps
    r.foreach(print)
    println("\n-------------------------------------------------")
    // View the updated lineage
    println(r.toDebugString)
    // Print again, and note the timestamps
    r.foreach(print)
  }
}
The second print shows the same timestamps as the first, because the cached values are reused instead of being recomputed. For the available cache levels, see the StorageLevel source code.
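A minimal sketch of choosing a specific storage level with persist instead of cache (cache is equivalent to persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

val cached = sc.makeRDD(Seq("a", "b"), 2)
  .map((_, System.currentTimeMillis()))
  .persist(StorageLevel.MEMORY_AND_DISK) // spill to disk when memory is insufficient
cached.count()     // the first action materializes and caches the data
cached.unpersist() // release the cached data when it is no longer needed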
7. RDD lineage
toDebugString: view the full RDD chain
import org.apache.spark.{SparkContext, SparkConf}

object Hello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app").setMaster("local")
    val sc = new SparkContext(conf)
    val r0 = sc.makeRDD(Seq("a", "b", "b"))
    println(r0.toDebugString)
    println("--------------------------------------------------------------------")
    val r1 = r0.flatMap(_.split(" "))
    println(r1.toDebugString)
    println("--------------------------------------------------------------------")
    val r2 = r1.map((_, 1))
    println(r2.toDebugString)
    println("--------------------------------------------------------------------")
    val r3 = r2.reduceByKey(_ + _)
    println(r3.toDebugString)
  }
}
/*
Output
(1) ParallelCollectionRDD[0] at makeRDD at Hello.scala:7 []
--------------------------------------------------------------------
(1) MapPartitionsRDD[1] at flatMap at Hello.scala:10 []
 |  ParallelCollectionRDD[0] at makeRDD at Hello.scala:7 []
--------------------------------------------------------------------
(1) MapPartitionsRDD[2] at map at Hello.scala:13 []
 |  MapPartitionsRDD[1] at flatMap at Hello.scala:10 []
 |  ParallelCollectionRDD[0] at makeRDD at Hello.scala:7 []
--------------------------------------------------------------------
(1) ShuffledRDD[3] at reduceByKey at Hello.scala:16 []
 +-(1) MapPartitionsRDD[2] at map at Hello.scala:13 []
    |  MapPartitionsRDD[1] at flatMap at Hello.scala:10 []
    |  ParallelCollectionRDD[0] at makeRDD at Hello.scala:7 []
*/
dependencies: view the dependencies (the parent RDD)
import org.apache.spark.{SparkContext, SparkConf}

object Hello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app").setMaster("local")
    val sc = new SparkContext(conf)
    val r0 = sc.makeRDD(Seq("a", "b", "b"))
    println(r0.dependencies)
    println("--------------------------------------------------------------------")
    val r1 = r0.flatMap(_.split(" "))
    println(r1.dependencies)
    println("--------------------------------------------------------------------")
    val r2 = r1.map((_, 1))
    println(r2.dependencies)
    println("--------------------------------------------------------------------")
    val r3 = r2.reduceByKey(_ + _)
    println(r3.dependencies)
  }
}
/*
Output
List()
--------------------------------------------------------------------
List(org.apache.spark.OneToOneDependency@29a4f594)
--------------------------------------------------------------------
List(org.apache.spark.OneToOneDependency@3051e0b2)
--------------------------------------------------------------------
List(org.apache.spark.ShuffleDependency@3f985a86)
*/
Narrow dependency
Each partition of the parent RDD is used by at most one partition of the child RDD (OneToOneDependency).
Wide dependency
A partition of the parent RDD is depended on by multiple partitions of the child RDD (ShuffleDependency); this causes a shuffle.
How Jobs and Stages are divided
| Name | Meaning | Description |
| ---- | ------- | ----------- |
| Standalone mode or YARN mode | cluster | A Spark cluster can run multiple Spark applications at the same time |
| Application | application | An application can run multiple Jobs concurrently |
| Job | job | A Job corresponds to an action operator in the application; every time an action operator is executed, a Job is submitted. A Job consists of multiple Stages |
| Stage | stage | Each wide dependency introduces one stage split. A Stage consists of multiple Tasks |
| Task | task | The number of partitions of the last RDD in a stage is the number of Tasks in that stage |
Code example
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf().setAppName("A1").setMaster("local[2]")
val sc = new SparkContext(conf)
val r0 = sc.makeRDD(Seq("a", "b", "b", "c", "c"))
val r1 = r0.map((_, 1))
val r2 = r1.reduceByKey(_ + _)
println(r1.collect.toList) // List((a,1), (b,1), (b,1), (c,1), (c,1))
println(r2.collect.toList) // List((b,2), (a,1), (c,2))
Thread.sleep(999999) // keep the application alive so the jobs and stages can be inspected in the web UI (http://localhost:4040)
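Applying the table above to this code, a sketch of the expected breakdown (with local[2], makeRDD defaults to 2 partitions): each collect is an action, so two Jobs are submitted. The first Job (r1.collect) contains no wide dependency, so it has a single Stage with 2 Tasks. The second Job (r2.collect) is split by the reduceByKey shuffle into a ShuffleMapStage and a ResultStage, each with 2 Tasks; this can be verified in the web UI while the application sleeps.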
Appendix
| English | Pronunciation | Meaning |
| ------- | ------------- | ------- |
| spark | /spɑːrk/ | n. spark; fuse; vitality; v. to ignite |
| resilient | /rɪˈzɪliənt/ | adj. elastic; able to spring back |
| parallel | /ˈpærəlel/ | n. parallel line; adj. parallel |
| parallelize | /ˈpærəlelaɪz/ | v. to arrange in parallel |
| coalesce | /ˌkoʊəˈles/ | v. to merge |
| intersection | /ˌɪntərˈsekʃn/ | n. intersection |
| subtract | /səbˈtrækt/ | vt. to subtract; to deduct |
| aggregate | /ˈæɡrɪɡət/ | n. total; adj. aggregated; v. to gather |
| glom | /ɡlɑːm/ | vt. to steal; to grab; n. a glance |
| serializable | /ˈsɪriəˌlaɪzəbl/ | adj. serializable |
| persist | /pərˈsɪst/ | vi. to persist; to be stubborn; vt. to insist |
| stage | /steɪdʒ/ | n. stage; phase; platform; v. to stage |