Function prototype
def coalesce(numPartitions: Int, shuffle: Boolean = false)
(implicit ord: Ordering[T] = null): RDD[T]
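Returns a new RDD with numPartitions partitions. If shuffle is set to true, a shuffle is performed. With the default shuffle = false, coalesce can only reduce the partition count: asking for more partitions than the RDD already has leaves the count unchanged.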
Example
/**
* User: 过往记忆
*/
scala> var data = sc.parallelize(List(1,2,3,4))
data: org.apache.spark.rdd.RDD[Int] =
ParallelCollectionRDD[45] at parallelize at <console>:12
scala> data.partitions.length
res68: Int = 30
scala> val result = data.coalesce(2, false)
result: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[57] at coalesce at <console>:14
scala> result.partitions.length
res77: Int = 2
scala> result.toDebugString
res75: String =
(2) CoalescedRDD[57] at coalesce at <console>:14 []
| ParallelCollectionRDD[45] at parallelize at <console>:12 []
scala> val result1 = data.coalesce(2, true)
result1: org.apache.spark.rdd.RDD[Int] = MappedRDD[61] at coalesce at <console>:14
scala> result1.toDebugString
res76: String =
(2) MappedRDD[61] at coalesce at <console>:14 []
| CoalescedRDD[60] at coalesce at <console>:14 []
| ShuffledRDD[59] at coalesce at <console>:14 []
+-(30) MapPartitionsRDD[58] at coalesce at <console>:14 []
| ParallelCollectionRDD[45] at parallelize at <console>:12 []
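As the output above shows, no shuffle takes place when shuffle is false: the lineage is just a CoalescedRDD sitting directly on top of the parent RDD. When shuffle is true, a ShuffledRDD appears in the lineage, so a shuffle is performed. RDD.partitions.length returns the number of partitions of an RDD. The 30 partitions of data come from the default parallelism of the shell that produced this transcript, because parallelize was called without an explicit partition count.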
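The same behavior can be checked outside the shell. The following is a minimal sketch, not part of the original session: the object name CoalesceSketch, the local[2] master, the app name, and the partition counts 8, 2 and 16 are all illustrative assumptions, and it presumes Spark is on the classpath. It compares partition counts for coalesce with and without a shuffle, and for repartition, which requests a shuffle.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone program; names and numbers are illustrative only.
object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try {
      // Request 8 partitions explicitly so the result does not depend on
      // the default parallelism of the environment.
      val data = sc.parallelize(1 to 100, 8)

      // shuffle = false (the default): partitions are merged without a shuffle.
      val narrowed = data.coalesce(2)
      // shuffle = false cannot increase the partition count.
      val widened = data.coalesce(16)
      // shuffle = true can increase the partition count.
      val shuffled = data.coalesce(16, shuffle = true)
      // repartition(n) also shuffles to reach n partitions.
      val viaRepartition = data.repartition(16)

      assert(narrowed.partitions.length == 2)
      assert(widened.partitions.length == 8)
      assert(shuffled.partitions.length == 16)
      assert(viaRepartition.partitions.length == 16)

      // The elements are the same after the shuffle; only their placement changes.
      assert(shuffled.collect().sorted.sameElements((1 to 100).toArray))
    } finally {
      sc.stop()
    }
  }
}
```

Reducing partitions with shuffle = false is usually the cheaper option because it avoids moving data across the network. The shuffle variant is the one to reach for when the partition count has to grow or the data needs to be redistributed evenly.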