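Keep in mind that repartitioning your data is a fairly expensive operation. Fortunately, Spark also provides coalesce(), an optimized version of repartition() that avoids a full shuffle, but only when you are reducing the number of RDD partitions.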

I. Differences between repartition and coalesce

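1. With its default setting (shuffle = false), coalesce can only reduce the number of partitions. It merges existing partitions to minimize the amount of data that is shuffled. In some specific cases, however, I have found repartition to be faster than coalesce.
In my application, repartition ran faster when the estimated number of output files was below a certain threshold (20 in the snippet below).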

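import org.apache.spark.sql.SaveMode

// numFiles, df and dest are defined elsewhere in the application
if (numFiles > 20) {
  df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
} else {
  df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
}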

2. repartition can either increase or decrease the number of partitions. It creates new partitions and performs a full shuffle.
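
The difference between the two operations can be sketched without a cluster. The following is a toy model in plain Scala, not Spark's implementation: partitions are modeled as a `Vector[Vector[Int]]`, and the names `PartitionModel`, `coalesceModel`, and `repartitionModel` are hypothetical, introduced only for illustration. The first function merges whole neighboring partitions, while the second reassigns every record individually, which is roughly what a full shuffle implies.

```scala
// Toy model of partitioning in plain Scala; this is NOT Spark's implementation.
// PartitionModel, coalesceModel and repartitionModel are hypothetical names.
object PartitionModel {
  type Partitions = Vector[Vector[Int]]

  // Narrow-dependency style: every output partition is the concatenation of
  // whole input partitions, so records never leave their original neighbors.
  def coalesceModel(parts: Partitions, n: Int): Partitions = {
    require(n > 0 && n <= parts.length, "this model can only reduce the partition count")
    val len = parts.length
    Vector.tabulate(n) { out =>
      // Input partition i is claimed by output partition i * n / len.
      parts.zipWithIndex.collect { case (p, i) if i * n / len == out => p }.flatten
    }
  }

  // Full-shuffle style: every record is reassigned round-robin, so the
  // partition count can grow or shrink.
  def repartitionModel(parts: Partitions, n: Int): Partitions = {
    require(n > 0, "the partition count must be positive")
    val records = parts.flatten
    Vector.tabulate(n) { out =>
      records.zipWithIndex.collect { case (r, k) if k % n == out => r }
    }
  }
}
```

For example, with `Vector(Vector(1, 2), Vector(3, 4), Vector(5, 6), Vector(7, 8))` as input, `coalesceModel(parts, 2)` keeps `1, 2, 3, 4` together in one partition, whereas `repartitionModel(parts, 3)` spreads neighboring records across all three output partitions.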

II. Spark source code

1. coalesce

/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * Note: With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
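
To see why the shuffle branch spreads data evenly, the key assignment can be re-created with plain Scala collections. This is a stand-alone sketch rather than Spark code: `DistributeSketch`, `targetPartition`, and `countsPerTarget` are hypothetical names, and `targetPartition` assumes the hash partitioner reduces an Int key to a non-negative modulo, as the comment in the source above describes.

```scala
import scala.util.Random

// Stand-alone re-creation of the key assignment in the shuffle branch of
// coalesce. DistributeSketch and its methods are hypothetical names and are
// not part of Spark.
object DistributeSketch {
  // Mirrors distributePartition: start at a position seeded by the partition
  // index, then give consecutive elements consecutive integer keys.
  def distributePartition[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // An Int key hashes to itself, so hash partitioning reduces to a
  // non-negative modulo of the key (an assumption based on the source comment).
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val m = key % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // How many elements of one input partition land in each output partition.
  def countsPerTarget[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    distributePartition(index, items.iterator, numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toVector
      .groupBy(identity)
      .map { case (target, hits) => target -> hits.size }
}
```

Because the keys are consecutive integers, the elements of each input partition are dealt out round-robin: calling `DistributeSketch.countsPerTarget(0, 1 to 100, 4)` assigns 25 elements to each of the four output partitions, whatever the random starting position is.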

2. repartition

/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

III. How to choose the number of partitions

1. Impact of the partition count

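(1) Too few partitions: you will not make full use of all the cores available in the cluster.
(2) Too many partitions: you end up managing many small tasks, which adds excessive overhead (more fetches, more disk seeks, and the driver has to track the state of every task).
Summary:

    Of the two, the first has a much greater impact on performance.

    For partition counts below 1000, the cost of scheduling too many small tasks is relatively minor.

    With partition counts on the order of tens of thousands, Spark becomes very slow.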


2. Setting a reasonable number of partitions

The official Spark documentation recommends 2-3 tasks per CPU core in your cluster: http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
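
As a rough illustration of that rule of thumb, the arithmetic can be written as a small helper. `ParallelismHint` and `suggestedPartitions` are hypothetical names, and describing the cluster as executors times cores per executor is an assumption made for this sketch; treat the result as a starting point to tune from, not a definitive setting.

```scala
// Rule-of-thumb helper; ParallelismHint and suggestedPartitions are
// hypothetical names introduced for illustration.
object ParallelismHint {
  // Returns the (lower, upper) partition counts implied by 2-3 tasks per CPU core.
  def suggestedPartitions(executors: Int, coresPerExecutor: Int): (Int, Int) = {
    require(executors > 0 && coresPerExecutor > 0, "executor and core counts must be positive")
    val totalCores = executors * coresPerExecutor
    (2 * totalCores, 3 * totalCores)
  }
}
```

For a cluster with 10 executors of 4 cores each, `ParallelismHint.suggestedPartitions(10, 4)` yields a range of 80 to 120 partitions.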


3. How to set the number of partitions (parallelism)

(1) When submitting the job to Spark

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]


Add the following after --conf:

spark.default.parallelism=your_partition_number

(2) Pass the partition count as an argument to a transformation operator

See the API for key-value pair (tuple) operations: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
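
A minimal sketch of this second approach follows, assuming Spark is on the classpath and the job runs in local mode. The object name `NumPartitionsExample`, the helper `sumByKey`, and the sample data are made up for illustration; `reduceByKey` is one of several pair-RDD operators that accept an explicit partition count as their last argument.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: NumPartitionsExample, sumByKey and the sample data are
// hypothetical; reduceByKey(func, numPartitions) is the Spark call being shown.
object NumPartitionsExample {
  // Sums the values per key and returns the partition count of the result
  // together with the collected sums.
  def sumByKey(sc: SparkContext, numPartitions: Int): (Int, Map[String, Int]) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // The second argument sets the number of partitions of the shuffled result.
    val summed = pairs.reduceByKey(_ + _, numPartitions)
    (summed.getNumPartitions, summed.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("num-partitions-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (count, sums) = sumByKey(sc, numPartitions = 8)
      println(s"partitions = $count, sums = $sums")
    } finally {
      sc.stop()
    }
  }
}
```

Setting the count per operator like this overrides the global default for that one shuffle, which can be handy when a single stage needs more or less parallelism than the rest of the job.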


References:

https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce

https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

https://github.com/apache/spark/blob/128c29035b4e7383cc3a9a6c7a9ab6136205ac6c/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L376

http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism

https://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark/35804407#35804407