[Spark Basics]--repartition vs coalesce

high2011 2022-11-03 14:37:58 Category: Spark

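© Copyright belongs to the author: this is an original work by 51CTO blog author high2011. Contact the author for permission to reprint; unauthorized reproduction may result in legal action.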

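Keep in mind that repartitioning your data is a fairly expensive operation. Fortunately, Spark also provides coalesce(), an optimized version of repartition() that can avoid a full data movement, but only when you are reducing the number of RDD partitions.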

I. Differences between repartition and coalesce

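1. With its default setting (shuffle = false), coalesce can only reduce the number of partitions. It merges existing partitions, which reduces the amount of data that has to be shuffled. Even so, in some specific cases I have found repartition to be faster than coalesce.
In my application, repartition was faster whenever the estimated number of output files fell below a certain threshold.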

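import org.apache.spark.sql.SaveMode

// Above the threshold, merge existing partitions; at or below it, do a full shuffle.
if (numFiles > 20) {
    df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
} else {
    df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
}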

2. repartition can either increase or decrease the number of partitions. It creates new partitions and performs a full shuffle.
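The difference in partition counts can be checked with a small sketch like the one below. It assumes a local SparkSession; the master URL, application name, and object name are illustrative choices rather than anything from the original application.

```scala
import org.apache.spark.sql.SparkSession

object PartitionCountDemo {
  // Returns the partition counts after several coalesce/repartition calls
  // on an RDD that starts with 100 partitions.
  def counts(spark: SparkSession): Seq[Int] = {
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)
    Seq(
      rdd.getNumPartitions,                           // starting point: 100
      rdd.coalesce(10).getNumPartitions,              // shrinks without a shuffle
      rdd.coalesce(1000).getNumPartitions,            // cannot grow without a shuffle
      rdd.coalesce(1000, shuffle = true).getNumPartitions,
      rdd.repartition(1000).getNumPartitions
    )
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-count-demo")
      .getOrCreate()
    try println(counts(spark).mkString(", ")) // 100, 10, 100, 1000, 1000
    finally spark.stop()
  }
}
```

Note that coalesce(1000) without a shuffle leaves the RDD at 100 partitions, while both shuffle-based calls reach 1000.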

II. Spark source code

1. coalesce

/**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions).values
    } else {
      new CoalescedRDD(this, numPartitions)
    }
  }
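To see why the shuffle branch spreads data evenly, the keying step can be reproduced in plain Scala without Spark. This is a simplified sketch: `assignKeys` mirrors `distributePartition` from the source above, while `targetPartition` is a stand-in for what `HashPartitioner` does with an Int key (a non-negative modulo). The object and helper names are hypothetical.

```scala
import scala.util.Random

object DistributeSketch {
  // Mirrors the keying step: each element gets an increasing Int key,
  // starting from a pseudo-random offset seeded by the partition index.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  // Stand-in for HashPartitioner on an Int key: non-negative modulo.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Counts how many elements of one input partition land in each output partition.
  def spread[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    assignKeys(index, items.iterator, numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toSeq
      .groupBy(identity)
      .map { case (partition, hits) => partition -> hits.size }

  def main(args: Array[String]): Unit = {
    // 12 elements from input partition 0 spread over 4 output partitions.
    println(spread(0, 1 to 12, 4))
  }
}
```

Because the keys are consecutive integers, the elements of each input partition are dealt out round-robin, so 12 elements over 4 output partitions give 3 elements per partition regardless of the random starting offset.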

2. repartition

/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
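Since repartition simply delegates to coalesce(numPartitions, shuffle = true), the two calls should differ from a plain coalesce only by the extra shuffle stage in the lineage. The sketch below prints both lineages with toDebugString; the local master, application name, and the `usesShuffle` helper are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LineageDemo {
  // Hypothetical helper: a lineage contains a shuffle if a ShuffledRDD appears in it.
  def usesShuffle(debugString: String): Boolean = debugString.contains("ShuffledRDD")

  // Returns the lineage descriptions of coalesce(10) and repartition(10).
  def lineages(spark: SparkSession): (String, String) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)
    (rdd.coalesce(10).toDebugString, rdd.repartition(10).toDebugString)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("lineage-demo")
      .getOrCreate()
    try {
      val (coalesced, repartitioned) = lineages(spark)
      println(coalesced)      // narrow dependency: no ShuffledRDD expected
      println(repartitioned)  // expected to include a ShuffledRDD stage
    } finally spark.stop()
  }
}
```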

III. How to choose the number of partitions

1. Impact of the partition count

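(1) Too few partitions: you will not make full use of all the cores available in the cluster.
(2) Too many partitions: you end up managing many small tasks, which creates excessive overhead (more fetches, more disk seeks, and the driver has to track the state of every task).
Summary:

Of the two, the first has a much larger impact on performance.

For partition counts below 1000, the overhead of scheduling many small tasks is relatively minor.

Once the partition count reaches the order of tens of thousands, Spark becomes very slow.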

2. Setting a reasonable partition count

The official Spark tuning guide recommends 2-3 tasks per CPU core in your cluster: http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
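As a rough illustration of that guideline, the arithmetic can be written as a small helper. The object name, function name, and cluster size below are hypothetical; the result is only a starting range for tuning, not a rule that fits every workload.

```scala
object PartitionSizing {
  // Returns the (low, high) partition-count range implied by the
  // "2-3 tasks per CPU core" guideline for a given total core count.
  def suggestedRange(totalCores: Int): (Int, Int) = {
    require(totalCores > 0, "totalCores must be positive")
    (totalCores * 2, totalCores * 3)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical cluster: 10 executors with 4 cores each.
    val (low, high) = suggestedRange(10 * 4)
    println(s"Suggested partition count: $low to $high") // 80 to 120
  }
}
```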

3. How to set the partition count (parallelism)

(1) When submitting a Spark job

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Add the following after --conf:

spark.default.parallelism=your_partition_number
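The same property can also be set programmatically through SparkConf before the context is created. The sketch below assumes a local master and uses 8 as an arbitrary value; the object and function names are hypothetical. Keep in mind that spark.default.parallelism applies to RDD operations, while the partition count of DataFrame shuffles is controlled by spark.sql.shuffle.partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismConfigDemo {
  // Creates a local context with the given default parallelism and returns
  // the partition count of an RDD built without an explicit numSlices.
  def defaultPartitions(parallelism: Int): Int = {
    val conf = new SparkConf()
      .setMaster("local[2]")                  // illustration only
      .setAppName("parallelism-config-demo")
      .set("spark.default.parallelism", parallelism.toString)
    val sc = new SparkContext(conf)
    try sc.parallelize(1 to 100).getNumPartitions
    finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    println(defaultPartitions(8))
  }
}
```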