spark exec core 和exec mem关系 spark repartition和coalesce

转载

mob6454cc6d5f87 2024-01-09 15:34:15

文章标签 大数据 java scala spark apache 文章分类 Spark 大数据

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

coalesce()方法的作用是返回指定一个新的指定分区的RDD 如果是生成一个窄依赖的结果，那么不会发生shuffle。比如：1000个分区被重新设置成10个分区，这样不会发生shuffle。

关于RDD的依赖，这里提一下。RDD的依赖分为两种：窄依赖和宽依赖 窄依赖是指父Rdd的分区最多只能被一个子Rdd的分区所引用，即一个父Rdd的分区对应一个子Rdd的分区，或者多个父Rdd的分区对应一个子Rdd的分区。而宽依赖就是宽依赖是指子RDD的分区依赖于父RDD的多个分区或所有分区，即存在一个父RDD的一个分区对应一个子RDD的多个分区。1个父RDD分区对应多个子RDD分区，这其中又分两种情况：1个父RDD对应所有子RDD分区（未经协同划分的Join）或者1个父RDD对应非全部的多个RDD分区（如groupByKey）。

如下图所示：map就是一种窄依赖，而join则会导致宽依赖

回到刚才的分区，如果分区的数量发生激烈的变化，如设置numPartitions=1，这可能会造成运行计算的节点比你想象的要少，为了避免这个情况，可以设置shuffle=true，从而提高并行度，但这会增加shuffle操作。关于这个分区的激烈的变化情况，比如分区数量从父RDD的几千个分区设置成几个，有可能会遇到这么一个错误。

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 77.0 failed 4 times, most recent failure: Lost task 1.3 in stage 77.0 (TID 6334, 192.168.8.61): java.io.IOException: Unable to acquire 16777216 bytes of memory  
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)  
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:332)  
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:461)  
        at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:139)  
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)  
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)  
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)  
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate
anonfun

这个错误只要把shuffle设置成true即可解决。当把父RDD的分区数量增大是，比如RDD的分区是100，设置成1000，如果shuffle为false，那就不会起作用。这时候就需要设置shuffle为true了，那么RDD将在shuffle之后返回一个1000个分区的RDD，数据分区方式默认是采用hashpartitioner。最后来看看repartition()方法的源码：

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {  
   coalesce(numPartitions, shuffle = true)  
 }

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：二十一点纸牌游戏python 纸牌二十一点单机版

下一篇：Java日历记事本课程设计目的 java记事本程序设计报告

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯

spark exec core 和exec mem关系 spark repartition和coalesce

spark exec core 和exec mem关系 spark repartition和coalesce

51CTO博客