How to avoid shuffle in a join

When we process data with Spark, the most painful part is complex business logic. Such logic usually cannot be handled by a plain map operator; it ends up being an aggregate or a join, and those operations come with a shuffle that adds a large performance cost to the job. Today let's discuss how to avoid a shuffle when using join.

When we call join directly, we normally get a common join. This kind of join splits the job into a new stage, because the data on both sides has to be redistributed by key, which inevitably triggers a shuffle.
If we want to avoid that shuffle, our only options are map join and bucket join. Both still involve some network IO, and the latter does not truly eliminate the shuffle: it relies on setting a partitioner so that no shuffle happens inside the join itself. Since the partitioners of our RDDs rarely match our requirements out of the box, re-partitioning them unavoidably costs some shuffle up front, but in certain scenarios this approach still delivers noticeably better performance.

map join

I covered map join in detail in earlier posts. If you want Spark SQL to trigger a map join automatically, set the spark.sql.autoBroadcastJoinThreshold parameter. Its default value is 10MB: when you join a large table with a small one and the small table is at most 10MB, Spark automatically uses a map join (broadcast join) and broadcasts the small table to every worker node.
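As a minimal sketch (the table sizes, names and the 50MB value below are just placeholders for illustration), you can raise the threshold when building the session, or force the behavior with the broadcast hint:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinDemo")
      .master("local[3]")
      // raise the auto-broadcast threshold from the default 10MB to 50MB (bytes)
      .config("spark.sql.autoBroadcastJoinThreshold", "52428800")
      .getOrCreate()

    val dfBig   = spark.range(0, 1000000).toDF("id") // stands in for the large table
    val dfSmall = spark.range(0, 100).toDF("id")     // stands in for the small table

    // below the threshold Spark picks a broadcast join by itself;
    // the broadcast() hint forces it regardless of the estimated size
    val joined = dfBig.join(broadcast(dfSmall), Seq("id"))
    joined.explain() // the physical plan should show a BroadcastHashJoin, i.e. no shuffle for the join

    spark.stop()
  }
}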

bucket join

Besides the approach above, there is another option. In my earlier post on wide and narrow dependencies I showed the following diagram.

(diagram: wide vs. narrow dependencies, including "join with inputs co-partitioned")


It happens to include a "join with inputs co-partitioned" operation. That join is a narrow dependency, which means no shuffle occurs. So how do we trigger this kind of join?

We still call the same join operator, but the two RDDs being joined must share the same partitioner object, and that same partitioner object must also be passed into the join call. When all three refer to the same partitioner, this shuffle-free join is triggered automatically.

Let's look at the source code to see what is going on:

// First, step into the join operator; this overload of join takes a partitioner: Partitioner argument
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

Next, step into the cogroup method to see what happens there:

// Here we can see that cogroup returns an RDD: it creates a CoGroupedRDD,
// passing in the two RDDs being joined and the partitioner we supplied
/**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }

OK, let's go into CoGroupedRDD and see what this RDD's dependencies look like:

/**
 * :: DeveloperApi ::
 * An RDD that cogroups its parents. For each key k in parent RDDs, the resulting RDD contains a
 * tuple with the list of values for that key.
 *
 * @param rdds parent RDDs.
 * @param part partitioner used to partition the shuffle output
 *
 * @note This is an internal API. We recommend users use RDD.cogroup(...) instead of
 * instantiating this directly.
 */
@DeveloperApi
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {

  // For example, `(k, a) cogroup (k, b)` produces k -> Array(ArrayBuffer as, ArrayBuffer bs).
  // Each ArrayBuffer is represented as a CoGroup, and the resulting Array as a CoGroupCombiner.
  // CoGroupValue is the intermediate state of each value before being merged in compute.
  private type CoGroup = CompactBuffer[Any]
  private type CoGroupValue = (Any, Int)  // Int is dependency number
  private type CoGroupCombiner = Array[CoGroup]

  private var serializer: Serializer = SparkEnv.get.serializer

  /** Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer) */
  def setSerializer(serializer: Serializer): CoGroupedRDD[K] = {
    this.serializer = serializer
    this
  }

  override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }

  override def getPartitions: Array[Partition] = {
    val array = new Array[Partition](part.numPartitions)
    for (i <- 0 until array.length) {
      // Each CoGroupPartition will have a dependency per contributing RDD
      array(i) = new CoGroupPartition(i, rdds.zipWithIndex.map { case (rdd, j) =>
        // Assume each RDD contributed a single dependency, and get it
        dependencies(j) match {
          case s: ShuffleDependency[_, _, _] =>
            None
          case _ =>
            Some(new NarrowCoGroupSplitDep(rdd, i, rdd.partitions(i)))
        }
      }.toArray)
    }
    array
  }

  override val partitioner: Some[Partitioner] = Some(part)

  override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
    val split = s.asInstanceOf[CoGroupPartition]
    val numRdds = dependencies.length

    // A list of (rdd iterator, dependency number) pairs
    val rddIterators = new ArrayBuffer[(Iterator[Product2[K, Any]], Int)]
    for ((dep, depNum) <- dependencies.zipWithIndex) dep match {
      case oneToOneDependency: OneToOneDependency[Product2[K, Any]] @unchecked =>
        val dependencyPartition = split.narrowDeps(depNum).get.split
        // Read them from the parent
        val it = oneToOneDependency.rdd.iterator(dependencyPartition, context)
        rddIterators += ((it, depNum))

      case shuffleDependency: ShuffleDependency[_, _, _] =>
        // Read map outputs of shuffle
        val it = SparkEnv.get.shuffleManager
          .getReader(shuffleDependency.shuffleHandle, split.index, split.index + 1, context)
          .read()
        rddIterators += ((it, depNum))
    }

    val map = createExternalMap(numRdds)
    for ((it, depNum) <- rddIterators) {
      map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
    }
    context.taskMetrics().incMemoryBytesSpilled(map.memoryBytesSpilled)
    context.taskMetrics().incDiskBytesSpilled(map.diskBytesSpilled)
    context.taskMetrics().incPeakExecutionMemory(map.peakMemoryUsedBytes)
    new InterruptibleIterator(context,
      map.iterator.asInstanceOf[Iterator[(K, Array[Iterable[_]])]])
  }

  private def createExternalMap(numRdds: Int)
    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {

    val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
      val newCombiner = Array.fill(numRdds)(new CoGroup)
      newCombiner(value._2) += value._1
      newCombiner
    }
    val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
      (combiner, value) => {
      combiner(value._2) += value._1
      combiner
    }
    val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
      (combiner1, combiner2) => {
        var depNum = 0
        while (depNum < numRdds) {
          combiner1(depNum) ++= combiner2(depNum)
          depNum += 1
        }
        combiner1
      }
    new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](
      createCombiner, mergeValue, mergeCombiners)
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdds = null
  }
}

That is the full source of CoGroupedRDD, but we only need to focus on the getDependencies method:

override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }

The key branch is if (rdd.partitioner == Some(part)): when it holds, a OneToOneDependency(rdd) is created, which is exactly the narrow dependency we want; otherwise a ShuffleDependency, i.e. a wide dependency, is created.
So the join is shuffle-free when every RDD in rdds (which holds the two RDDs we are joining) reports the same partitioner as the one passed into join. Note that for a typical custom partitioner, which does not override equals, "the same" really means the same object; three separate instances new-ed from the same class will not pass the check (I misread this at first and assumed three different instances would work).
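To make that distinction concrete, here is a small sketch you can paste into spark-shell (NoEqualsPartitioner is a made-up class standing in for any custom partitioner that does not override equals; the MyPartitioner used later presumably behaves the same way):

import org.apache.spark.{HashPartitioner, Partitioner}

// A custom partitioner that does NOT override equals/hashCode,
// so == on it falls back to reference equality.
class NoEqualsPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

val p1 = new NoEqualsPartitioner(3)
val p2 = new NoEqualsPartitioner(3)
println(p1 == p2) // false: two different instances of the same class
println(p1 == p1) // true:  the very same instance

// HashPartitioner overrides equals to compare numPartitions,
// so two separately created instances with the same count are ==.
println(new HashPartitioner(3) == new HashPartitioner(3)) // true

So with a HashPartitioner, separately created instances with the same numPartitions would in fact pass the rdd.partitioner == Some(part) check, but with a custom partitioner like the one used below you have to reuse the one instance.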

I have prepared a simple piece of code. We can run it in debug mode to see what happens, and then check the DAG on the Spark UI at port 4040.

package com.doudou.batch.join

import com.doudou.batch.utils.MyPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CoPartitionerApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoPartitionerApp").setMaster("local[3]")
    val sc = new SparkContext(conf)
    val part = new MyPartitioner(3)

    val d1 = Array((1,"a"),(2,"b"),(3,"c"))
    val d2 = Array((1,"1"),(2,"2"),(3,"3"))

    val r1: RDD[(Int, String)] = sc.parallelize(d1,1)
    val r2: RDD[(Int, String)] = sc.parallelize(d2,1)


    val r3 = r1.join(r2)
    println(r3.collect())


//    val r1_1: RDD[(Int, String)] = r1.partitionBy(part)
//    val r2_1: RDD[(Int, String)] = r2.partitionBy(part)
//
//    val r3_1 = r1_1.join(r2_1,part)
//    println(r3_1.collect())
//    println(r1.partitioner)
//    println(r2.partitioner)
//    println(r1_1.partitioner)
//    println(r2_1.partitioner)


    Thread.sleep(1000 * 60 * 100)

    sc.stop()
  }

}
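MyPartitioner is imported from com.doudou.batch.utils but its code is not shown here; any simple key-based partitioner will do. A minimal sketch of what it could look like (not necessarily identical to the real class):

package com.doudou.batch.utils

import org.apache.spark.Partitioner

// A minimal hash-based partitioner. equals/hashCode are not overridden,
// so only the very same instance satisfies rdd.partitioner == Some(part).
class MyPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}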

We set a breakpoint at the r3 line and start debugging.

(screenshot: debugger paused at the r3 join)


We have reached this point; keep stepping.

(screenshot: debugger view)


Then we step into the call to take a look.

(screenshot: the shuffle dependency branch being taken)


Here it takes the shuffle dependency branch, and the DAG graph below shows that a shuffle is indeed triggered.

(screenshot: DAG showing a shuffle before the join)

Now let's run the other version of the code by adding (uncommenting) this section:

val r1_1: RDD[(Int, String)] = r1.partitionBy(part)
val r2_1: RDD[(Int, String)] = r2.partitionBy(part)

val r3_1 = r1_1.join(r2_1, part)
println(r3_1.collect())
println(r1.partitioner)
println(r2.partitioner)
println(r1_1.partitioner)
println(r2_1.partitioner)

Again, we use debug mode to see which dependency branch is taken this time.

(screenshot: the one-to-one / narrow dependency branch being taken)


This time the dependency is the narrow one, and the DAG graph confirms that no shuffle happens inside the join.

(screenshot: DAG with no shuffle inside the join)


We can also check the console output:

None
None
Some(com.doudou.batch.utils.MyPartitioner@7d42542)
Some(com.doudou.batch.utils.MyPartitioner@7d42542)

Here r1_1 and r2_1 share the same partitioner instance (the same @7d42542), while r1 and r2 have no partitioner at all.

Summary

Although we can avoid the shuffle inside the join in some scenarios, it is rare for the two RDDs to already share a partitioner that we can simply extract and pass into join. Usually we have to repartition the RDDs up front with partitionBy, and that step itself involves a shuffle.
So weigh this technique against your own business scenario before using it.
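For example, one scenario where the up-front cost pays off is joining a pre-partitioned, cached RDD many times: only the initial partitionBy shuffles that RDD, while every subsequent join stays narrow. A minimal sketch (the data and names here are made up for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ReusePartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ReusePartitionedJoin").setMaster("local[3]"))
    val part = new HashPartitioner(3)

    // Pay the partitionBy shuffle once and keep the result in memory.
    val users = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
      .partitionBy(part)
      .cache()

    // Each incoming batch is partitioned the same way, so the join adds
    // only OneToOneDependency edges and `users` is never shuffled again.
    val batch1 = sc.parallelize(Seq((1, "click"), (3, "view"))).partitionBy(part)
    val batch2 = sc.parallelize(Seq((2, "buy"))).partitionBy(part)

    users.join(batch1, part).collect().foreach(println)
    users.join(batch2, part).collect().foreach(println)

    sc.stop()
  }
}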