How to avoid shuffle in joins
When we use Spark to process data, the most painful part is complex business logic. That logic usually cannot be expressed with a map operator alone; it ends up needing aggregate or join operations, and those come with shuffles that add a large performance cost to the job. In this post we look at how to avoid shuffle when using join.
When we call join directly, we normally get a common join. This kind of join introduces a stage boundary, because the data of both tables has to be redistributed according to their keys, which inevitably triggers a shuffle.
To avoid that shuffle we are left with two options: map join and bucket join. Both still involve some network IO, and the latter does not eliminate the shuffle completely. It avoids shuffle inside the join by giving both inputs the same partitioner, but an arbitrary RDD's partitioner rarely happens to be the one we need, so re-partitioning the inputs beforehand inevitably incurs some shuffle of its own. Even so, in certain scenarios this approach still delivers noticeably better performance.
map join
In earlier posts I described the map join in detail. If you want Spark SQL to trigger a map join automatically, tune the spark.sql.autoBroadcastJoinThreshold parameter. Its default value is 10 MB: when you join a large table with a small one and the small table is at or below this threshold, Spark automatically performs a map join and broadcasts the small table to every worker node.
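As a quick illustration, you can raise the threshold or force the behaviour explicitly with the broadcast hint from org.apache.spark.sql.functions. This is a minimal sketch with made-up data, not code from the original post:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoinDemo").master("local[3]").getOrCreate()

// Raise the auto-broadcast threshold to 50 MB (value in bytes; -1 disables auto broadcast).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

val big   = spark.range(0L, 1000000L).withColumnRenamed("id", "k")
val small = spark.range(0L, 100L).withColumnRenamed("id", "k")

// Explicit hint: broadcast the small side regardless of the threshold.
val joined = big.join(broadcast(small), "k")
joined.explain()   // the plan should show a BroadcastHashJoin, i.e. the big table is not shuffled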
bucket join
Besides the approach above, in the post on narrow and wide dependencies I also showed a diagram.
That diagram contains a "join with inputs co-partitioned" operation, and that join is a narrow dependency, which means no shuffle happens. So how do we trigger this kind of join?
We still call the same join operator, but the two RDDs being joined must already share the same partitioner object, and that partitioner must also be passed into join. When all three refer to the same partitioner, this shuffle-free join is triggered automatically; the example at the end of this post shows the exact pattern.
Let's look at the source code to see how this works:
// First we step into the join operator; this overload takes a partitioner: Partitioner argument
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
Next, let's step into the cogroup method and see what it does:
// Here we can see that cogroup returns an RDD: it constructs a CoGroupedRDD
// and passes in the two RDDs we want to join together with the partitioner object we supplied
/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
OK, now let's go into CoGroupedRDD and see what this RDD's dependencies are:
/**
 * :: DeveloperApi ::
 * An RDD that cogroups its parents. For each key k in parent RDDs, the resulting RDD contains a
 * tuple with the list of values for that key.
 *
 * @param rdds parent RDDs.
 * @param part partitioner used to partition the shuffle output
 *
 * @note This is an internal API. We recommend users use RDD.cogroup(...) instead of
 * instantiating this directly.
 */
@DeveloperApi
class CoGroupedRDD[K: ClassTag](
    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
    part: Partitioner)
  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {

  // For example, `(k, a) cogroup (k, b)` produces k -> Array(ArrayBuffer as, ArrayBuffer bs).
  // Each ArrayBuffer is represented as a CoGroup, and the resulting Array as a CoGroupCombiner.
  // CoGroupValue is the intermediate state of each value before being merged in compute.
  private type CoGroup = CompactBuffer[Any]
  private type CoGroupValue = (Any, Int) // Int is dependency number
  private type CoGroupCombiner = Array[CoGroup]

  private var serializer: Serializer = SparkEnv.get.serializer

  /** Set a serializer for this RDD's shuffle, or null to use the default (spark.serializer) */
  def setSerializer(serializer: Serializer): CoGroupedRDD[K] = {
    this.serializer = serializer
    this
  }

  override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }

  override def getPartitions: Array[Partition] = {
    val array = new Array[Partition](part.numPartitions)
    for (i <- 0 until array.length) {
      // Each CoGroupPartition will have a dependency per contributing RDD
      array(i) = new CoGroupPartition(i, rdds.zipWithIndex.map { case (rdd, j) =>
        // Assume each RDD contributed a single dependency, and get it
        dependencies(j) match {
          case s: ShuffleDependency[_, _, _] =>
            None
          case _ =>
            Some(new NarrowCoGroupSplitDep(rdd, i, rdd.partitions(i)))
        }
      }.toArray)
    }
    array
  }

  override val partitioner: Some[Partitioner] = Some(part)

  override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
    val split = s.asInstanceOf[CoGroupPartition]
    val numRdds = dependencies.length

    // A list of (rdd iterator, dependency number) pairs
    val rddIterators = new ArrayBuffer[(Iterator[Product2[K, Any]], Int)]
    for ((dep, depNum) <- dependencies.zipWithIndex) dep match {
      case oneToOneDependency: OneToOneDependency[Product2[K, Any]] @unchecked =>
        val dependencyPartition = split.narrowDeps(depNum).get.split
        // Read them from the parent
        val it = oneToOneDependency.rdd.iterator(dependencyPartition, context)
        rddIterators += ((it, depNum))

      case shuffleDependency: ShuffleDependency[_, _, _] =>
        // Read map outputs of shuffle
        val it = SparkEnv.get.shuffleManager
          .getReader(shuffleDependency.shuffleHandle, split.index, split.index + 1, context)
          .read()
        rddIterators += ((it, depNum))
    }

    val map = createExternalMap(numRdds)
    for ((it, depNum) <- rddIterators) {
      map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
    }

    context.taskMetrics().incMemoryBytesSpilled(map.memoryBytesSpilled)
    context.taskMetrics().incDiskBytesSpilled(map.diskBytesSpilled)
    context.taskMetrics().incPeakExecutionMemory(map.peakMemoryUsedBytes)
    new InterruptibleIterator(context,
      map.iterator.asInstanceOf[Iterator[(K, Array[Iterable[_]])]])
  }

  private def createExternalMap(numRdds: Int)
    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {

    val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
      val newCombiner = Array.fill(numRdds)(new CoGroup)
      newCombiner(value._2) += value._1
      newCombiner
    }
    val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
      (combiner, value) => {
        combiner(value._2) += value._1
        combiner
      }
    val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
      (combiner1, combiner2) => {
        var depNum = 0
        while (depNum < numRdds) {
          combiner1(depNum) ++= combiner2(depNum)
          depNum += 1
        }
        combiner1
      }
    new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](
      createCombiner, mergeValue, mergeCombiners)
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdds = null
  }
}
That is the complete source of CoGroupedRDD, but we only need to focus on the getDependencies method:
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
The key line is if (rdd.partitioner == Some(part)): when it holds, a OneToOneDependency(rdd) is created, which is exactly the narrow dependency, while the ShuffleDependency in the else branch is the wide dependency.
In other words, the join is shuffle-free when the partitioner of each RDD in rdds (the two RDDs we are joining) compares equal to the partitioner passed into join. Note that the check goes through ==, i.e. equals: a partitioner such as HashPartitioner overrides equals, so two instances with the same number of partitions count as equal, but a custom partitioner that does not override equals falls back to reference equality, so you must reuse the very same object. Simply newing up three separate instances of the same class is not enough (I misread this at first and assumed three different instances would work).
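A quick illustration of the difference; PlainPartitioner here is just a throwaway class for demonstration, not something from the post:
import org.apache.spark.{HashPartitioner, Partitioner}

// HashPartitioner overrides equals, so two instances with the same
// numPartitions compare equal and count as the same partitioning.
new HashPartitioner(3) == new HashPartitioner(3)   // true

// A Partitioner that does not override equals falls back to reference
// equality: only the very same instance matches.
class PlainPartitioner(n: Int) extends Partitioner {
  override def numPartitions: Int = n
  override def getPartition(key: Any): Int = {
    val m = key.hashCode % n
    if (m < 0) m + n else m
  }
}
val p1 = new PlainPartitioner(3)
val p2 = new PlainPartitioner(3)
p1 == p2   // false
p1 == p1   // true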
Here is a small program we can step through in debug mode and then inspect the DAG on the 4040 UI:
package com.doudou.batch.join

import com.doudou.batch.utils.MyPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CoPartitionerApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoPartitionerApp").setMaster("local[3]")
    val sc = new SparkContext(conf)

    val part = new MyPartitioner(3)

    val d1 = Array((1,"a"),(2,"b"),(3,"c"))
    val d2 = Array((1,"1"),(2,"2"),(3,"3"))
    val r1: RDD[(Int, String)] = sc.parallelize(d1,1)
    val r2: RDD[(Int, String)] = sc.parallelize(d2,1)

    val r3 = r1.join(r2)
    println(r3.collect())

    // val r1_1: RDD[(Int, String)] = r1.partitionBy(part)
    // val r2_1: RDD[(Int, String)] = r2.partitionBy(part)
    //
    // val r3_1 = r1_1.join(r2_1,part)
    // println(r3_1.collect())

    // println(r1.partitioner)
    // println(r2.partitioner)
    // println(r1_1.partitioner)
    // println(r2_1.partitioner)

    Thread.sleep(1000 * 60 * 100)
    sc.stop()
  }
}
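The MyPartitioner class used above is not shown in the post. A minimal sketch of what such a partitioner could look like (this is an assumption on my part: records are routed by key hash modulo numPartitions, and equals is deliberately not overridden):
package com.doudou.batch.utils

import org.apache.spark.Partitioner

class MyPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0, "Number of partitions must be positive")

  override def numPartitions: Int = partitions

  // Route each record by its key; keys in the demo are small Ints.
  override def getPartition(key: Any): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // equals/hashCode intentionally not overridden, so two MyPartitioner
  // instances only count as the same partitioner if they are the same object.
}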
We first set a breakpoint at r3 and run in debug mode. Stepping into the join, execution reaches getDependencies, and we can see it takes the ShuffleDependency branch; the DAG on the 4040 UI confirms that a shuffle was triggered.
Now let's run the other variant by enabling this block:
val r1_1: RDD[(Int, String)] = r1.partitionBy(part)
val r2_1: RDD[(Int, String)] = r2.partitionBy(part)
val r3_1 = r1_1.join(r2_1,part)
println(r3_1.collect())
println(r1.partitioner)
println(r2.partitioner)
println(r1_1.partitioner)
println(r2_1.partitioner)
Again we step through in debug mode to see which dependency is chosen this time.
This time the dependency is the narrow OneToOneDependency, and the DAG shows that no shuffle happens inside the join.
We can also check the console output:
None
None
Some(com.doudou.batch.utils.MyPartitioner@7d42542)
Some(com.doudou.batch.utils.MyPartitioner@7d42542)
Here r1_1 and r2_1 share the same partitioner instance (note the identical @7d42542), while r1 and r2 have no partitioner at all.
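If you don't want to rely on the 4040 UI, another way to check (my suggestion, not from the original post) is to print the lineage; each extra indentation level in toDebugString marks a shuffle boundary:
// The only ShuffledRDDs in r3_1's lineage come from the two partitionBy calls;
// the join itself adds no new shuffle stage.
println(r3_1.toDebugString)
println(r3_1.partitioner)   // expected: Some(MyPartitioner ...), the result keeps the shared partitioner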
Summary
Although we avoided the shuffle here, scenarios where the two RDDs we want to join already share the same partitioner that we can extract and pass into join are rare. In most cases we first have to repartition the inputs ourselves, and that repartitioning is itself a shuffle.
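The pattern pays off mainly when the pre-partitioned RDD is reused across several joins: you pay for its partitionBy shuffle once, and the cached side is never shuffled again. A sketch reusing sc and part from the example above, with made-up RDDs:
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, "order-1"), (2, "order-2")))
val clicks = sc.parallelize(Seq((1, "click-1"), (2, "click-2")))

// Pay for the partitionBy shuffle on users once, then cache the result.
val usersByKey = users.partitionBy(part).cache()

// Each later join reuses that layout: usersByKey is never shuffled again;
// only the other side still has to be brought into the same partitioning.
val withOrders = usersByKey.join(orders.partitionBy(part), part)
val withClicks = usersByKey.join(clicks.partitionBy(part), part)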
In practice, weigh these options against your own workload and use them where they actually help.