以Spark2.X为例,其支持Hash、Range以及自定义分区器。
分区器决定了rdd数据在分布式运算时的分区个数以及数据在shuffle中发往的分区号,而分区的个数决定了reduce的个数;同样的shuffle过程中若分区器定义或选择不合适将大大增加数据倾斜的风险。综上,分区器的重要性不言而喻。
首先要知道
(1)Key-Value类型RDD才有分区器,非Key-Value类型RDD的分区值是None。
(2)每个RDD的分区ID范围为0~numPartitions-1,其决定数据所属分区。
Hash分区
对于给定的key,计算其hashCode并对分区个数取余。如果余数小于0,则用余数+分区的个数(否则加0),最后返回的值就是这个key所属的分区ID
弊端:
可能导致每个分区中数据量的不均匀,导致数据倾斜问题。
源码如下:
/**
* A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
* Java's `Object.hashCode`.
*
* Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
* so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
* produce an unexpected or incorrect result.
*/
class HashPartitioner(partitions: Int) extends Partitioner {
require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
def numPartitions: Int = partitions
def getPartition(key: Any): Int = key match {
case null => 0
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
override def equals(other: Any): Boolean = other match {
case h: HashPartitioner =>
h.numPartitions == numPartitions
case _ =>
false
}
override def hashCode: Int = numPartitions
}
Ranger分区
1、将一定范围内的数据映射到某一分区内,尽量保证数据的均匀分布。
2、分区间的数据是有序的,但分区内的元素不能保证有序的。如分区1内的数据为[1,5)内的整数值且无序,分区2的数据为[5,10)内的整数且无序,但分区1的数据全体小于分区2的数据。
实现过程:
第一步:从整个RDD中抽取样本数据进行排序得到每个分区的最大key值,而后形成一个Array[KEY],也就是key值范围rangeBounds。
第二步:判断key在rangeBounds中所处的范围,给出该key值在下一个RDD中的分区id下标。
部分源码如下:
/**
* A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
* equal ranges. The ranges are determined by sampling the content of the RDD passed in.
*
* @note The actual number of partitions created by the RangePartitioner might not be the same
* as the `partitions` parameter, in the case where the number of sampled records is less than
* the value of `partitions`.
*/
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true)
extends Partitioner {
// We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")
private var ordering = implicitly[Ordering[K]]
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
if (partitions <= 1) {
Array.empty
} else {
// This is the sample size we need to have roughly balanced output partitions, capped at 1M.
val sampleSize = math.min(20.0 * partitions, 1e6)
// Assume the input partitions are roughly balanced and over-sample a little bit.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
if (numItems == 0L) {
Array.empty
} else {
// If a partition contains much more than the average number of items, we re-sample from it
// to ensure that enough items are collected from that partition.
val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
val candidates = ArrayBuffer.empty[(K, Float)]
val imbalancedPartitions = mutable.Set.empty[Int]
sketched.foreach { case (idx, n, sample) =>
if (fraction * n > sampleSizePerPartition) {
imbalancedPartitions += idx
} else {
// The weight is 1 over the sampling probability.
val weight = (n.toDouble / sample.length).toFloat
for (key <- sample) {
candidates += ((key, weight))
}
}
}
if (imbalancedPartitions.nonEmpty) {
// Re-sample imbalanced partitions with the desired sampling probability.
val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
val seed = byteswap32(-rdd.id - 1)
val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
val weight = (1.0 / fraction).toFloat
candidates ++= reSampled.map(x => (x, weight))
}
RangePartitioner.determineBounds(candidates, partitions)
}
}
}
}
}
从上我们知道在获取rangeBounds时涉及到排序,也就是说该分区器要求RDD中的KEY类型是可以排序的。默认情况下采用的hash分区器。If any of the RDDs already has a partitioner, choose that one.Otherwise, we use a default HashPartitioner.
而实际上在生产环境中很多情况下key都是不可排序的复杂数据类型,同时由于实际业务需求的多样性,系统自带的分区器无法满足的情况下,自定义分区器就上场了。
Spark自定义分区器简单示例