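Common transformation operators on RDDs
0. intersection: set intersection
Purpose: returns the elements that two RDDs have in common, that is, elements present in the first RDD and also present in the second. The result is deduplicated.
Implementation: it is built on cogroup. map turns each element into a (key, null) pair with the element itself as the key, cogroup groups both sides by key, filter keeps only the keys whose two iterables are both non-empty, and keys extracts the keys.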
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
    .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
    .keys
}
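The same key-by-element, cogroup, filter, keys pipeline can be sketched on local collections. This is a minimal illustration in plain Scala with no Spark dependency; the names IntersectionSketch and the local cogroup helper are hypothetical stand-ins, not Spark APIs.

```scala
object IntersectionSketch {
  // Local stand-in for cogroup: for every key, collect the values from each side.
  def cogroup[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftValues = left.filter(_._1 == k).map(_._2)
      val rightValues = right.filter(_._1 == k).map(_._2)
      (k, (leftValues, rightValues))
    }.toMap
  }

  // Same shape as RDD.intersection: key by element, cogroup,
  // keep the keys seen on both sides, then take the keys.
  def intersection[T](left: Seq[T], right: Seq[T]): Set[T] =
    cogroup(left.map(v => (v, null)), right.map(v => (v, null)))
      .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
      .keySet
}
```

Because each element becomes a key, duplicates collapse into one entry, which is why the result is deduplicated.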
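1. subtract: set difference
Purpose: removes from the first RDD every element that also appears in the second RDD. Duplicates that remain in the first RDD are kept (no deduplication).
Requirement: both RDDs must have the same element type; any element type is allowed.
Implementation: map turns each element into a (k, v) pair (k is the element, v is null) and then calls subtractByKey, which creates a SubtractedRDD. The pairs of rdd1 are first gathered into a map of the form k -> Seq(values); the second RDD is then traversed and each of its keys is removed from the map. The relevant source: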
// the first dep is rdd1; add all values to the map
integrate(0, t => getSeq(t._1) += t._2)
// the second dep is rdd2; remove all of its keys
integrate(1, t => map.remove(t._1))
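The two integrate steps can be mimicked on local collections. The sketch below is plain Scala and only illustrates the idea; SubtractSketch is a hypothetical name, and a LinkedHashMap is used so that the output order is deterministic, which Spark does not promise.

```scala
import scala.collection.mutable

object SubtractSketch {
  def subtract[T](left: Seq[T], right: Seq[T]): Seq[T] = {
    // Step 1 (first dependency): add every (element, null) pair to the map.
    val map = mutable.LinkedHashMap.empty[T, mutable.ArrayBuffer[Null]]
    left.foreach { v =>
      map.getOrElseUpdate(v, mutable.ArrayBuffer.empty[Null]) += null
    }
    // Step 2 (second dependency): remove all of its keys from the map.
    right.foreach(k => map.remove(k))
    // Emit one key per surviving value, so duplicates from the left side remain.
    map.toSeq.flatMap { case (k, values) => values.map(_ => k) }
  }
}
```

For example, subtracting Seq(3, 4, 5) from Seq(1, 2, 2, 3, 4) leaves 1, 2, 2: the duplicate 2 survives because only keys found in the second collection are removed.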
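2. partitionBy (transformation)
Purpose: repartitions the data with the given partitioner.
Requirement: only RDDs of (K, V) pairs can call partitionBy.
Implementation: if the partitioner passed in equals the RDD's current partitioner, the RDD itself is returned unchanged; otherwise a new ShuffledRDD is created with the new partitioner.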
if (self.partitioner == Some(partitioner)) {
  self
} else {
  new ShuffledRDD[K, V, V](self, partitioner)
}
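What a hash-based partitioner does with each key can be sketched locally. This is a simplified illustration in plain Scala, assuming non-null keys; PartitionBySketch and partitionOf are hypothetical names, not the Spark classes.

```scala
object PartitionBySketch {
  // Non-negative hashCode modulo the partition count, the usual hash-partitioning rule.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Route every (k, v) pair to the partition chosen by its key.
  def partitionBy[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(K, V)]))
  }
}
```

All pairs with the same key land in the same partition, which is why the operator only makes sense for (K, V) data.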
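3. repartition: changing the number of partitions
Purpose: redistributes the data across a new number of partitions, which may be larger or smaller than before. It always calls coalesce with shuffle = true, so the job has two stages and a shuffle takes place whether the partition count goes up, goes down, or stays the same.
Requirement: the data does not have to be of (K, V) type.
Implementation: it calls coalesce with shuffle = true, which in turn creates a ShuffledRDD. The relevant part of coalesce: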
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](
        mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
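The round-robin keying done by distributePartition can be tried out on local collections. The following is a minimal sketch in plain Scala, not the Spark code path; RepartitionSketch and its methods are hypothetical names, and the hash partitioner is replaced by a plain modulo on the integer key.

```scala
import scala.util.Random
import scala.util.hashing.byteswap32

object RepartitionSketch {
  // Tag each element of one input partition with an increasing integer key,
  // starting from a pseudo-random position derived from the partition index.
  def distribute[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  // Stand-in for the shuffle: the key modulo numPartitions picks the output partition.
  def repartition[T](parts: Seq[Seq[T]], numPartitions: Int): Vector[Seq[T]] = {
    val keyed = parts.zipWithIndex.flatMap { case (items, idx) =>
      distribute(idx, items, numPartitions)
    }
    val grouped = keyed.groupBy { case (pos, _) => pos % numPartitions }
    Vector.tabulate(numPartitions) { i =>
      grouped.getOrElse(i, Seq.empty[(Int, T)]).map(_._2)
    }
  }
}
```

Since consecutive elements receive consecutive keys, each input partition spreads its records evenly over the output partitions, regardless of how skewed the input was.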
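4. coalesce: merging partitions
Purpose: changes the number of partitions with or without a shuffle (the shuffle parameter defaults to false), so the data may or may not be redistributed. Take 4 partitions reduced to 2 as an example: from the RDD's point of view, 4 partitions are merged into 2; from the task's point of view, one task reads the data that several upstream partitions would have supplied. If the partition count is increased while shuffle = false, the number of partitions stays the same, so such a call has no effect.
Note: the number of tasks in a stage is determined by the number of partitions of the last RDD in that stage.
Implementation: coalesce takes shuffle: Boolean = false by default. When shuffle is false, it simply wraps the current RDD in a new CoalescedRDD: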
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      // Note that the hash code of the key will just be the key itself. The HashPartitioner
      // will mod it with the number of total partitions.
      position = position + 1
      (position, t)
    }
  } : Iterator[(Int, T)]
  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
    new ShuffledRDD[Int, T, T](
      mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
      new HashPartitioner(numPartitions)),
    numPartitions,
    partitionCoalescer).values
} else {
  new CoalescedRDD(this, numPartitions, partitionCoalescer)
}
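The no-shuffle branch can be pictured as grouping whole parent partitions together. The sketch below is plain Scala with a deliberately simple grouping rule (neighbouring partitions are merged); the real partition coalescer also takes data locality into account, and CoalesceSketch is a hypothetical name.

```scala
object CoalesceSketch {
  def coalesce[T](parts: Vector[Seq[T]], numPartitions: Int): Vector[Seq[T]] =
    if (numPartitions >= parts.length) {
      // Without a shuffle the partition count cannot grow, so nothing changes.
      parts
    } else {
      // Parent partition i is assigned to output group i * numPartitions / parts.length.
      val groups = parts.zipWithIndex.groupBy { case (_, i) =>
        i * numPartitions / parts.length
      }
      // Each output partition reads its parent partitions whole; no record is re-routed.
      Vector.tabulate(numPartitions)(g => groups(g).flatMap(_._1))
    }
}
```

With 4 input partitions and numPartitions = 2, partitions 0 and 1 form the first output partition and partitions 2 and 3 form the second, matching the "one task reads several partitions" description.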
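5. Definition of shuffle
A shuffle occurs when the data of one upstream partition is spread over several downstream partitions (strictly speaking, the downstream tasks pull the data from upstream) and one downstream partition receives data from several upstream partitions. It is enough for this to be possible; it does not have to happen for every record.
More formally: in distributed computing, a shuffle redistributes data according to some rule such that (1) the data of an upstream partition may go to several downstream partitions, and (2) a downstream partition pulls its data from several upstream partitions. When both conditions hold, the operation is a shuffle.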
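6. sortBy and sortByKey: sorting
Sorting mechanism: (1) sample the data to estimate the key range and the number of records; (2) build a RangePartitioner so that keys in the same range are shuffled to the same partition. The sort itself uses memory and spills to disk when needed.
Ordering guarantee of sortBy: the data is split into partitions by key range, each partition is sorted internally, and the partitions themselves are in order, so the whole dataset is globally sorted.
Source walkthrough: sortBy first calls keyBy, which is a map that applies the key function to each element and emits (f(element), element). It then calls sortByKey, which builds a RangePartitioner (this runs a collect internally to sample the keys) and creates a ShuffledRDD with that partitioner. Finally, values drops the keys again.
// sortBy source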
/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f) // implemented with map; turns each element into a (k, v) pair
    .sortByKey(ascending, numPartitions)
    .values
}
// sortByKey source
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part) // the partitioner built above is passed to the ShuffledRDD
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
// This is where the RangePartitioner computes its range bounds
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
  if (partitions <= 1) {
    Array.empty
  } else {
    // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
    // Cast to double to avoid overflowing ints or longs
    val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
    // Assume the input partitions are roughly balanced and over-sample a little bit.
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
    val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    if (numItems == 0L) {
      Array.empty
    } else {
      // If a partition contains much more than the average number of items, we re-sample from it
      // to ensure that enough items are collected from that partition.
      val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
      val candidates = ArrayBuffer.empty[(K, Float)]
      val imbalancedPartitions = mutable.Set.empty[Int]
      sketched.foreach { case (idx, n, sample) =>
        if (fraction * n > sampleSizePerPartition) {
          imbalancedPartitions += idx
        } else {
          // The weight is 1 over the sampling probability.
          val weight = (n.toDouble / sample.length).toFloat
          for (key <- sample) {
            candidates += ((key, weight))
          }
        }
      }
      if (imbalancedPartitions.nonEmpty) {
        // Re-sample imbalanced partitions with the desired sampling probability.
        val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
        val seed = byteswap32(-rdd.id - 1)
        // An action is called here, which submits a job; its purpose is to sample the data
        // so that the partitioner can be built
        val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
        val weight = (1.0 / fraction).toFloat
        candidates ++= reSampled.map(x => (x, weight))
      }
      RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
    }
  }
}
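Summary:
1. sortBy can be understood as a bucket sort: the data is split into buckets, each bucket is sorted internally, and the buckets are ordered relative to one another, so the data is globally sorted. The component that makes this possible is RangePartitioner, which sends each key range to its own bucket.
2. sortBy is a transformation, but building the RangePartitioner calls collect (an action) internally to sample the data and learn the key range, so calling sortBy already triggers a job.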
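The bucket-sort view can be made concrete with a small local sketch. It is plain Scala, works on Int keys only, and picks the bounds from the full sorted data instead of a weighted sample, so it illustrates the idea rather than the RangePartitioner algorithm; RangeSortSketch is a hypothetical name.

```scala
object RangeSortSketch {
  // Pick (numPartitions - 1) upper bounds from a sorted "sample" (here: all of the data).
  def bounds(sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val sorted = sample.sorted
    if (sorted.isEmpty) Seq.empty[Int]
    else (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions))
  }

  def sortBy(data: Seq[Int], numPartitions: Int): Vector[Seq[Int]] = {
    val upperBounds = bounds(data, numPartitions)
    // Bucket index = number of bounds strictly below the key,
    // so keys up to and including a bound fall into that bound's bucket.
    val grouped = data.groupBy(x => upperBounds.count(b => x > b))
    // Sort inside each bucket; the buckets themselves are already ordered by range.
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[Int]).sorted)
  }
}
```

Reading the buckets in index order yields the fully sorted data, which is the global ordering sortBy provides.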
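Common action operators on RDDs
7. sum and reduce (actions)
Purpose: sum returns a Double. The aggregation function is first applied within each partition on the executors; the per-partition results are then sent to the driver, where the same function merges them into the final result.
sum is implemented as follows: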
def sum(): Double = self.withScope {
  self.fold(0.0)(_ + _)
}
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
  val cleanOp = sc.clean(op)
  val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
  val mergeResult = (_: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}
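reduce is more flexible than sum because the aggregation function is passed in, for example multiplication or max. The function should be commutative and associative, since partition results are merged in no fixed order; division, for instance, does not satisfy this. Like sum, reduce aggregates within each partition first and then globally.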
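The two-level structure of fold can be reproduced on local collections. This is a minimal sketch in plain Scala in which each inner Seq plays the role of one partition; FoldSketch is a hypothetical name.

```scala
object FoldSketch {
  def fold[T](parts: Seq[Seq[T]], zero: T)(op: (T, T) => T): T = {
    // Partition level: every partition is folded on its own, starting from the zero value.
    val taskResults = parts.map(_.fold(zero)(op))
    // Global level: the partition results are merged, again starting from the zero value.
    taskResults.fold(zero)(op)
  }
}
```

With 0.0 and addition this gives the ordinary sum. With a zero value that is not neutral for the operation, the zero is counted once per partition plus once in the merge: folding Seq(Seq(1, 2), Seq(3)) with zero 1 and addition yields 9 rather than 6.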
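8. aggregate ( aggregate(zeroValue)(seqOp, combOp) )
Purpose: a curried method that takes two functions. The first aggregates within a partition and the second merges the partition results. The zero value is applied once in every partition and once more in the global merge.
Result analysis: the order of the partition results in the output may differ between runs. The two tasks run in parallel, and whichever task finishes first has its result merged first.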
package com.zxx.spark.day05

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object AggregateDemo1 {
  def main(args: Array[String]): Unit = {
    // Create the connection to the Spark cluster
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    // Create an RDD with two partitions
    val rdd1: RDD[(String, Int)] = sc.parallelize(List(("spark",1),("hive",2),("spark",1),("flink",1),("flink",1),("kafka",2),("hadoop",2),("hadoop",6)), 2)
    // The zero value "#" is used once in each partition and once in the final merge
    val result: String = rdd1.aggregate("#")(_ + _, _ + _)
    // Possible output 1: ##(spark,1)(hive,2)(spark,1)(flink,1)#(flink,1)(kafka,2)(hadoop,2)(hadoop,6)
    // Possible output 2: ##(flink,1)(kafka,2)(hadoop,2)(hadoop,6)#(spark,1)(hive,2)(spark,1)(flink,1)
    println(result)
    // The order of the two halves can differ between runs: the two tasks run in parallel,
    // and the task that finishes first has its result merged first
    sc.stop()
  }
}
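Where the three "#" characters come from can be checked with a local model of aggregate. The sketch is plain Scala, treats each inner Seq as one partition, and always merges in partition order, which real tasks do not guarantee; AggregateSketch is a hypothetical name.

```scala
object AggregateSketch {
  def aggregate[T, U](parts: Seq[Seq[T]], zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // seqOp folds the elements of each partition, starting from the zero value.
    val partitionResults = parts.map(_.foldLeft(zero)(seqOp))
    // combOp merges the partition results, starting from the zero value once more.
    partitionResults.foldLeft(zero)(combOp)
  }
}
```

Calling it with the partitions Seq("a", "b") and Seq("c"), the zero value "#" and string concatenation for both functions gives "##ab#c": one "#" from the merge and one from each of the two partitions.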
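9. foreach and foreachPartition
Both foreach and foreachPartition are actions. foreach processes the data one record at a time, while foreachPartition hands over an iterator for a whole partition.
foreach: takes the records one by one and passes each to a function that returns Unit. If the function prints, for example, the output appears in the executor logs, not on the driver.
foreachPartition: works per partition, and each partition corresponds to one task. This is the usual choice for writing data to a database: one connection per partition instead of one per record, which is more efficient.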
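The "one connection per partition" pattern can be sketched without Spark or a real database. Everything below is illustrative plain Scala: FakeConnection is a hypothetical stand-in that only records what is written, and each inner Seq stands for one partition.

```scala
import scala.collection.mutable.ArrayBuffer

object ForeachPartitionSketch {
  // Hypothetical stand-in for a database connection; it only records the rows written.
  final class FakeConnection(sink: ArrayBuffer[String]) {
    def write(row: String): Unit = { sink += row }
    def close(): Unit = ()
  }

  // Opens one connection per partition and returns how many connections were opened.
  def writePerPartition(parts: Seq[Seq[String]], sink: ArrayBuffer[String]): Int = {
    var opened = 0
    parts.foreach { rows =>
      val conn = new FakeConnection(sink)
      opened += 1
      // All rows of the partition share the same connection; it is closed even on failure.
      try rows.foreach(row => conn.write(row))
      finally conn.close()
    }
    opened
  }
}
```

For three rows spread over two partitions, two connections are opened, whereas a per-record approach would open three.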
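10. min and max
Purpose: returns the smallest or the largest element of the RDD.
Implementation: both call reduce. reduce first computes the minimum (or maximum) of each partition, then merges the per-partition results into the global minimum (or maximum); it is executed through runJob, and the result is returned to the driver.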
/**
 * Returns the min of this RDD as defined by the implicit Ordering[T].
 * @return the minimum element of the RDD
 * */
def min()(implicit ord: Ordering[T]): T = withScope {
  this.reduce(ord.min)
}
/**
 * Returns the max of this RDD as defined by the implicit Ordering[T].
 * @return the maximum element of the RDD
 * */
def max()(implicit ord: Ordering[T]): T = withScope {
  this.reduce(ord.max)
}
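11. take and takeOrdered
Purpose: take(n) takes n elements from the partitions. It starts with partition 0; if partition 0 holds fewer than n elements, it continues with the following partitions, and if partition 0 already holds enough, only partition 0 is read. The partitions do not need to be sorted.
takeOrdered is similar to top, except that it returns the elements in the opposite order. takeOrdered sorts internally.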
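The difference between the two can be seen in a local sketch. It is plain Scala with each inner Seq standing for one partition; it reads partitions strictly one at a time, which simplifies how a real job would scan them, and TakeSketch is a hypothetical name.

```scala
object TakeSketch {
  // Read partitions front to back and stop as soon as n elements have been collected.
  def take[T](parts: Seq[Seq[T]], n: Int): Seq[T] = {
    val result = Seq.newBuilder[T]
    var taken = 0
    val it = parts.iterator
    while (taken < n && it.hasNext) {
      val chunk = it.next().take(n - taken)
      result ++= chunk
      taken += chunk.length
    }
    result.result()
  }

  // Keep the n smallest of each partition, then merge the candidates and keep the n smallest.
  def takeOrdered[T](parts: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    parts
      .map(_.sorted.take(n))
      .foldLeft(Seq.empty[T])((acc, next) => (acc ++ next).sorted.take(n))
}
```

take returns elements in partition order without looking at their values, while takeOrdered has to inspect every partition because the smallest elements can be anywhere.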
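12. count
Purpose: returns the total number of records. Each partition is counted first (the counter is incremented for every record read), the per-partition counts are returned to the driver as an array, and the driver calls sum on that array.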
/**
 * Return the number of elements in the RDD.
 * In essence this calls sc.runJob, which builds the DAG and submits the job.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
/**
 * Counts the number of elements of an iterator using a while loop rather than calling
 * [[scala.collection.Iterator#size]] because it uses a for loop, which is slightly slower
 * in the current version of Scala.
 */
// Iterates over one partition, adding one per record; the per-partition counts are
// returned to the driver, which sums them
def getIteratorSize(iterator: Iterator[_]): Long = {
  var count = 0L
  while (iterator.hasNext) {
    count += 1L
    iterator.next()
  }
  count
}
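13. top
top is implemented on top of takeOrdered. takeOrdered uses mapPartitions to put the data of each partition into a bounded priority queue, and the queues returned by the partitions are then merged with ++=. No shuffle is involved.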
/**
 * Returns the top k (largest) elements from this RDD as defined by the specified
 * implicit Ordering[T] and maintains the ordering. This does the opposite of
 * [[takeOrdered]]. For example:
 * {{{
 *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
 *   // returns Array(12)
 *
 *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
 *   // returns Array(6, 5)
 * }}}
 *
 * @note This method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 *
 * @param num k, the number of top elements to return
 * @param ord the implicit ordering for T
 * @return an array of top elements
 */
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  takeOrdered(num)(ord.reverse)
}
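The bounded-queue idea can be sketched with the standard library's PriorityQueue. This is an illustration in plain Scala, not Spark's internal queue class; TopSketch and boundedAdd are hypothetical names, and each inner Seq stands for one partition.

```scala
import scala.collection.mutable

object TopSketch {
  // Add an element, then evict the queue's head if the bound is exceeded.
  def boundedAdd[T](queue: mutable.PriorityQueue[T], num: Int, elem: T): Unit = {
    queue.enqueue(elem)
    if (queue.size > num) queue.dequeue()
  }

  def top[T](parts: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): Seq[T] = {
    // With the reversed ordering the head of the queue is the smallest element,
    // so evicting the head keeps the num largest elements.
    val perPartition = parts.map { items =>
      val queue = mutable.PriorityQueue.empty[T](ord.reverse)
      items.foreach(boundedAdd(queue, num, _))
      queue
    }
    // Merge the small per-partition queues; only num elements per partition are moved.
    val merged = mutable.PriorityQueue.empty[T](ord.reverse)
    perPartition.foreach(queue => queue.foreach(boundedAdd(merged, num, _)))
    merged.toList.sorted(ord.reverse)
  }
}
```

Each partition contributes at most num candidates, so the amount of data merged stays small even for large inputs, and no shuffle is needed.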
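14. sum
Purpose: computes the sum of each partition first and then the global sum. reduce also aggregates per partition and then globally, but it is more flexible because the aggregation function is supplied by the caller.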
/**
 * Extra functions available on RDDs of Doubles through an implicit conversion.
 */
class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable {
  /** Add up the elements in this RDD. */
  def sum(): Double = self.withScope {
    self.fold(0.0)(_ + _)
  }
  // fold aggregates within each partition first and then merges the partition results
  // ... (other members omitted)
}