pivot operator, Spark SQL, Spark count operator

Reprinted

mob6454cc64c0a4 2023-11-13 14:31:37


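Common RDD transformation operators

0. intersection: computing the intersection

Function: returns the elements that two RDDs (or two collections) have in common, that is, elements present in the first RDD and also in the second. The result is deduplicated.

Under the hood: it calls cogroup. A map first turns each element into a pair with the element itself as the key and null as the value. The cogrouped result is then filtered to keep only the keys for which both iterators are non-empty, and keys extracts the keys.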

def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }
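The cogroup-then-filter idea can be tried out on ordinary Scala collections. The following is a minimal sketch, not Spark code: `IntersectionSketch` and its local `cogroup` helper are hypothetical stand-ins that only imitate the shape of the source above.

```scala
object IntersectionSketch {
  // Hypothetical stand-in for cogroup: for every key, gather the values from each side.
  def cogroup[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val ls = left.filter(_._1 == k).map(_._2)
      val rs = right.filter(_._1 == k).map(_._2)
      (k, (ls, rs))
    }.toMap
  }

  // Same steps as the source: key each element by itself, cogroup,
  // keep keys that have values on both sides, then take the keys.
  def intersection[T](a: Seq[T], b: Seq[T]): Set[T] =
    cogroup(a.map(v => (v, null)), b.map(v => (v, null)))
      .filter { case (_, (ls, rs)) => ls.nonEmpty && rs.nonEmpty }
      .keySet
}
```

Because every element becomes a map key, duplicates collapse automatically, which is why the result is deduplicated.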

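1. subtract: computing the difference

Function: removes from the first RDD every element that also appears in the second RDD. It works on collections as well as RDDs. The result is not deduplicated.

Requirement: both RDDs must have the same element type; the type itself can be anything.

Under the hood: a map turns each value into a (k, v) pair, where k is the element itself and v is null, and then subtractByKey is called. subtractByKey creates a new SubtractedRDD, which collects the data into a map: all records of rdd1 are first added as (k, Seq(values)) entries, then the second RDD is traversed and each of its keys is removed from the map. The relevant source is: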

// the first dep is rdd1; add all values to the map
integrate(0, t => getSeq(t._1) += t._2)
// the second dep is rdd2; remove all of its keys
integrate(1, t => map.remove(t._1))
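The two `integrate` calls can be imitated with a local mutable map. This is a sketch under assumptions: `SubtractSketch` is a hypothetical name, and it uses a `LinkedHashMap` only to keep the output order predictable, which is not what Spark does internally.

```scala
import scala.collection.mutable

object SubtractSketch {
  def subtractByKey[K, V](rdd1: Seq[(K, V)], rdd2: Seq[(K, V)]): Seq[(K, V)] = {
    // Assumption: LinkedHashMap for a deterministic order in this sketch.
    val map = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    // First dependency: add all values of rdd1 to the map.
    rdd1.foreach { case (k, v) =>
      map.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
    }
    // Second dependency: remove every key that appears in rdd2.
    rdd2.foreach { case (k, _) => map.remove(k) }
    map.toSeq.flatMap { case (k, vs) => vs.map(v => (k, v)) }
  }

  // subtract keys each element by itself with a null value, as described above.
  def subtract[T](a: Seq[T], b: Seq[T]): Seq[T] =
    subtractByKey(a.map(v => (v, null)), b.map(v => (v, null))).map(_._1)
}
```

Since the values of a key are kept in a sequence rather than collapsed, an element that occurs twice in the first collection and never in the second comes out twice.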

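2. partitionBy (transformation operator)

Function: repartitions the data using the specified partitioner.

Requirement: only RDDs of (K, V) pairs can call partitionBy.

Under the hood: if the partitioner passed in equals the RDD's current partitioner, the original RDD is returned unchanged; otherwise a new ShuffledRDD is created with the new partitioner.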

if (self.partitioner == Some(partitioner)) {
  self
} else {
  new ShuffledRDD[K, V, V](self, partitioner)
}
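To see what a partitioner decides for each record, here is a small local model of hash partitioning: the key's hashCode taken modulo the partition count, kept non-negative. `PartitionBySketch` and its helpers are hypothetical names for illustration and run on plain collections, not on an RDD.

```scala
object PartitionBySketch {
  // Modulo that never returns a negative partition index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    raw + (if (raw < 0) mod else 0)
  }

  // Hash partitioning rule: null keys go to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  // Group (K, V) records by the partition their key is assigned to.
  def partitionBy[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = data.groupBy { case (k, _) => getPartition(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(K, V)]))
  }
}
```

Only the key takes part in the decision, which is why the operator is restricted to (K, V) data.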

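3. repartition: changing the number of partitions

Function: redistributes the data across a new set of partitions. The number of partitions can go up or down. The job has two stages, and a shuffle always takes place whether the partition count increases, decreases, or stays the same, because repartition calls coalesce with shuffle = true.

Requirement: the data does not have to be of (K, V) type.

Under the hood: it calls coalesce with shuffle = true, which in turn creates a ShuffledRDD.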

if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }

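4. coalesce: merging partitions

Function: coalesce can run with or without a shuffle (the shuffle parameter defaults to false), so the data may or may not be redistributed. Take going from 4 partitions to 2 as an example: from the RDD's point of view, 4 partitions are merged into 2; from the task's point of view, one task reads the data of several parent partitions. If you ask for more partitions while shuffle = false, the partition count stays unchanged, so that call is pointless.

Note: the number of tasks in a stage is determined by the number of partitions of the last RDD in that stage.

Under the hood: the coalesce method has the default parameter shuffle: Boolean = false. When shuffle is false, it simply wraps the current RDD in a new CoalescedRDD.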

if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
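Both branches can be modelled on local collections. The sketch below is not Spark code: `CoalesceSketch` is a hypothetical name, `withShuffle` reuses the position-counter idea from `distributePartition`, and `withoutShuffle` groups neighbouring parent partitions by index, which is a simplification (the real CoalescedRDD also considers data locality).

```scala
import scala.util.Random
import scala.util.hashing

object CoalesceSketch {
  // shuffle = true: each input partition starts at a seeded random position and
  // hands out consecutive keys, so its records are spread round-robin.
  def withShuffle[T](inputPartitions: Seq[Seq[T]], numPartitions: Int): Vector[Seq[T]] = {
    val keyed = inputPartitions.zipWithIndex.flatMap { case (items, index) =>
      var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        position = position + 1
        (position % numPartitions, t) // the hash partitioner takes the key modulo numPartitions
      }
    }
    Vector.tabulate(numPartitions)(p => keyed.filter(_._1 == p).map(_._2))
  }

  // shuffle = false: parent partitions are merged; asking for more partitions changes nothing.
  // Assumption: grouping by index only. numPartitions is expected to be at least 1.
  def withoutShuffle[T](parents: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] =
    if (numPartitions >= parents.length) parents
    else {
      val n = parents.length
      (0 until numPartitions).map { g =>
        parents.zipWithIndex.collect { case (p, i) if i * numPartitions / n == g => p }.flatten
      }
    }
}
```

With shuffle, which output partition a record lands in depends on the seeded start position, but the sizes stay balanced; without shuffle, each output partition is just the concatenation of some parents.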

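5. Definition of shuffle

Data from one upstream partition is sent to several downstream partitions (strictly speaking, the downstream tasks pull the data from upstream), and one downstream partition receives data from several upstream partitions. As long as this is possible, there is a shuffle.

Formal definition: in distributed computing, a shuffle redistributes data according to some rule. Condition 1: the data of one upstream partition is spread over multiple downstream partitions. Condition 2: one downstream partition pulls data from multiple upstream partitions. When both conditions hold, the operation is a shuffle.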
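The two conditions can be written down as a small check. In this sketch the possible data flows are given as (upstream partition, downstream partition) pairs; `ShuffleCheckSketch` and `isShuffle` are hypothetical names, and the function only encodes the definition above, it does not inspect a real RDD lineage.

```scala
object ShuffleCheckSketch {
  // edges: every (upstream, downstream) pair along which data may flow.
  def isShuffle(edges: Set[(Int, Int)]): Boolean = {
    // Condition 1: some upstream partition feeds more than one downstream partition.
    val fanOut = edges.groupBy(_._1).values.exists(_.size > 1)
    // Condition 2: some downstream partition reads from more than one upstream partition.
    val fanIn = edges.groupBy(_._2).values.exists(_.size > 1)
    fanOut && fanIn
  }
}
```

A map-style one-to-one flow fails both conditions, and a coalesce without shuffle satisfies only the second, so neither counts as a shuffle under this definition.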

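6. sortBy and sortByKey: sorting operators

Sorting mechanism: 1. Sample the data to learn its range and size. 2. Create a RangePartitioner, which shuffles each range of keys into the same partition (the sort itself uses memory and spills to disk).

sortBy ordering: the data is partitioned by key range, each partition is sorted internally, and the partitions themselves are ordered relative to one another, so the data is globally sorted.

Source analysis: sortBy first calls keyBy, which is implemented with map: it applies the function to each element to produce the key and keeps the element itself as the value. sortByKey is then called; it builds a RangePartitioner (which triggers a collect to sample the data) and creates a new ShuffledRDD.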

// sortBy source
/**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f) // implemented with map: turns each element into a (k, v) pair
        .sortByKey(ascending, numPartitions)
        .values
  }

// sortByKey source
  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part) // pass the newly built partitioner to the ShuffledRDD
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

// How RangePartitioner computes its range bounds
// An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      // Cast to double to avoid overflowing ints or longs
      val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          // This calls an action, which runs a job to sample the data used to build the partitioner
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
      }
    }
  }

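Summary:

1. sortBy can be thought of as a bucket sort: the data is split into buckets, each bucket is sorted internally, and the buckets are ordered relative to one another, so the data is globally sorted. The component that makes this possible is the RangePartitioner, which sends each range of keys to its own bucket.

2. sortBy is a transformation, but internally it calls collect (an action) in order to sample the data and learn its range when building the RangePartitioner.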
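The bucket-sort view can be checked with a few lines of plain Scala. This is a simplified sketch, not the RangePartitioner algorithm: `RangeSortSketch` is a hypothetical name, and it uses the whole dataset as its "sample" instead of the weighted sampling shown in the source.

```scala
object RangeSortSketch {
  // Pick (numPartitions - 1) upper bounds from a sorted sample.
  def bounds(sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val sorted = sample.sorted
    if (sorted.isEmpty) Seq.empty
    else (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions))
  }

  // A key goes to the first partition whose upper bound is >= the key,
  // or to the last partition if it exceeds every bound.
  def partitionOf(key: Int, bounds: Seq[Int]): Int =
    bounds.indexWhere(key <= _) match {
      case -1 => bounds.length
      case i  => i
    }

  // Range-partition the data, then sort inside each partition.
  def sortBy(data: Seq[Int], numPartitions: Int): Vector[Seq[Int]] = {
    val b = bounds(data, numPartitions) // assumption: sample = all data
    val buckets = data.groupBy(partitionOf(_, b))
    Vector.tabulate(numPartitions)(p => buckets.getOrElse(p, Seq.empty[Int]).sorted)
  }
}
```

Reading the partitions in index order yields the globally sorted sequence, because every key in partition i is no larger than any key in partition i + 1.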

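Common RDD action operators

7. sum and reduce (action operators)

Function: sum returns a Double. The aggregation function is first applied within each partition on the executors; the per-partition results are then sent back to the driver, where they are merged into the final result.

Under the hood, sum calls fold: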

def sum(): Double = self.withScope {
  self.fold(0.0)(_ + _)
}
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  // Clone the zero value since we will also be serializing it as part of tasks
  var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
  val cleanOp = sc.clean(op)
  val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
  val mergeResult = (_: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}

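reduce is more flexible than sum: you pass in your own aggregation function, for example multiplication, and it likewise aggregates within each partition first and then globally. The function should be commutative and associative, so an operation such as division does not give a well-defined result.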
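The local-then-global pattern of fold can be modelled with nested sequences, where each inner sequence plays the role of one partition. `FoldSketch` is a hypothetical name; the first step corresponds to `foldPartition` and the second to `mergeResult` in the source above.

```scala
object FoldSketch {
  def fold[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T = {
    // Step 1 (per task): fold each partition, starting from the zero value.
    val taskResults = partitions.map(_.fold(zeroValue)(op))
    // Step 2 (merge): combine the task results, again starting from the zero value.
    taskResults.fold(zeroValue)(op)
  }

  // sum is fold with zero value 0.0 and addition.
  def sum(partitions: Seq[Seq[Double]]): Double = fold(partitions, 0.0)(_ + _)
}
```

Passing a different function, such as `_ * _` with zero value 1, gives a product with the same two-step structure.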

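8. aggregate ( aggregate(zeroValue)(seqOp, combOp) )

Function: a curried method that takes two functions. The first aggregates within each partition and the second merges the per-partition results. The zero value is applied once in every partition-level aggregation and once more in the global merge.

Result analysis: the order of the partial results in the output can differ between runs. The tasks run in parallel, and whichever task finishes first has its result merged first.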

package com.zxx.spark.day05

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object AggregateDemo1 {
  def main(args: Array[String]): Unit = {
    // Create the connection to the Spark cluster (local mode here)
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    // Create an RDD of (word, count) pairs with 2 partitions
    val rdd1: RDD[(String, Int)] = sc.parallelize(List(("spark",1),("hive",2),("spark",1),("flink",1),("flink",1),("kafka",2),("hadoop",2),("hadoop",6)), 2)
    // "#" is the zero value; both the partition-level and the global function concatenate strings
    val result: String = rdd1.aggregate("#")(_ + _, _ + _)
    // Possible output 1: ##(spark,1)(hive,2)(spark,1)(flink,1)#(flink,1)(kafka,2)(hadoop,2)(hadoop,6)
    // Possible output 2: ##(flink,1)(kafka,2)(hadoop,2)(hadoop,6)#(spark,1)(hive,2)(spark,1)(flink,1)
    println(result)
    // The order of the partial results can differ between runs: the two tasks run in
    // parallel, and whichever finishes first has its result merged first.
    sc.stop()
  }
}
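To count how often the zero value is used, the same computation can be modelled without Spark. In this sketch `AggregateSketch` is a hypothetical name and the partitions are merged in index order, whereas a real job merges them in completion order.

```scala
object AggregateSketch {
  def aggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // Partition level: the zero value is used once per partition.
    val partials = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // Global level: the zero value is used once more when merging.
    partials.foldLeft(zeroValue)(combOp)
  }
}
```

With two partitions and zero value "#", the result therefore contains three "#" characters: one per partition plus one from the merge, matching the outputs shown in the demo.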

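9. foreach and foreachPartition

foreach and foreachPartition are both actions. foreach processes the data one record at a time, while foreachPartition receives an iterator over a whole partition.

foreach: takes the records one by one and passes each to a function that returns Unit. If the function prints, for example, the output appears in the executor logs.

foreachPartition: works per partition, and each partition corresponds to one task. This is useful when writing to a database: one connection per partition is more efficient than one per record.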
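The difference in connection count is easy to demonstrate with a fake connection. Everything here is hypothetical and for illustration only: `FakeConnection` merely records rows in memory, and the two functions return how many connections they opened.

```scala
import scala.collection.mutable.ArrayBuffer

object ForeachPartitionSketch {
  // Hypothetical stand-in for a database connection.
  final class FakeConnection {
    val rows: ArrayBuffer[Int] = ArrayBuffer.empty[Int]
    def write(row: Int): Unit = { rows += row; () }
    def close(): Unit = ()
  }

  // foreach style: a connection is opened and closed for every record.
  def writePerRecord(partitions: Seq[Seq[Int]]): Int = {
    var opened = 0
    partitions.foreach { partition =>
      partition.foreach { row =>
        val conn = new FakeConnection
        opened += 1
        conn.write(row)
        conn.close()
      }
    }
    opened
  }

  // foreachPartition style: one connection serves the whole partition iterator.
  def writePerPartition(partitions: Seq[Seq[Int]]): Int = {
    var opened = 0
    partitions.foreach { partition =>
      val conn = new FakeConnection
      opened += 1
      partition.iterator.foreach(conn.write)
      conn.close()
    }
    opened
  }
}
```

For five records in two partitions, the first style opens five connections and the second opens two.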

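10. min and max

Function: returns the minimum or maximum element of the RDD.

Under the hood: both call reduce. The minimum (or maximum) of each partition is computed first, then runJob returns the per-partition results to the driver, which merges them into the overall minimum (or maximum).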

/**
   * Returns the min of this RDD as defined by the implicit Ordering[T].
   * @return the minimum element of the RDD
   * */
  def min()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.min)
  }

/**
   * Returns the max of this RDD as defined by the implicit Ordering[T].
   * @return the maximum element of the RDD
   * */
  def max()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.max)
  }

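11. take and takeOrdered

Function: take(n) returns n elements. It starts with partition 0; if partition 0 holds fewer than n elements it continues with the following partitions, and if partition 0 has enough it reads only from partition 0. This operator does not sort the partitions.

takeOrdered is similar to top, except that it returns the elements in the opposite order. takeOrdered sorts under the hood.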
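The partition-by-partition behaviour of take can be sketched as a loop over local sequences. `TakeSketch` is a hypothetical name, and scanning exactly one more partition per round is a simplification of this sketch. It returns the elements together with the number of partitions it had to read.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeSketch {
  def take[T](partitions: Seq[Seq[T]], n: Int): (Seq[T], Int) = {
    val buf = ArrayBuffer.empty[T]
    var scanned = 0
    // Read partitions in index order until n elements are collected or none are left.
    while (buf.size < n && scanned < partitions.length) {
      buf ++= partitions(scanned).take(n - buf.size)
      scanned += 1
    }
    (buf.toList, scanned)
  }
}
```

No ordering is applied anywhere, so the result is simply the first n elements in partition order.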

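12. count

Function: returns the total number of records. Each partition is counted first (the counter is incremented for every record read), the per-partition counts are returned to the driver as an array, and the driver calls sum on them.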

/**
   * Return the number of elements in the RDD.
   * In essence this calls sc.runJob, which submits a job for the RDD's DAG.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

  /**
   * Counts the number of elements of an iterator using a while loop rather than calling
   * [[scala.collection.Iterator#size]] because it uses a for loop, which is slightly slower
   * in the current version of Scala.
   */
  // Iterates over a partition, incrementing the counter for every record; the count is returned to the driver, which sums the results

  def getIteratorSize(iterator: Iterator[_]): Long = {
    var count = 0L
    while (iterator.hasNext) {
      count += 1L
      iterator.next()
    }
    count
  }

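13. top

top calls takeOrdered, which uses mapPartitions to put each partition's elements into a bounded priority queue and then merges the per-partition queues with ++=. No shuffle is involved.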

/**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of
   * [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }
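The bounded-queue idea can be reproduced with the standard `mutable.PriorityQueue`. This is a sketch on local collections, not the Spark implementation: `TopSketch` and `boundedSmallest` are hypothetical names, and the size bound is enforced by hand.

```scala
import scala.collection.mutable

object TopSketch {
  // Keep only the `num` smallest elements (by ord) using a heap whose head is the largest kept.
  def boundedSmallest[T](items: Iterator[T], num: Int)(implicit ord: Ordering[T]): Seq[T] = {
    val heap = mutable.PriorityQueue.empty[T](ord)
    items.foreach { x =>
      if (heap.size < num) heap.enqueue(x)
      else if (num > 0 && ord.lt(x, heap.head)) {
        heap.dequeue()
        heap.enqueue(x)
      }
    }
    heap.toSeq
  }

  // Per partition: bounded queue. Then merge the small per-partition results and sort them.
  def takeOrdered[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): Seq[T] = {
    val perPartition = partitions.map(p => boundedSmallest(p.iterator, num))
    boundedSmallest(perPartition.flatten.iterator, num).sorted(ord)
  }

  // top is takeOrdered with the ordering reversed, as in the source above.
  def top[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(partitions, num)(ord.reverse)
}
```

Each partition contributes at most `num` elements to the merge, which is why no shuffle is needed and why the result should be small.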

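14. sum

Function: sums each partition first and then sums the partial results. reduce also aggregates locally and then globally, but it is more flexible because it accepts any aggregation function.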

/**
 * Extra functions available on RDDs of Doubles through an implicit conversion.
 */
class DoubleRDDFunctions(self: RDD[Double]) extends Logging with Serializable {
  /** Add up the elements in this RDD. */
  def sum(): Double = self.withScope {
    self.fold(0.0)(_ + _)
  }

  // fold aggregates within each partition first and then merges the partial results
}
