Action operators trigger the actual execution of a Spark RDD's chain of operations. An action does not produce a new RDD; instead it returns the RDD's data to the driver as ordinary Scala values, or writes it out to an external storage system. Writing RDD data to external systems is also an action, but it is not covered here.
collect
Returns all elements of the RDD to the driver as an Array.
Source code
/**
* Return an array that contains all of the elements in this RDD.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T] = withScope {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
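Below is a minimal usage sketch of collect (the object name and sample data are illustrative); as the note above says, all elements are pulled into the driver, so it should only be used on small RDDs:
import org.apache.spark.{SparkConf, SparkContext}

object CollectExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CollectExample")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
    // collect pulls every element back to the driver as a local Array
    val arr = rdd.collect()
    println(arr.mkString(", ")) // 1, 2, 3, 4, 5
    sc.stop()
  }
}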
first
Returns the first element of the RDD; the RDD is not sorted when the element is fetched.
/**
* Return the first element in this RDD.
*/
def first(): T = withScope {
take(1) match {
case Array(t) => t
case _ => throw new UnsupportedOperationException("empty collection")
}
}
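A small illustrative sketch of first (the object name and data are made up for demonstration):
import org.apache.spark.{SparkConf, SparkContext}

object FirstExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("FirstExample")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(3, 1, 2), 2)
    // first is implemented as take(1); no sorting happens, so this prints 3
    println(rdd.first())
    sc.stop()
  }
}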
take
Returns the first n elements of the RDD.
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @note Due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of `Nothing` or `Null`.
*/
def take(num: Int): Array[T] = withScope {
val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
if (num == 0) {
new Array[T](0)
} else {
val buf = new ArrayBuffer[T]
val totalParts = this.partitions.length
var partsScanned = 0
while (buf.size < num && partsScanned < totalParts) {
// The number of partitions to try in this iteration. It is ok for this number to be
// greater than totalParts because we actually cap it at totalParts in runJob.
var numPartsToTry = 1L
val left = num - buf.size
if (partsScanned > 0) {
// If we didn't find any rows after the previous iteration, quadruple and retry.
// Otherwise, interpolate the number of partitions we need to try, but overestimate
// it by 50%. We also cap the estimation in the end.
if (buf.isEmpty) {
numPartsToTry = partsScanned * scaleUpFactor
} else {
// As left > 0, numPartsToTry is always >= 1
numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
}
}
val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
res.foreach(buf ++= _.take(num - buf.size))
partsScanned += p.size
}
buf.toArray
}
}
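A short illustrative sketch of take (object name and data are made up); the elements come back in partition order, not sorted:
import org.apache.spark.{SparkConf, SparkContext}

object TakeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("TakeExample")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(5, 4, 3, 2, 1), 2)
    // returns the first 3 elements in partition order: 5, 4, 3
    println(rdd.take(3).mkString(", "))
    sc.stop()
  }
}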
top
Sorts the RDD in descending order and returns the first n elements.
Source code
/**
* Returns the top k (largest) elements from this RDD as defined by the specified
* implicit Ordering[T] and maintains the ordering. This does the opposite of
* [[takeOrdered]]. For example:
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
* // returns Array(12)
*
* sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
* // returns Array(6, 5)
* }}}
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @param num k, the number of top elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
takeOrdered(num)(ord.reverse)
}
The implicit Ordering parameter lets you specify which field to sort by, as in the example below:
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object WriteSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[*]").setAppName("WriteSpark")
val sc = new SparkContext(conf)
val realDate1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3), ("f", 4)), 3)
realDate1.top(3)(Ordering.by[(String, Int), Int](_._2))
sc.stop()
}
}
takeOrdered
Source code
/**
* Returns the first k (smallest) elements from this RDD as defined by the specified
* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
* For example:
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
* // returns Array(2)
*
* sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
* // returns Array(2, 3)
* }}}
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @param num k, the number of elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0) {
Array.empty
} else {
val mapRDDs = mapPartitions { items =>
// Priority keeps the largest elements, so let's reverse the ordering.
val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
queue ++= collectionUtils.takeOrdered(items, num)(ord)
Iterator.single(queue)
}
if (mapRDDs.partitions.length == 0) {
Array.empty
} else {
mapRDDs.reduce { (queue1, queue2) =>
queue1 ++= queue2
queue1
}.toArray.sorted(ord)
}
}
}
This works the same way as top, except that top passes the reversed ordering in its source, so it returns the largest elements instead of the smallest. Example:
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object WriteSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[*]").setAppName("WriteSpark")
val sc = new SparkContext(conf)
val realDate1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3), ("f", 4)), 3)
realDate1.takeOrdered(3)(Ordering.by[(String, Int), Int](_._2))
sc.stop()
}
}
reduce
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T = withScope {
val cleanF = sc.clean(f)
val reducePartition: Iterator[T] => Option[T] = iter => {
if (iter.hasNext) {
Some(iter.reduceLeft(cleanF))
} else {
None
}
}
var jobResult: Option[T] = None
val mergeResult = (index: Int, taskResult: Option[T]) => {
if (taskResult.isDefined) {
jobResult = jobResult match {
case Some(value) => Some(f(value, taskResult.get))
case None => taskResult
}
}
}
sc.runJob(this, reducePartition, mergeResult)
// Get the final result out of our Option, or throw an exception if the RDD was empty
jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
Aggregates all elements of the RDD with the given binary operator. Example:
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object WriteSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[*]").setAppName("WriteSpark")
val sc = new SparkContext(conf)
val realDate1 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3), ("f", 4)), 3)
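// Note: the value part is computed as t2._2 + t2._2 rather than t1._2 + t2._2, so the
// second field of the result is twice the last merged value, not the total sum.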
println(realDate1.reduce((t1, t2) => (t1._1 + "_" + t2._1, t2._2 + t2._2)))
sc.stop()
}
}
Output: (c_f_b_a,2)
aggregate
Source code
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using
* given combine functions and a neutral "zero value". This function can return a different result
* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
* allowed to modify and return their first argument instead of creating a new U to avoid memory
* allocation.
*
* @param zeroValue the initial value for the accumulated result of each partition for the
* `seqOp` operator, and also the initial value for the combine results from
* different partitions for the `combOp` operator - this will typically be the
* neutral element (e.g. `Nil` for list concatenation or `0` for summation)
* @param seqOp an operator used to accumulate results within a partition
* @param combOp an associative operator used to combine results from different partitions
*/
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
val cleanSeqOp = sc.clean(seqOp)
val cleanCombOp = sc.clean(combOp)
val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
sc.runJob(this, aggregatePartition, mergeResult)
jobResult
}
zeroValue
Sets the initial value for the aggregation (it is applied in every partition and again when combining the partition results).
seqOp
Aggregates each element of the RDD into an accumulator of type U within a partition.
combOp
Merges accumulators across partitions; it is invoked for the final combination of the per-partition results.
import org.apache.spark.{SparkConf, SparkContext}

object Chapter5_2_1_7 {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Chapter5_2_1_7")
val sc = new SparkContext(conf)
import collection.mutable.ListBuffer
val rddData1 = sc.parallelize(
Array(
("用户1", "接口1"),
("用户2", "接口1"),
("用户1", "接口1"),
("用户1", "接口2"),
("用户2", "接口3")),
2)
val result = rddData1.aggregate(ListBuffer[(String)]())(
(list: ListBuffer[String], tuple: (String, String)) => list += tuple._2,
(list1: ListBuffer[String], list2: ListBuffer[String]) => list1 ++= list2)
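// result collects the interface name of every record into a single ListBuffer;
// the element order depends on the order in which partition results are merged.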
println(result)
sc.stop()
}
}
fold
Aggregates the elements of the RDD. It is similar to aggregate, but simplified: the zero value, the input elements and the result all share the same type T.
Source code
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using a
* given associative function and a neutral "zero value". The function
* op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object
* allocation; however, it should not modify t2.
*
* This behaves somewhat differently from fold operations implemented for non-distributed
* collections in functional languages like Scala. This fold operation may be applied to
* partitions individually, and then fold those results into the final result, rather than
* apply the fold to each element sequentially in some defined ordering. For functions
* that are not commutative, the result may differ from that of a fold applied to a
* non-distributed collection.
*
* @param zeroValue the initial value for the accumulated result of each partition for the `op`
* operator, and also the initial value for the combine results from different
* partitions for the `op` operator - this will typically be the neutral
* element (e.g. `Nil` for list concatenation or `0` for summation)
* @param op an operator used to both accumulate results within a partition and combine results
* from different partitions
*/
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
val cleanOp = sc.clean(op)
val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
sc.runJob(this, foldPartition, mergeResult)
jobResult
}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TestSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Chapter5_2_1_8")
val sc = new SparkContext(conf)
val rddData1 = sc.parallelize(Array(5, 5, 15, 15), 2)
val result = rddData1.fold(1)((x, y) => x + y)
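// With 2 partitions, the zero value 1 is added once inside each partition and once
// more when the partition results are merged on the driver: (1+5+5) + (1+15+15) + 1 = 43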
println(result)
sc.stop()
}
}
foreach
Applies the function f to every element of the RDD in turn. Source code:
// Actions (launch a job to return a value to the user program)
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
Code
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TestSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Chapter5_2_1_8")
val sc = new SparkContext(conf)
val rddData1 = sc.parallelize(Array(5, 5, 15, 15), 2)
rddData1.foreach(println)
sc.stop()
}
}
When running on a cluster, the println output goes to the executors' stdout; view it in the Spark web UI:
Running Applications => Application ID => Executor Summary => Logs => stdout / stderr
foreachPartition
Iterates over the RDD one partition at a time, applying f to each partition's iterator.
Source code
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
Code
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TestSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Chapter5_2_1_8")
val sc = new SparkContext(conf)
val rddData1 = sc.parallelize(Array(5, 5, 15, 15), 2)
rddData1.foreachPartition(iter => {
while (iter.hasNext) {
val ele = iter.next()
println(ele)
}
})
sc.stop()
}
}
count
Source code
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Returns the number of elements in the RDD as a Long.
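A minimal illustrative sketch (the object name and data are made up):
import org.apache.spark.{SparkConf, SparkContext}

object CountExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CountExample")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array("a", "b", "c"), 2)
    println(rdd.count()) // 3
    sc.stop()
  }
}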
Key-value pair actions
lookup
Given a key K, returns all values V associated with that key in the RDD, collected into a Seq. If the RDD has a known partitioner, only the partition that the key maps to is scanned.
/**
* Return the list of values in the RDD for key `key`. This operation is done efficiently if the
* RDD has a known partitioner by only searching the partition that the key maps to.
*/
def lookup(key: K): Seq[V] = self.withScope {
self.partitioner match {
case Some(p) =>
val index = p.getPartition(key)
val process = (it: Iterator[(K, V)]) => {
val buf = new ArrayBuffer[V]
for (pair <- it if pair._1 == key) {
buf += pair._2
}
buf
} : Seq[V]
val res = self.context.runJob(self, process, Array(index))
res(0)
case None =>
self.filter(_._1 == key).map(_._2).collect()
}
}
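A small illustrative sketch of lookup (object name and data are made up); because a HashPartitioner is set, only the partition that the key hashes to is scanned:
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object LookupExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("LookupExample")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(("a", 1), ("b", 2), ("a", 3)), 2)
      .partitionBy(new HashPartitioner(2))
    // returns all values stored under key "a" (1 and 3) as a Seq
    println(rdd.lookup("a").mkString(", "))
    sc.stop()
  }
}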
countByKey
Counts how many times each key occurs. Internally it calls mapValues, reduceByKey and collect, and finally converts the result into a Scala Map.
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* @note This method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
Returns a Map of (key, occurrence count).
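A quick illustrative sketch (object name and data are made up):
import org.apache.spark.{SparkConf, SparkContext}

object CountByKeyExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CountByKeyExample")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Array(("a", 1), ("b", 2), ("a", 3), ("a", 4)), 2)
    // counts key occurrences, ignoring the values: a -> 3, b -> 1
    println(rdd.countByKey())
    sc.stop()
  }
}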
Numeric actions
| Numeric action | Description |
|---|---|
| max | Maximum element in the RDD |
| min | Minimum element in the RDD |
| sum | Sum of the elements in the RDD |
| mean | Mean of the elements in the RDD |
| variance | Variance of the elements in the RDD |
| sampleVariance | Sample variance of the elements in the RDD |
| stdev | Standard deviation of the elements in the RDD |
| sampleStdev | Sample standard deviation of the elements in the RDD |
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object WriteSpark {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[*]").setAppName("WriteSpark")
val sc = new SparkContext(conf)
val realDate1 = sc.parallelize(1 to 10)
println(realDate1.sum())
sc.stop()
}
}
Output: 55.0
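The other numeric actions in the table can be called the same way (the statistical ones come from Spark's implicit conversion to DoubleRDDFunctions); a brief illustrative sketch using the same data:
import org.apache.spark.{SparkConf, SparkContext}

object NumericActions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("NumericActions")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    println(rdd.max())      // 10
    println(rdd.min())      // 1
    println(rdd.mean())     // 5.5
    println(rdd.variance()) // 8.25 (population variance)
    println(rdd.stdev())    // ~2.87 (population standard deviation)
    sc.stop()
  }
}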