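RDDs support two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which run a computation on the dataset and return a value to the driver program. The difference is that a transformation takes an RDD and produces an RDD, while an action takes an RDD and produces something that is not an RDD. Transformations are evaluated lazily; actions are executed immediately. For example, df1.map is a transformation: nothing is computed when it is called. Only when an action that depends on df1 runs is df1 loaded into memory, and even if df1 was loaded earlier, it has to be loaded again unless it was kept with persist or cache. reduce is an action: it combines all elements with a given function and returns the final result to the driver program. (There is also a parallel reduceByKey, which returns a distributed dataset.)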
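The lazy behaviour described above can be tried without a cluster. The sketch below is only an analogy: a plain Scala collection view stands in for an RDD, and the names `LazyDemo` and `run` are made up for illustration. Like an unpersisted RDD, the view records the mapping, evaluates it only when consumed, and recomputes it on every traversal.

```scala
object LazyDemo {
  // Returns how many times the mapping function has run at three points:
  // after defining the "transformation", after the first "action",
  // and after the second "action".
  def run(): (Int, Int, Int) = {
    var calls = 0
    val source = Seq(1, 2, 3)

    // Defining the mapped view evaluates nothing yet,
    // much like rdd.map(...) only records the transformation.
    val squared = source.view.map { x => calls += 1; x * x }
    val afterDefine = calls

    // Consuming the view plays the role of an action: the function runs now.
    val total1 = squared.foldLeft(0)(_ + _)
    val afterFirst = calls

    // Nothing was cached, so a second traversal recomputes every element,
    // similar to an RDD that was not persisted.
    val total2 = squared.foldLeft(0)(_ + _)
    val afterSecond = calls

    require(total1 == 14 && total2 == 14)
    (afterDefine, afterFirst, afterSecond)
  }
}
```

Calling `LazyDemo.run()` shows the call counter staying at zero until the first traversal and growing again on the second, which is the cost that persist or cache avoids on a real RDD.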



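The table below lists the RDD transformations and actions in Spark (Spark 1.5.1). Each operation is given with its signature, where square brackets denote type parameters. As noted above, transformations are lazy operations that define a new RDD, while actions launch a computation and either return a value to the user program or write data to external storage.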




map[U: ClassTag](f: T => U): RDD[U]


 flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]


 filter(f: T => Boolean): RDD[T]


 distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]


repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]


 coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null): RDD[T]


 sample(withReplacement: Boolean,fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]

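 Returns an RDD containing a sampled subset of this RDD's elements. withReplacement = true: an element can be selected more than once, and fraction is the expected number of times each element is chosen (>= 0). withReplacement = false: an element is selected at most once, and fraction is the probability that each element is chosen (in [0, 1]).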

 randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]


 takeSample(withReplacement: Boolean,num: Int, seed: Long = Utils.random.nextLong): Array[T]


 union(other: RDD[T]): RDD[T]


 ++(other: RDD[T]): RDD[T]


 sortBy[K](f: (T) => K,ascending: Boolean = true, numPartitions: Int = this.partitions.length) (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]


 intersection(other: RDD[T]): RDD[T]


intersection(other: RDD[T],partitioner: Partitioner) (implicit ord: Ordering[T] = null): RDD[T]


 intersection(other: RDD[T], numPartitions: Int): RDD[T]


 glom(): RDD[Array[T]]


 cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

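 Returns an RDD of pairs: the Cartesian product of this RDD and other, that is, every pair (a, b) where a is an element of this RDD and b is an element of other.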

 groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]

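 Returns an RDD of groups; each group consists of a key and the sequence of elements that map to that key. The order of elements within a group is not guaranteed and may differ between evaluations. Note: this operation is very expensive. If the grouping only feeds an aggregation over each key (such as count or sum), the more efficient `PairRDDFunctions.aggregateByKey` or `PairRDDFunctions.reduceByKey` is recommended.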
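To see why a per-key aggregation does not need the groups themselves, here is a minimal sketch in plain Scala. It is not Spark code: `KeyedCount`, `countByKey` and the nested `Seq` used as "partitions" are hypothetical stand-ins. Each partition is reduced to a small map of counts, and only those maps are merged, which is the idea behind reduceByKey and aggregateByKey.

```scala
object KeyedCount {
  // Count elements per key by merging small per-partition maps.
  // `partitions` is a hypothetical stand-in for the partitions of an RDD.
  def countByKey[T, K](partitions: Seq[Seq[T]])(key: T => K): Map[K, Long] = {
    // Within each partition: fold the elements into a running count per key.
    val partials = partitions.map { part =>
      part.foldLeft(Map.empty[K, Long]) { (acc, x) =>
        val k = key(x)
        acc.updated(k, acc.getOrElse(k, 0L) + 1L)
      }
    }
    // Across partitions: merge the partial counts by adding them per key.
    partials.foldLeft(Map.empty[K, Long]) { (left, right) =>
      right.foldLeft(left) { case (acc, (k, n)) =>
        acc.updated(k, acc.getOrElse(k, 0L) + n)
      }
    }
  }
}
```

For example, `KeyedCount.countByKey(Seq(Seq(1, 2, 3), Seq(4, 5, 6)))(_ % 2)` counts odd and even numbers without ever building the list of members for either key.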

 groupBy[K](f: T => K,numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]


 groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null): RDD[(K, Iterable[T])]


 pipe(command: String): RDD[String]

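 Given an external program A, rdd.pipe(A) writes the elements of each partition to A's standard input and returns the lines A writes to standard output as a new RDD of strings. It is somewhat like map.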

 pipe(command: String, env: Map[String, String]): RDD[String]


 pipe(command: Seq[String],env: Map[String, String] = Map(), printPipeContext: (String => Unit) => Unit = null, printRDDElement: (T, String => Unit) => Unit = null, separateWorkingDir: Boolean = false,bufferSize: Int = 8192, encoding: String = Codec.defaultCharsetCodec.name): RDD[String]


 mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

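 Applies f to the data one partition at a time; how this differs from map is not covered here. preservesPartitioning should be true only when this is a pair RDD and f does not modify the keys.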

 mapPartitionsWithIndex[U: ClassTag]( f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning : Boolean = false): RDD[U]

 Same as above, but f also receives the index of the original partition.

 zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]

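 Zips two RDDs element by element into a key-value RDD. Both RDDs must have the same number of partitions and the same number of elements in each partition (the easiest way to guarantee this is for one RDD to be derived from the other by map).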

 zipPartitions[B: ClassTag, V: ClassTag] (rdd2: RDD[B], preservesPartitioning: Boolean) (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]

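 Zips two RDDs partition by partition. The RDDs must have the same number of partitions, but the partitions need not hold the same number of elements. f is applied to each pair of partition iterators to produce the new RDD.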

  zipPartitions[B: ClassTag, V: ClassTag] (rdd2: RDD[B]) (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]


  zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag] (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D]) (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]




A&C functions: functions that are both associative and commutative

Associative: a + (b + c) = (a + b) + c, i.e. f(a, f(b, c)) = f(f(a, b), c) for f(a, b) = a + b
Commutative: ab = ba, i.e. f(a, b) = f(b, a) for f(a, b) = a * b
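The two laws above can be spot-checked on sample values before a function is handed to reduce or fold. The helper below is a sketch with made-up names (`Laws`, `isAssociativeOn`, `isCommutativeOn`); passing the check on a few samples does not prove a law in general, but a single failing sample is enough to rule a function out.

```scala
object Laws {
  // True if f(a, f(b, c)) == f(f(a, b), c) for every triple drawn from `samples`.
  def isAssociativeOn[A](samples: Seq[A])(f: (A, A) => A): Boolean =
    samples.forall(a => samples.forall(b => samples.forall(c =>
      f(a, f(b, c)) == f(f(a, b), c))))

  // True if f(a, b) == f(b, a) for every pair drawn from `samples`.
  def isCommutativeOn[A](samples: Seq[A])(f: (A, A) => A): Boolean =
    samples.forall(a => samples.forall(b => f(a, b) == f(b, a)))
}
```

With integer samples such as `Seq(1, 2, 3)`, addition passes both checks, while subtraction and `(p, v) => p + v * v` fail both. Integers are used here to keep floating-point rounding out of the comparison.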

  foreach(f: T => Unit): Unit


  foreachPartition(f: Iterator[T] => Unit): Unit


  collect(): Array[T]


  toLocalIterator: Iterator[T]

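 Returns an iterator over all elements. It uses as much memory as the largest partition of the RDD. Note: it is best to persist the RDD before calling this action, because the partitions are fetched by separate jobs and the RDD would otherwise be recomputed.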

 collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]


 subtract(other: RDD[T]): RDD[T]

 Returns an RDD with the elements that are in the calling RDD but not in other. The result uses the partitioning of the calling RDD.

 subtract(other: RDD[T], numPartitions: Int): RDD[T]


 subtract( other: RDD[T],p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]


 reduce(f: (T, T) => T): T


 treeReduce(f: (T, T) => T, depth: Int = 2): T

 Reduces in a multi-level tree pattern, which can lower the cost of the reduce; f must be an A&C function.

 fold(zeroValue: T)(op: (T, T) => T): T


 aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

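 Aggregates the elements within each partition, then aggregates the per-partition results. The result type may differ from the element type, i.e. RDD[T] => U. seqOp runs within a partition and may change the type; combOp merges the partition results and must be associative.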
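The two-step shape of aggregate can be sketched with plain Scala collections. This is a simulation rather than the Spark implementation: `AggregateDemo` and the nested `Seq` standing in for partitions are assumptions made for illustration, and the partition results are merged in a fixed order here. The example computes a mean through a (sum, count) accumulator, so the element type is Int while the result type is (Long, Long).

```scala
object AggregateDemo {
  // Simulates aggregate over hypothetical partitions: seqOp inside each
  // partition, combOp across the partition results, zeroValue used in both steps.
  def aggregate[T, U](partitions: Seq[Seq[T]])(zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zeroValue)(seqOp)).foldLeft(zeroValue)(combOp)

  // Mean of Ints via a (sum, count) accumulator.
  def mean(partitions: Seq[Seq[Int]]): Double = {
    val (sum, count) = aggregate(partitions)((0L, 0L))(
      (acc: (Long, Long), x: Int) => (acc._1 + x, acc._2 + 1L),
      (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2))
    sum.toDouble / count
  }
}
```

Because combOp here is plain component-wise addition, the mean comes out the same however the five values are split across partitions.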

 treeAggregate[U: ClassTag](zeroValue: U)( seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2): U


 count(): Long


 countApprox(timeout: Long,confidence: Double = 0.95) : PartialResult[BoundedDouble]

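 Returns an approximate count after waiting at most timeout milliseconds, even if not all tasks have finished. confidence is the probability that the returned bounds contain the true value.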

 countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]

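 Counts how many times each element occurs and returns the result to the driver as a local Map. For a very large RDD, prefer rdd.map(x => (x, 1L)).reduceByKey(_ + _), which yields an RDD[(T, Long)] instead of a local Map.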

 countByValueApprox(timeout: Long, confidence: Double = 0.95) (implicit ord: Ordering[T] = null) : PartialResult[Map[T, BoundedDouble]]


 countApproxDistinct(p: Int, sp: Int): Long


 countApproxDistinct(relativeSD: Double = 0.05): Long


 zipWithIndex(): RDD[(T, Long)]

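 Zips the elements of the RDD with their indices. Indices are ordered first by partition and then by position within the partition: the first element of the first partition gets index 0 and the last element of the last partition gets the largest index. If a particular order is required, use sortByKey to guarantee it.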

 zipWithUniqueId(): RDD[(T, Long)]

 Zips the elements of the RDD with unique ids. Unlike indices, unique ids may have gaps. If a particular order is required, use sortByKey to guarantee it.
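The difference between the two numbering schemes is easiest to see side by side. The sketch below simulates both on a nested `Seq` that stands in for the partitions; `ZipIds` is a made-up name. It assumes the scheme given in the Spark API documentation for zipWithUniqueId, where with n partitions the items of partition k receive the ids k, n + k, 2n + k, and so on.

```scala
object ZipIds {
  // zipWithIndex: consecutive indices, ordered by partition and then by position.
  def zipWithIndex[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    // Starting index of each partition = number of elements in earlier partitions.
    val starts = partitions.scanLeft(0L)(_ + _.size)
    partitions.zip(starts).map { case (part, start) =>
      part.zipWithIndex.map { case (x, i) => (x, start + i) }
    }
  }

  // zipWithUniqueId: with n partitions, the i-th element (0-based) of
  // partition k gets id i * n + k, so partition sizes need not be known.
  def zipWithUniqueId[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    val n = partitions.size.toLong
    partitions.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (x, i) => (x, i * n + k) }
    }
  }
}
```

For partitions of sizes 3, 1 and 2, the indices run from 0 to 5 without gaps, whereas the unique ids skip 4 and go up to 6. The index version needs the sizes of all earlier partitions first, which the unique-id version avoids.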

 take(num: Int): Array[T]


 first(): T


 top(num: Int)(implicit ord: Ordering[T]): Array[T]

 Returns the num largest elements to the driver as an array, sorted in descending order by default. sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2) returns Array(6, 5).

 takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

 Same as above with the opposite order: sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2) returns Array(2, 3).

 max()(implicit ord: Ordering[T]): T


 min()(implicit ord: Ordering[T]): T


 isEmpty(): Boolean

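 Returns true if the RDD contains no elements at all. An RDD with no partitions is empty, and so is an RDD that has one or more partitions but no elements. Note: calling this on an RDD of Nothing or Null throws an exception. `parallelize(Seq())` has type `RDD[Nothing]`; use `parallelize(Seq[T]())` instead of `parallelize(Seq())` to avoid this.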

 saveAsTextFile(path: String): Unit


 saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit


 saveAsObjectFile(path: String): Unit


 keyBy[K](f: T => K): RDD[(K, T)]

 Builds a key-value RDD by applying f to each element to produce its key.

 private[spark] def collectPartitions(): Array[Array[T]]


 checkpoint(): Unit


 localCheckpoint(): this.type


 isCheckpointed: Boolean


private[rdd] def isLocallyCheckpointed: Boolean


 getCheckpointFile: Option[String]

 Returns the name of the file to which this RDD was checkpointed.




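sc.parallelize(Array(2.0, 3.0)).fold(0.0)((p, v) => p + v * v)

Because (p, v) => p + v * v is not commutative (nor associative), the result depends on how the elements are split across partitions and cannot be relied on. It can be rewritten as:

sc.parallelize(Array(2.0, 3.0)).map(v => v * v).reduce(_ + _)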
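Why the first version misbehaves becomes visible once the two stages of fold are written out. The sketch below simulates them on plain Scala collections and assumes, as in Spark's RDD.fold, that every partition is folded starting from zeroValue and that the partition results are then folded again with the same op, once more starting from zeroValue. `FoldDemo` and the nested `Seq` standing in for partitions are hypothetical, and the merge order is fixed here.

```scala
object FoldDemo {
  // Simulates fold over hypothetical partitions: each partition is folded
  // starting from zeroValue, then the partition results are folded again
  // with the same op, again starting from zeroValue.
  def fold[T](partitions: Seq[Seq[T]])(zeroValue: T)(op: (T, T) => T): T =
    partitions.map(_.foldLeft(zeroValue)(op)).foldLeft(zeroValue)(op)

  // The problematic operator from the example above.
  val sumOfSquaresOp: (Double, Double) => Double = (p, v) => p + v * v
}
```

With both values in one partition the simulation yields 169.0, because the partition result 13.0 is squared again in the merge step. With one value per partition it yields 97.0, because 4.0 and 9.0 are each squared a second time. Neither is the intended 13.0, which the map-then-sum version produces for any partitioning.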



class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) {
  override def toString(): String = "[%.3f, %.3f]".format(low, high)
}







  • Persisting or caching with StorageLevel.DISK_ONLY causes the RDD to be computed and stored so that subsequent uses of that RDD do not recompute the lineage beyond that point.
  • After persist is called, Spark still remembers the lineage of the RDD even though it does not need to recompute it.
  • After the application terminates, the cache is cleared and the files are destroyed.


  • Checkpointing stores the RDD physically to HDFS and destroys the lineage that created it.
  • The checkpoint file is not deleted even after the Spark application terminates.
  • Checkpoint files can be used in a subsequent job run or driver program.
  • Checkpointing an RDD causes it to be computed twice unless it is persisted first: once by the job that uses it, and again when it is written to the checkpoint directory. Persisting the RDD before calling checkpoint avoids the recomputation.


PairRDD functions


