Spark flatMap Operator Examples: A Guide to Spark Operators

Reposted

mob6454cc786d85 2024-08-30 14:37:27


The 32 commonly used operators listed in the official Spark documentation


Common Spark transformations

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object Transformation {
  val conf = new SparkConf().setAppName("Transformation").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
//    map()
//    filter()
//    flatMap()
//    mapPartitions()
//    mapPartitionsWithIndex()
//    sample()
//    union()
//    intersection()
//    distinct()
//    groupByKey()
//    reduceByKey()
//    aggregateByKey()
//    sortByKey()
//    join()
//    cogroup()
//    cartesian()
//    coalesce()
//    repartition()
    repartitionAndSortWithinPartitions()
  }

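  /**
    * map(function)
    * Applies the function to each element of the RDD and returns a new RDD with
    * exactly one output element per input element.
    */
  def map(): Unit ={
    sc.parallelize(1 to 10, 3)
      .map(_ + 1)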
      .foreach(println)
  }

  /**
    * filter(function)
    * Returns a new dataset made up of the elements of the RDD for which the
    * function returns true.
    */
  def filter(): Unit ={
    sc.parallelize(1 to 10, 1)
      .filter(it => it % 2 == 0)
      .foreach(println)
  }

  /**
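    * flatMap(function)
    * map applies the function to each element of the RDD and produces exactly one
    * output element per input element. flatMap also applies the function to each
    * element, but the function returns a collection of zero or more items, and the
    * contents of all returned collections together make up the new RDD.
    *
    * In short, map only maps, while flatMap maps first and then flattens.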
    */
  def flatMap(): Unit ={
    sc.parallelize(1 to 5, 1)
      .flatMap(( _ to 5))
      .foreach(println)
  }

  /**
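    * mapPartitions(function)
    * Unlike foreachPartition (an action that returns nothing), mapPartitions is a
    * transformation and returns a new RDD.
    * Unlike map, the function runs once per partition of the RDD rather than once
    * per element, so on an RDD of type T the function must have the type
    * Iterator[T] => Iterator[U].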
    */
  def mapPartitions(): Unit ={
    sc.parallelize(1 to 10, 3)
      .mapPartitions(it => {for(e <- it) yield e * 2})
      .foreach(println)
  }

  /**
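    * mapPartitionsWithIndex(function)
    * Like mapPartitions, but the function also receives the index of the partition
    * as an Int, so it must have the type (Int, Iterator[T]) => Iterator[U].
    * The example prints the sum of each partition as "index:sum".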
    */
  def mapPartitionsWithIndex(): Unit ={
    sc.parallelize(1 to 10, 3)
      // Emit one "partitionIndex:sum" string per partition.
      .mapPartitionsWithIndex((index, it) => Iterator(s"$index:${it.sum}"))
      .foreach(println)
  }

  /**
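    * sample(withReplacement, fraction, seed)
    * withReplacement: whether to sample with replacement (true) or without (false).
    * fraction: the expected sampling ratio. Without replacement it is the probability
    * of selecting each element and must be in [0, 1]; with replacement it is the
    * expected number of times each element is chosen and must be >= 0.
    * seed: a Long used to seed the random number generator.
    * The size of the sample is not guaranteed to be exactly fraction * count.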
    */
  def sample(): Unit ={
    sc.parallelize(1 to 10, 3)
      .sample(false, 0.5, 1)
      .foreach(println)
  }

  /**
    * union(otherDataset)
    * Returns the union of the source dataset and the other dataset. Duplicates
    * are kept.
    */
  def union(): Unit ={
    val data = sc.parallelize(1 to 10, 3)
    sc.parallelize(1 to 10, 3)
      .union(data)
      .foreach(println)
  }

  /**
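    * intersection(otherDataset)
    * Returns the intersection of the source dataset and the other dataset, with
    * duplicates removed. The order of the result is not guaranteed.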
    */
  def intersection(): Unit ={
    val data = sc.parallelize(6 to 12, 1)
    sc.parallelize(1 to 10, 1)
      .intersection(data)
      .foreach(println)

  }

  /**
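    * distinct([numPartitions])
    * Returns a new dataset containing the distinct elements of the source dataset.
    * Deduplication normally involves a shuffle, so the order of the result is not
    * guaranteed.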
    */
  def distinct(): Unit ={
    sc.parallelize(List(1,3,5,7,12,34,1,67,12,3,5,12,12,12,7,7,7,7), 2)
      .distinct()
      .collect()
      .foreach(println)
  }

  /**
    * groupByKey([numPartitions])
    * Groups the elements of a pair RDD that share the same key: (K, V) pairs
    * become (K, Iterable[V]) pairs.
    */
  def groupByKey(): Unit ={
    sc.parallelize(List(("武当", "张三丰"), ("峨眉", "灭绝师太"), ("武当", "张无忌"), ("峨眉", "周芷若")))
      .groupByKey()
      .foreach(println)
  }

  /**
    * reduceByKey(function, [numPartitions])
    * Merges the values that share the same key in a pair RDD using the given
    * function, which must have the type (V, V) => V.
    */
  def reduceByKey(): Unit ={
    sc.parallelize(List(("武当", 99), ("少林", 97), ("武当", 89), ("少林", 77)))
      .reduceByKey(_+_)
      .foreach(println)
  }

  /**
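    * aggregateByKey(zeroValue, [numPartitions])(seqOp, combOp)
    * Aggregates the values of each key in a pair RDD, starting from a neutral
    * initial value (zeroValue). As with aggregate, the result type does not have to
    * match the value type of the RDD. Because the aggregation is done per key,
    * aggregateByKey returns a pair RDD of (key, aggregated value), whereas aggregate
    * returns a plain value rather than an RDD.
    * Spark defines three overloads of aggregateByKey, and all of them end up in
    * the same implementation.
    * Put another way: it is similar to reduceByKey, except that it takes an initial
    * value, which is used by seqOp (merging values within a partition) but not by
    * combOp (merging the per-partition results).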
    */
  def aggregateByKey(): Unit ={
    sc.parallelize(List("hello world!", "I am a dog", "hello world!", "I am a dog"))
      .flatMap(_.split(" "))
      .map(( _, 1))
      .aggregateByKey(0)(_+_,_+_)
      .foreach(tuple =>println(tuple._1+"->"+tuple._2))
  }

  /**
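    * sortByKey([ascending], [numPartitions])
    * Sorts a pair RDD by key. ascending defaults to true; the example passes false
    * to sort in descending order.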
    */
  def sortByKey(): Unit ={
    sc.parallelize(List((99, "张三丰"), (96, "东方不败"), (66, "林平之"), (98, "聂风")))
      .sortByKey(false)
      .foreach(tuple => println(tuple._2 + "->" + tuple._1))
  }

  /**
    * join(otherDataset, [numPartitions])
    * Inner join by key: called on datasets of (K, V) and (K, W) pairs, it returns a
    * dataset of (K, (V, W)) pairs for every key present in both.
    */
  def join(): Unit ={
    val list1RDD = sc.parallelize(List((1, "东方不败"), (2, "令狐冲"), (3, "林平之")))
    val list2RDD = sc.parallelize(List((1, 99), (2, 98), (3, 97)))
    list1RDD.join(list2RDD)
      .foreach(println)
  }

  /**
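    * cogroup(otherDataset, [numPartitions])
    * For each key, collects the values from each RDD into a separate collection.
    * Unlike reduceByKey, which merges the values of a key within one RDD, cogroup
    * groups the values of the same key across several RDDs.
    * Put another way: cogrouping (K, V) with (K, W) yields (K, (Iterable[V], Iterable[W])),
    * where the first Iterable holds the values for that key from the first RDD and
    * the second holds those from the other RDD. The example cogroups three RDDs, so
    * each result holds three Iterables. The data has to be repartitioned by key,
    * which requires a shuffle unless the RDDs already share the same partitioner.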
    */

  def cogroup(): Unit ={
    val list1RDD = sc.parallelize(List((1, "www"), (2, "bbs")))
    val list2RDD = sc.parallelize(List((1, "cnblog"), (2, "cnblog"), (3, "very")))
    val list3RDD = sc.parallelize(List((1, "com"), (2, "com"), (3, "good")))

    list1RDD.cogroup(list2RDD,list3RDD)
      .foreach(println)
  }

  /**
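    * cartesian(otherDataset)
    * Returns the Cartesian product of the two datasets, that is, every pair (a, b).
    * This operation does not perform a shuffle, but the result has as many elements
    * as the product of the two input sizes, so it can be expensive.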
    *
    */
  def cartesian(): Unit ={
    val list1RDD = sc.parallelize(List("A","B"))
    val list2RDD = sc.parallelize(List(1,2,3))
    list1RDD.cartesian(list2RDD)
      .foreach(println)
  }

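  // pipe(command): pipes each partition of the RDD through an external command.
  // Left unimplemented in this listing.
  def pipe(): Unit ={

  }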

  /**
    * coalesce(numPartitions)
    * Reduces the number of partitions in the RDD to numPartitions. By default no
    * shuffle is performed (shuffle = false).
    *
    */
  def coalesce(): Unit ={
    sc.parallelize(List(1,2,3,4,5,6,7,8,9),3)
      .coalesce(1)
      .foreach(println)
  }

  /**
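    * repartition(numPartitions)
    * repartition is shorthand for coalesce(numPartitions, shuffle = true): it
    * reshuffles the data in the RDD randomly so that the partitions are as balanced
    * as possible. It always performs a shuffle, whether the number of partitions
    * goes up or down. To reduce the partition count without a shuffle, use coalesce.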
    */
  def repartition(): Unit ={
    sc.parallelize(List(1,2,3,4),1)
      .repartition(2)
      .foreach(println)
  }

  /**
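    * repartitionAndSortWithinPartitions(partitioner)
    * A variant of repartition for pair RDDs: it repartitions the RDD with the given
    * partitioner and sorts the records by key within each resulting partition.
    * This is more efficient than calling repartition and then sorting inside each
    * partition, because the sort is pushed down into the shuffle.
    */
  def repartitionAndSortWithinPartitions(): Unit = {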
    val listRDD = sc.parallelize(List(1, 4, 55, 66, 33, 48, 23),1)
    listRDD.map(num => (num,num))
      .repartitionAndSortWithinPartitions(new HashPartitioner(2))
      .mapPartitionsWithIndex((index,iterator) => {
        val listBuffer: ListBuffer[String] = new ListBuffer
        while (iterator.hasNext) {
          listBuffer.append(index + "_" + iterator.next())
        }
        listBuffer.iterator
      },false)
      .foreach(println)
  }

}
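
The difference between map and flatMap described in the listing above is easiest to see by applying the same function with both. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name FlatMapVsMap, the helper run, and the two sample sentences are made up for illustration and are not part of the original listing.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone example; names and input data are illustrative only.
object FlatMapVsMap {

  // Applies the same split function with map and with flatMap.
  def run(sc: SparkContext): (Array[Array[String]], Array[String]) = {
    val lines = sc.parallelize(List("hello world", "hello spark"), 1)

    // map: one output element per input element, so each line becomes one array.
    val mapped = lines.map(_.split(" ")).collect()
    // flatMap: the arrays returned by the function are flattened into single words.
    val flattened = lines.flatMap(_.split(" ")).collect()

    (mapped, flattened)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FlatMapVsMap").setMaster("local"))
    val (mapped, flattened) = run(sc)
    println(mapped.map(_.mkString("[", ", ", "]")).mkString(" "))
    println(flattened.mkString(" "))
    sc.stop()
  }
}
```

With a single partition, this prints `[hello, world] [hello, spark]` for map and `hello world hello spark` for flatMap: two elements in the first case, four in the second.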

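The aggregateByKey comment in the listing notes that the result type need not match the value type of the RDD, but the example there uses Int for both. The sketch below, assuming the same local setup, computes an average per key with a (sum, count) accumulator. The object name AverageByKey and the helper averages are hypothetical; the input pairs are borrowed from the reduceByKey example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone example; the object and method names are illustrative only.
object AverageByKey {

  // Returns the average score per key. The accumulator is a (sum, count) pair,
  // so its type differs from the Int values stored in the RDD.
  def averages(sc: SparkContext): Map[String, Double] = {
    val scores = sc.parallelize(List(("武当", 99), ("少林", 97), ("武当", 89), ("少林", 77)), 2)
    scores
      .aggregateByKey((0, 0))(
        // seqOp: folds one value into the accumulator within a partition, starting from zeroValue.
        (acc, score) => (acc._1 + score, acc._2 + 1),
        // combOp: merges the accumulators of two partitions.
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageByKey").setMaster("local"))
    averages(sc).foreach { case (key, avg) => println(key + "->" + avg) }
    sc.stop()
  }
}
```

For this input the averages are 94.0 for 武当 and 87.0 for 少林. The same result could be reached with mapValues followed by reduceByKey; aggregateByKey simply avoids building a (value, 1) pair for every record first.
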
Common Spark actions

import org.apache.spark.{SparkConf, SparkContext}

object Action {

  val conf = new SparkConf().setAppName("Action").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    //    reduce()
    //    collect()
    //    count()
    //    first()
    //    takeSample()
    //    take()
    //    takeOrdered()
    //    saveAsTextFile()
    //    saveAsSequenceFile()
    countByKey()
  }
  
  /**
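    * reduce(function)
    * Merges all elements of the RDD into a single value. The function takes two
    * elements and returns their combination; that result is then combined with the
    * next element, and so on until only one value remains. The function should be
    * commutative and associative so that it can be computed correctly in parallel.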
    */
  def reduce(): Unit ={
    val result = sc.parallelize(List(1,2,3,4,5,6))
      .reduce((x,y) => x+y)
    println(result)

  }

  /**
    * collect()
    * Returns all elements of the RDD to the driver as an Array.
    */
  def collect(): Unit ={
    val array = sc.parallelize(1 to 10, 2)
      .collect()
    array.foreach(println)
  }

  /**
    * count()
    * Returns the number of elements in the dataset as a Long.
    */
  def count(): Unit ={
    println(sc.parallelize(1 to 10, 2).count())
  }

  /**
    * first()
    * Returns the first element of the dataset (similar to take(1)).
    */
  def first(): Unit ={
    println(sc.parallelize(1 to 10, 2).first())
  }

  /**
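    * takeSample(withReplacement, num, [seed])
    * Returns an array with a random sample of num elements of the dataset.
    * withReplacement selects sampling with or without replacement, and seed sets
    * the seed of the random number generator.
    * Use it only when the resulting array is expected to be small, because all of
    * its data is loaded into the driver's memory.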
    */
  def takeSample(): Unit ={
    val array = sc.parallelize(1 to 10, 2).takeSample(true,3,1)
    array.foreach(println)
  }

  /**
    * take(n)
    * Returns an array with the first n elements of the dataset (indices 0 to n-1),
    * without sorting.
    */
  def take(): Unit ={
    val array = sc.parallelize(List(2,7,1,8,3),2).take(3)
    array.foreach(println)
  }

  /**
    * takeOrdered(n, [ordering])
    * Returns the n smallest elements of the RDD, sorted in their natural (ascending)
    * order or according to a custom Ordering.
    */
  def takeOrdered(): Unit ={
    val array = sc.parallelize(List(2,7,1,8,3),2).takeOrdered(3)
    array.foreach(println)
  }

  /**
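    * saveAsTextFile(path)
    * Writes the elements of the dataset as text files to a directory in the local
    * file system, HDFS, or another supported file system. Spark calls toString on
    * each element to turn it into one line of text.
    * When saving to the local file system, the files are written only to the local
    * directory of the machine where each executor runs.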
    */
  def saveAsTextFile(): Unit ={
    sc.parallelize(List(2,7,1,8,3),2).saveAsTextFile("E:\\data\\")
  }

  /**
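    * saveAsSequenceFile(path) (Java and Scala)
    * Writes the elements of the dataset as a Hadoop SequenceFile to the local file
    * system, HDFS, or another supported file system. Available only on pair RDDs
    * whose keys and values can be converted to Hadoop Writables.
    */
  def saveAsSequenceFile(): Unit ={
    // saveAsSequenceFile needs key-value pairs, so pair each number with its string form.
    sc.parallelize(List(2,7,1,8,3),2)
      .map(n => (n, n.toString))
      .saveAsSequenceFile("E:\\data\\")
  }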

  /**
    * countByKey()
    * Counts the elements for each key of an RDD[(K, V)] and returns the result to
    * the driver as a Map[K, Long] from each key to its count.
    */
  def countByKey(): Unit ={
    val map = Map("qq" -> "11", "ww" -> "22", "ee" -> "33", "rr" -> "44", "5" -> "55", "6" -> "66")
    sc.parallelize(map.toList, 2)
      .countByKey()
      .foreach(println)
  }

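  /**
    * foreach(function)
    * Runs the function on each element of the dataset, usually for its side effects.
    */
  def foreach(): Unit ={
    sc.parallelize(1 to 5, 2).foreach(println)
  }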
}
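
The take and takeOrdered comments above describe the optional ordering only in words. As a rough illustration, the sketch below runs take, takeOrdered with the default ordering, and takeOrdered with a reversed Ordering on the same data used in the listing. The object name TakeVariants and the helper run are hypothetical, and a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone example; the object and method names are illustrative only.
object TakeVariants {

  // Returns (first three, three smallest, three largest) for the sample data.
  def run(sc: SparkContext): (List[Int], List[Int], List[Int]) = {
    val rdd = sc.parallelize(List(2, 7, 1, 8, 3), 2)

    // take: the first n elements in partition order, no sorting.
    val firstThree = rdd.take(3).toList
    // takeOrdered with the implicit natural ordering: the n smallest, ascending.
    val smallestThree = rdd.takeOrdered(3).toList
    // takeOrdered with an explicit Ordering: reversing it yields the n largest, descending.
    val largestThree = rdd.takeOrdered(3)(Ordering[Int].reverse).toList

    (firstThree, smallestThree, largestThree)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TakeVariants").setMaster("local"))
    val (first, smallest, largest) = run(sc)
    println(first.mkString(","))
    println(smallest.mkString(","))
    println(largest.mkString(","))
    sc.stop()
  }
}
```

For this input the three lines printed are `2,7,1`, `1,2,3`, and `8,7,3`.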
