I. Background

Spark has four repartitioning operators:

1.repartition
2.coalesce
3.partitionBy
4.repartitionAndSortWithinPartitions

II. The map, mapPartitions, mapPartitionsWithIndex, sortBy, and sortByKey operators in Spark

1. Create a collection (two partitions by default)

Here the partition count is explicitly set to 3:

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

2. Check the number of partitions

partitions.length

scala> rdd1.partitions.length
res0: Int = 3
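
The "two partitions by default" above comes from the shell's default parallelism: when no partition count is passed, parallelize falls back to sc.defaultParallelism, which can be checked directly in the same session (a quick sketch, not part of the original transcript):

// Without an explicit count, parallelize uses the default parallelism
// (2 in a local[2] shell; the value depends on the master configuration).
sc.defaultParallelism
sc.parallelize(List(1,2,3,4,5,6)).partitions.length   // equals sc.defaultParallelism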

3. map vs. mapPartitions

map applies the function to each element, while mapPartitions applies the function to each partition (it receives the whole partition as an iterator). If the RDD's data set is small, map is fine; with more data, mapPartitions can improve efficiency, because per-partition work runs once per partition rather than once per element. However, since mapPartitions processes an entire partition at a time, very large partitions may cause an OOM. A sketch of the per-partition-setup pattern follows the REPL examples below.

//Method 1
//map: multiply every element of rdd1 by 10
scala> rdd1.map(_*10).collect
res0: Array[Int] = Array(10, 20, 30, 40, 50, 60)

//Method 2: mapPartitions (the next two lines show the REPL's tab completion)
scala> rdd1.mapPartitions
mapPartitions   mapPartitionsWithIndex

scala> rdd1.mapPartitions(_.map(_*10)).collect
res1: Array[Int] = Array(10, 20, 30, 40, 50, 60)
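
As mentioned above, the efficiency gain from mapPartitions usually comes from doing expensive setup once per partition instead of once per element. A minimal sketch, using a made-up FakeConnection class as a stand-in for something like a database connection (not part of the original example):

// FakeConnection is a hypothetical stand-in for an expensive per-partition resource.
class FakeConnection {
  def lookup(x: Int): Int = x * 10
}

// mapPartitions: the resource is created once per partition and reused for every element in it.
val viaPartitions = rdd1.mapPartitions { iter =>
  val conn = new FakeConnection()
  iter.map(x => conn.lookup(x))
}

// map: the same setup would run once per element.
val viaMap = rdd1.map(x => new FakeConnection().lookup(x))

viaPartitions.collect   // Array(10, 20, 30, 40, 50, 60)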

4. mapPartitionsWithIndex lets you view the data in each partition

scala> val Iter = (index:Int,iter:Iterator[(Int)]) =>{
     | iter.map(x => "[partID:"+index+ ", value: "+x+"]")
     | }
Iter: (Int, Iterator[Int]) => Iterator[String] = <function2>

scala> rdd1.mapPartitionsWithIndex(Iter).collect
res5: Array[String] = Array([partID:0, value: 1], [partID:0, value: 2], [partID:1, value: 3], [partID:1, value: 4], [partID:2, value: 5], [partID:2, value: 6])

5. Two ways to sort: sortBy and sortByKey

(1) Create an array collection

scala> val rdd2 = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[5] at parallelize at <console>:24
//Method 1: sortByKey
scala> rdd2.sortByKey().collect
res6: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))

//Ascending by default; passing false sorts in descending order
scala> rdd2.sortByKey(false).collect
res7: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
//Method 2: sortBy. x is each element; ._1 takes the first value of the tuple
scala> rdd2.sortBy(x => x._1,false).collect
res8: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
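
Unlike sortByKey, sortBy can order by any function of the element, not just the key; for example, sorting the same pairs by their String value (a small sketch, not part of the original transcript):

// Sort by the String value (._2) instead of the Int key; ascending by default
rdd2.sortBy(_._2).collect
// expected: Array((3,aa), (2,bb), (6,cc), (1,dd))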

III. Repartitioning operators
Create a List collection

scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6),3)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24

1. The repartition operator

repartition always repartitions with shuffle = true.

Source code:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

scala> rdd3.repartition(6)
res9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at repartition at <console>:27

//The original rdd3 keeps its old partition count
scala> rdd3.partitions.length
res10: Int = 3

//The new RDD returned by repartition has the new partition count
scala> res9.partitions.length
res11: Int = 6

//Set the partition count to 40 (increase)
scala> rdd3.repartition(40)
res12: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[25] at repartition at <console>:27

//Set the partition count to 1 (decrease)
scala> rdd3.repartition(1)
res14: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[29] at repartition at <console>:27

//Check the decreased partition count
scala> res14.partitions.length
res15: Int = 1
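
To see where the six values actually landed after repartition(6), the Iter function from section II can be reused on res9; the exact value-to-partition assignment depends on how the shuffle distributes the elements round-robin, so it may differ from run to run (a quick sketch, not part of the original transcript):

// Inspect the 6 new partitions of res9 (rdd3.repartition(6))
res9.mapPartitionsWithIndex(Iter).collect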

2. The coalesce operator

scala> val rdd4 = rdd3.coalesce(8)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[30] at coalesce at <console>:26

//1. Asking for more partitions has no effect (shuffle defaults to false)
scala> rdd4.partitions.length
res16: Int = 3

//2. Decreasing the partition count works
scala> val rdd4 = rdd3.coalesce(1)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[31] at coalesce at <console>:26

scala> rdd4.partitions.length
res17: Int = 1

//3. Keeping the original partition count also works
scala> val rdd4 = rdd3.coalesce(3)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[32] at coalesce at <console>:26

scala> rdd4.partitions.length
res18: Int = 3
//1. Source code: the shuffle parameter defaults to false
def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)  

//2. With the default, partitions are merged locally and no shuffle occurs, which is why the count cannot grow

//3. coalesce(4, true): with the second parameter (shuffle) set to true, the count can be increased
scala> val rdd4 = rdd3.coalesce(4,true)
rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at coalesce at <console>:26

//4. Check the new partition count
scala> rdd4.partitions.length
res20: Int = 4
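
A common reason to use coalesce with the default shuffle = false is shrinking the partition count after a selective filter has left many partitions nearly empty; a small sketch (the numbers are illustrative, not from the original post):

// After a selective filter, most of the 100 original partitions hold very little data,
// so merge them into 4 partitions without a shuffle.
val bigRdd   = sc.parallelize(1 to 1000000, 100)
val filtered = bigRdd.filter(_ % 1000 == 0)   // keeps only ~1000 elements
val compact  = filtered.coalesce(4)           // no shuffle

compact.partitions.length   // 4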

3. The partitionBy operator

Create rdd5 to prepare for repartitioning with partitionBy

scala> val rdd5 = sc.parallelize(List(("e",5),("c",3),("d",4),("c",2),("a",1)),2)
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24

(1) partitionBy (source code): only a (K, V) RDD can be repartitioned this way, and a custom Partitioner can be supplied

def partitionBy(partitioner: Partitioner): JavaPairRDD[K, V] =
    fromRDD(rdd.partitionBy(partitioner))

(2) The partitionBy operator is most commonly used with HashPartitioner

//HashPartitioner source code:

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
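
HashPartitioner maps each key to a partition using the non-negative remainder of key.hashCode divided by the partition count. Its getPartition method is public, so you can check directly where a key would land (a quick sketch, not part of the original transcript):

// Which partition would each of rdd5's keys land in with 4 partitions?
val hp = new org.apache.spark.HashPartitioner(4)
List("e", "c", "d", "a").map(k => (k, hp.getPartition(k)))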

(3) Repartitioning with partitionBy

//1. HashPartition is not a class in org.apache.spark, so this fails; the correct name is HashPartitioner
scala> rdd5.partitionBy(new org.apache.spark.HashPartition(4))
<console>:27: error: type HashPartition is not a member of package org.apache.spark
rdd5.partitionBy(new org.apache.spark.HashPartition(4))
                                             ^
//2. With the correct name, org.apache.spark.HashPartitioner, the repartitioning works
scala> rdd5.partitionBy(new org.apache.spark.HashPartitioner(4))
res23: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[39] at partitionBy at <console>:27

//3. Check the new partition count
scala> res23.partitions.length
res24: Int = 4
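
Since partitionBy accepts any Partitioner, a custom one can also be plugged in. A minimal sketch (the class name and split rule are made up for illustration):

import org.apache.spark.Partitioner

// Keys whose first letter is between 'a' and 'm' go to partition 0, the rest to partition 1.
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.headOption.exists(_ <= 'm')) 0 else 1
}

val byLetter = rdd5.partitionBy(new FirstLetterPartitioner)
byLetter.partitions.length   // 2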

4. The repartitionAndSortWithinPartitions operator

repartitionAndSortWithinPartitions source code:

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }

Repartitioning with the repartitionAndSortWithinPartitions operator

//repartitionAndSortWithinPartitions, applied to the (String, Int) pairs in rdd5
scala> rdd5.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(1))
res29: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[41] at repartitionAndSortWithinPartitions at <console>:27

scala> res29.collect
res32: Array[(String, Int)] = Array((a,1), (c,3), (c,2), (d,4), (e,5))
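
The Spark API docs note that repartitionAndSortWithinPartitions is more efficient than calling repartition and then sorting inside each partition, because the sorting is pushed into the shuffle machinery. The Iter function above only handles Int values, so a (String, Int) variant (written here just for inspection, not part of the original transcript) can confirm that keys come out sorted within each of the new partitions:

// A (String, Int) version of the partition inspector used earlier.
val pairIter = (index: Int, iter: Iterator[(String, Int)]) =>
  iter.map(x => "[partID:" + index + ", value: " + x + "]")

rdd5.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(2))
    .mapPartitionsWithIndex(pairIter)
    .collect
// Within each of the two partitions, the records appear ordered by key.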

More advanced operator examples:

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

IV. Additional notes (parallelism and partition count)

1. In general, the number of partitions you ask for is the number you get.

2. The parallelism you request is not necessarily what you end up with; the actual value generally depends on the requested parallelism combined with the number of CPU cores, the block size, and so on.

3. Operators that trigger a shuffle can generally take an extra parameter that changes the parallelism:

val sum: RDD[(String, Int)] = keyword.reduceByKey(_ + _, 3)   // 3 sets the parallelism applied to the partitions
val sorted: RDD[(String, Int)] = sum.sortBy(_._2, false, 4)   // 4 sets the parallelism applied to the partitions

4. Manually adjusting the parallelism:

val lines: RDD[String] = sc.textFile("hdfs://bigdata111:9000/mrTest/wordcount.txt", 3) // 3 is the requested parallelism
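
Putting these together, a compact word count shows where each parallelism number ends up (the HDFS path is the one quoted above; any text file works):

// textFile's 3 is only a minimum hint; the real count also depends on the block size.
val lines  = sc.textFile("hdfs://bigdata111:9000/mrTest/wordcount.txt", 3)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 3)   // exactly 3 shuffle partitions
val sorted = counts.sortBy(_._2, ascending = false, numPartitions = 4)       // up to 4 partitions after the sort

counts.partitions.length   // 3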

———— Stay hungry, keep learning
                Jackson_MVP