I. Background
There are four repartitioning operators in Spark:
1.repartition
2.coalesce
3.partitionBy
4.repartitionAndSortWithinPartitions
II. The map, mapPartitions, mapPartitionsWithIndex, sortBy, and sortByKey operators in Spark
1. Create a collection (with no partition count given, the default here would be 2 partitions)
The partition count is explicitly set to 3:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
2. Check the number of partitions
partitions.length
scala> rdd1.partitions.length
res0: Int = 3
3. map vs. mapPartitions
map applies the function to every element individually, while mapPartitions applies the function once per partition, receiving an iterator over that partition's elements. For small datasets map is perfectly fine; when there is more data per partition, mapPartitions can be more efficient because per-partition work is done once per partition rather than once per element. Be aware that if a single partition is very large, processing it in one go inside mapPartitions may cause an OOM error. A per-partition setup sketch follows the two examples below.
//First method
//Use map to multiply each element of rdd1 by 10
scala> rdd1.map(_*10).collect
res0: Array[Int] = Array(10, 20, 30, 40, 50, 60)
//Second method: mapPartitions (the Tab completion below lists the two related operators)
scala> rdd1.mapPartitions
mapPartitions mapPartitionsWithIndex
scala> rdd1.mapPartitions(_.map(_*10)).collect
res1: Array[Int] = Array(10, 20, 30, 40, 50, 60)
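A common reason to prefer mapPartitions is to amortize per-partition setup cost. The sketch below is illustrative (expensiveSetup is a hypothetical stand-in for something like opening a database connection or building a parser); it performs the setup once per partition and then maps over the partition's iterator:

// Hypothetical per-partition setup, done once per partition instead of once per element.
def expensiveSetup(): Int => Int = {
  println("setting up for one partition")   // visible in the console when running in local mode
  x => x * 10
}

val result = rdd1.mapPartitions { iter =>
  val f = expensiveSetup()   // runs once for each of the 3 partitions
  iter.map(f)                // applied lazily to every element of the partition
}
result.collect               // Array(10, 20, 30, 40, 50, 60)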
4. mapPartitionsWithIndex can be used to inspect the data partition by partition
scala> val Iter = (index:Int,iter:Iterator[(Int)]) =>{
| iter.map(x => "[partID:"+index+ ", value: "+x+"]")
| }
Iter: (Int, Iterator[Int]) => Iterator[String] = <function2>
scala> rdd1.mapPartitionsWithIndex(Iter).collect
res5: Array[String] = Array([partID:0, value: 1], [partID:0, value: 2], [partID:1, value: 3], [partID:1, value: 4], [partID:2, value: 5], [partID:2, value: 6])
5. Two ways to sort: sortBy and sortByKey
(1) Create a collection from an Array
scala> val rdd2 = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[5] at parallelize at <console>:24
//First method: sortByKey
scala> rdd2.sortByKey().collect
res6: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
//Ascending order by default; passing false sorts in descending order
scala> rdd2.sortByKey(false).collect
res7: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
//Second method: sortBy; x is each element and _1 selects the first field of the tuple (the key)
scala> rdd2.sortBy(x => x._1,false).collect
res8: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
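Unlike sortByKey, sortBy can sort by any derived key, not just the tuple's first element. A small sketch reusing rdd2, sorting by the String value instead of the Int key (the expected results are written as comments):

// Sort by the second field (the String value) in ascending order.
rdd2.sortBy(_._2).collect
// Array((3,aa), (2,bb), (6,cc), (1,dd))

// Descending by value, also asking for 2 output partitions.
rdd2.sortBy(_._2, ascending = false, numPartitions = 2).collect
// Array((1,dd), (6,cc), (2,bb), (3,aa))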
III. Repartitioning operators
Create a List collection
scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6),3)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24
1. The repartition operator
repartition always repartitions with shuffle = true; as the source shows, it is simply coalesce with shuffle enabled.
Source:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
scala> rdd3.repartition(6)
res9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at repartition at <console>:27
//The original RDD's partition count is unchanged
scala> rdd3.partitions.length
res10: Int = 3
//The new RDD's partition count has changed
scala> res9.partitions.length
res11: Int = 6
//Set the partition count to 40 (increase)
scala> rdd3.repartition(40)
res12: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[25] at repartition at <console>:27
//Set the partition count to 1 (decrease)
scala> rdd3.repartition(1)
res14: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[29] at repartition at <console>:27
//Check the decreased partition count
scala> res14.partitions.length
res15: Int = 1
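To see where the elements end up after a repartition, the Iter helper defined in section II can be reused. This is only a sketch; the exact element-to-partition assignment can vary between runs, since repartition distributes records across the new partitions via a shuffle:

// Inspect the new partition layout after repartitioning rdd3 to 6 partitions.
val repartitioned = rdd3.repartition(6)
repartitioned.mapPartitionsWithIndex(Iter).collect.foreach(println)
// Each element is printed with its new partID; the 6 elements are spread over the 6 partitions.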
2. The coalesce operator
scala> val rdd4 = rdd3.coalesce(8)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[30] at coalesce at <console>:26
//1. Without shuffle, increasing the partition count has no effect
scala> rdd4.partitions.length
res16: Int = 3
//2. Decreasing the partition count works
scala> val rdd4 = rdd3.coalesce(1)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[31] at coalesce at <console>:26
scala> rdd4.partitions.length
res17: Int = 1
//3. Keeping the original partition count also works
scala> val rdd4 = rdd3.coalesce(3)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[32] at coalesce at <console>:26
scala> rdd4.partitions.length
res18: Int = 3
//1. Source signature of coalesce:
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
//2. By default coalesce only merges partitions, so no shuffle occurs
//3. coalesce(4, true): with the second argument set to true, the partition count can also be increased
scala> val rdd4 = rdd3.coalesce(4,true)
rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at coalesce at <console>:26
//4. Check the modified partition count
scala> rdd4.partitions.length
res20: Int = 4
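The difference between the two variants can also be read off the RDD lineage: toDebugString shows a ShuffledRDD stage for repartition (coalesce with shuffle = true) but not for a plain coalesce. A quick sketch:

// Plain coalesce: the lineage is a CoalescedRDD on top of the parent, no shuffle stage.
println(rdd3.coalesce(1).toDebugString)

// repartition: the lineage contains a ShuffledRDD, i.e. a full shuffle takes place.
println(rdd3.repartition(1).toDebugString)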
3. The partitionBy operator
Create rdd5 to demonstrate repartitioning with partitionBy
scala> val rdd5 = sc.parallelize(List(("e",5),("c",3),("d",4),("c",2),("a",1)),2)
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24
(1) partitionBy (source): it works only on key-value RDDs, and a custom Partitioner can be supplied
def partitionBy(partitioner: Partitioner): JavaPairRDD[K, V] =
fromRDD(rdd.partitionBy(partitioner))
(2) In practice partitionBy is usually used with HashPartitioner
//HashPartitioner source:
class HashPartitioner(partitions: Int) extends Partitioner {
require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
(3) Repartitioning with partitionBy
//1. This fails because HashPartition is not a class in org.apache.spark; the correct name is HashPartitioner
scala> rdd5.partitionBy(new org.apache.spark.HashPartition(4))
<console>:27: error: type HashPartition is not a member of package org.apache.spark
rdd5.partitionBy(new org.apache.spark.HashPartition(4))
^
//2. With the correct name, org.apache.spark.HashPartitioner, the repartitioning works
scala> rdd5.partitionBy(new org.apache.spark.HashPartitioner(4))
res23: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[39] at partitionBy at <console>:27
//3. Check the new partition count
scala> res23.partitions.length
res24: Int = 4
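Since partitionBy accepts any Partitioner, a custom one can be plugged in instead of HashPartitioner. The sketch below defines a hypothetical two-bucket partitioner that routes key "a" to partition 0 and every other key to partition 1, then verifies the layout:

import org.apache.spark.Partitioner

// Hypothetical custom partitioner: "a" goes to partition 0, all other keys to partition 1.
class TwoBucketPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = if (key == "a") 0 else 1
}

val custom = rdd5.partitionBy(new TwoBucketPartitioner)
custom.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(kv => "[partID:" + idx + ", value: " + kv + "]")
}.collect
// Expected: ("a",1) alone in partition 0, the remaining pairs in partition 1.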
4. The repartitionAndSortWithinPartitions operator
repartitionAndSortWithinPartitions source:
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}
Repartitioning with the repartitionAndSortWithinPartitions operator (rdd6 below is assumed to be a key-value RDD with the same data as rdd5; its definition did not make it into the transcript)
//repartitionAndSortWithinPartitions operator
scala> rdd6.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(1))
res29: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[41] at repartitionAndSortWithinPartitions at <console>:27
scala> res29.collect
res32: Array[(String, Int)] = Array((a,1), (c,3), (c,2), (d,4), (e,5))
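repartitionAndSortWithinPartitions shuffles and sorts in a single pass, which is generally cheaper than calling repartition and then sorting inside each partition separately. A sketch that reuses rdd5 with 2 target partitions and inspects the per-partition ordering (which keys land in which partition is decided by HashPartitioner):

// Shuffle rdd5 into 2 partitions and sort by key within each partition in one pass.
val sortedWithin = rdd5.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(2))

sortedWithin.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(kv => "[partID:" + idx + ", value: " + kv + "]")
}.collect.foreach(println)
// Within each partition, the keys appear in sorted order.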
Advanced operators:
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
IV. Extension: parallelism and partition count
1. In general, the number of partitions you request is the number you get
2. The parallelism you request is not necessarily what you end up with; the effective value also depends on the number of CPU cores, the block size of the input, and so on
3. Operators that trigger a shuffle generally accept an extra argument that changes the number of partitions (parallelism), as in the snippet below
val sum: RDD[(String, Int)] = keyword.reduceByKey(_ + _, 3)   // 3 sets the number of partitions used by the shuffle
val sorted: RDD[(String, Int)] = sum.sortBy(_._2, false, 4)   // 4 sets the number of partitions used by the sort
Manually adjusting the parallelism when reading a file:
val lines: RDD[String] = sc.textFile("hdfs://bigdata111:9000/mrTest/wordcount.txt", 3) // 3 is the requested (minimum) number of partitions
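Besides passing a partition count to individual operators, a default for shuffle operators can be set once via spark.default.parallelism. A minimal sketch for a standalone application (the master, app name, and value are illustrative; in the spark-shell the SparkContext already exists):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration: shuffle operators such as reduceByKey use 6 partitions
// by default when no explicit numPartitions argument is passed.
val conf = new SparkConf()
  .setMaster("local[*]")                  // hypothetical master for a local run
  .setAppName("parallelism-demo")         // hypothetical app name
  .set("spark.default.parallelism", "6")
val sc = new SparkContext(conf)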
————Stay hungry, keep learning
Jackson_MVP