RDD Transformation Operators: Single-Value Type
Table of Contents
- RDD Transformation Operators: Single-Value Type
- 1. map(func)
- 2. mapPartitions(func)
- 3. mapPartitionsWithIndex(func)
- 4. flatMap(func)
- 5. glom
- 6. groupBy(func)
- 7. filter(func)
- 8. sample(withReplacement, fraction, seed)
- 9. distinct([numTasks])
- 10. coalesce(numPartitions, shuffle)
- 11. repartition(numPartitions)
- 12. sortBy(func, [ascending], [numTasks])
- 13. pipe(command, [envVars])
1. map(func)
- Purpose: Returns a new RDD whose elements are the values produced by applying func to every element of the original RDD; in other words, it transforms each element of the RDD.
- Example:
// Create an RDD containing 1-10, then multiply every element by 2 to form a new RDD
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at makeRDD at <console>:24
scala> val newRdd = rdd.map(_*2)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at map at <console>:26
scala> newRdd.collect
res31: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
2. mapPartitions(func)
- Purpose: Similar to map(func), but runs independently on each partition, so func has the type Iterator<T> => Iterator<U>. With N elements spread over M partitions, map's function is invoked N times while mapPartitions' function is invoked only M times (see the sketch after the example).
- Example:
// Create an RDD containing 1-10, then multiply every element by 2 to form a new RDD
scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at makeRDD at <console>:24
scala> val newRdd = rdd.mapPartitions(par=>par.map(_*2))
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[43] at mapPartitions at <console>:26
scala> newRdd.collect
res32: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
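Because func runs once per partition, mapPartitions is a natural place for per-partition setup work. The following is a minimal sketch of that pattern (not from the original example; the factor value stands in for whatever expensive setup, such as opening a connection, you might need):
// Sketch: pay the setup cost once per partition instead of once per element
val rdd = sc.makeRDD(1 to 10, 4)
val newRdd = rdd.mapPartitions { iter =>
  val factor = 2          // imagine expensive per-partition setup here
  iter.map(_ * factor)    // reused for every element of this partition
}
newRdd.collect            // Array(2, 4, 6, ..., 20)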
3. mapPartitionsWithIndex(func)
- Purpose: Similar to mapPartitions(func), but func additionally receives an Int partition index, so its type is (Int, Iterator<T>) => Iterator<U>.
- Example:
// Create an RDD containing 1-10, then build a new RDD of (partition index, value) pairs
scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at makeRDD at <console>:24
scala> val newRdd = rdd.mapPartitionsWithIndex((index, par) => par.map((index, _)))
newRdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[45] at mapPartitionsWithIndex at <console>:26
scala> newRdd.collect
res33: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4), (1,5), (2,6), (2,7), (3,8), (3,9), (3,10))
4. flatMap(func)
- Purpose: Similar to map(func), but each input element can be mapped to 0 or more output elements, so func should return a sequence rather than a single element: T => TraversableOnce[U].
- Example:
// Create an RDD containing 1-10, then build a new RDD made of the square and the cube of every element
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:24
scala> val newRdd = rdd.flatMap(x => Array(x*x, x*x*x))
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[48] at flatMap at <console>:26
scala> newRdd.collect
res35: Array[Int] = Array(1, 1, 4, 8, 9, 27, 16, 64, 25, 125, 36, 216, 49, 343, 64, 512, 81, 729, 100, 1000)
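Since the sequence returned by func may also be empty, flatMap can drop elements while expanding others. A minimal sketch (not part of the original example):
// Sketch: even numbers produce two outputs, odd numbers produce none
val rdd = sc.makeRDD(1 to 5)
val newRdd = rdd.flatMap(x => if (x % 2 == 0) Seq(x, x * 10) else Seq.empty[Int])
newRdd.collect   // Array(2, 20, 4, 40)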
5. glom
- Purpose: Gathers the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].
- Example:
// Create an RDD with 4 partitions and put each partition's data into an array
scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at makeRDD at <console>:24
scala> val newRdd = rdd.glom
newRdd: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[50] at glom at <console>:26
scala> newRdd.collect
res36: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8, 9, 10))
6. groupBy(func)
- Purpose: Groups elements by the return value of func: that return value becomes the key and the matching elements are collected into an iterable, yielding an RDD of type RDD[(K, Iterable[T])]. The order of elements within each group is not guaranteed and may even differ between runs.
- Example:
// Create an RDD and group its elements by parity
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at makeRDD at <console>:24
scala> val newRdd = rdd.groupBy(e => if(e % 2 == 0) "even" else "odd")
newRdd: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[53] at groupBy at <console>:26
scala> newRdd.collect
res37: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8, 10)), (odd,CompactBuffer(1, 3, 5, 7, 9)))
7. filter(func)
- Purpose: Filters the RDD; the new RDD contains only the elements for which func returns true.
- Example:
// Create an RDD of strings and filter out a new RDD containing the strings that include the substring "xiao"
scala> val rdd = sc.makeRDD(Array("xiaozhu", "xiaozhang", "zhu", "zhang", "xiaohong"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[54] at makeRDD at <console>:24
scala> val newRdd = rdd.filter(_.contains("xiao"))
newRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[55] at filter at <console>:26
scala> newRdd.collect
res38: Array[String] = Array(xiaozhu, xiaozhang, xiaohong)
8. sample(withReplacement, fraction, seed)
- Purpose:
1.1 Randomly samples a fraction of the data using the specified random seed (the expected number of sampled elements is size * fraction); note that the result is not guaranteed to match the fraction exactly.
1.2 withReplacement indicates whether the sampling is done with replacement: true means an element may be drawn more than once, false means it cannot. If it is false, fraction must lie in [0, 1]; if it is true, fraction only needs to be >= 0.
1.3 seed specifies the seed of the random number generator. The default is usually fine, or the current timestamp can be passed in.
- Example:
// Sampling without replacement
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at makeRDD at <console>:24
scala> val newRdd = rdd.sample(false, 0.5)
newRdd: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[57] at sample at <console>:26
scala> newRdd.collect
res39: Array[Int] = Array(1, 2, 4, 5, 6, 7, 8, 9)
// Sampling with replacement
scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at makeRDD at <console>:24
scala> val newRdd = rdd.sample(true, 1.5)
newRdd: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[59] at sample at <console>:26
scala> newRdd.collect
res40: Array[Int] = Array(1, 1, 1, 2, 5, 5, 7, 8, 8, 9, 9, 9, 9, 10)
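When a reproducible sample is needed, a fixed seed can be passed as the third argument; a small sketch (the seed value 10 is arbitrary):
// Sketch: the same seed yields the same sample on every run
val rdd = sc.makeRDD(1 to 10)
rdd.sample(false, 0.5, 10).collect   // some subset of 1-10
rdd.sample(false, 0.5, 10).collect   // the same subset as the line above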
9. distinct([numTasks])
- Purpose: Removes duplicate elements from the RDD. The parameter sets the number of tasks; by default it matches the number of partitions.
- Example:
// Remove duplicate elements from the RDD
scala> val rdd = sc.makeRDD(Array(1, 1, 1, 2, 5, 5, 7, 8, 8, 9, 9, 9, 9, 10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[60] at makeRDD at <console>:24
scala> val newRdd = rdd.distinct
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[63] at distinct at <console>:26
scala> newRdd.collect
res41: Array[Int] = Array(8, 1, 9, 5, 10, 2, 7)
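The optional argument sets the parallelism of the de-duplication and therefore the number of output partitions; a small sketch:
// Sketch: run distinct with 2 tasks, which also gives 2 output partitions
val rdd2 = sc.makeRDD(Array(1, 1, 2, 2, 3), 4)
val deduped = rdd2.distinct(2)
deduped.partitions.length   // 2
deduped.collect             // Array(1, 2, 3), in some order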
10. coalesce(numPartitions, shuffle)
- Purpose: Reduces the number of partitions to the specified value; typically used after filtering a large dataset, to improve execution efficiency on the resulting small dataset. The second parameter controls whether a shuffle is performed: if it is omitted or false, no shuffle happens, so decreasing the partition count works but increasing it has no effect (see the sketch after the example).
- Example:
// Reduce 4 partitions to 2
scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:24
scala> rdd.partitions.length
res42: Int = 4
scala> val newRdd = rdd.coalesce(2)
newRdd: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[65] at coalesce at <console>:26
scala> newRdd.partitions.length
res43: Int = 2
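As noted above, without a shuffle coalesce can only shrink the partition count; asking for more partitions is silently ignored unless shuffle = true is passed. A small sketch:
// Sketch: growing the partition count only works when shuffle = true
val rdd4 = sc.makeRDD(1 to 10, 4)
rdd4.coalesce(8).partitions.length                   // still 4 (shuffle defaults to false)
rdd4.coalesce(8, shuffle = true).partitions.length   // 8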
11. repartition(numPartitions)
- Purpose: Reshuffles all the data according to the new partition count. This operation always moves data over the network; the new partition count may be larger or smaller than before.
- Example:
// Expand 2 partitions to 4
scala> val rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:24
scala> rdd.partitions.length
res44: Int = 2
scala> val newRdd = rdd.repartition(4)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[70] at repartition at <console>:26
scala> newRdd.partitions.length
res45: Int = 4
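Under the hood, repartition(n) is simply coalesce(n, shuffle = true), so the two calls below should produce the same partitioning; a small sketch:
// Sketch: repartition is coalesce with the shuffle forced on
val rdd2p = sc.makeRDD(1 to 10, 2)
rdd2p.repartition(4).partitions.length              // 4
rdd2p.coalesce(4, shuffle = true).partitions.length // 4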
12. sortBy(func, [ascending], [numTasks])
- Purpose: First applies func to the data, then sorts the elements by comparing the processed values. The default order is ascending.
- Example:
scala> val rdd = sc.makeRDD(Array(4,6,9,1,4,8,5,9,0))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:24
// Ascending order (the default)
scala> val newRdd = rdd.sortBy(x => x)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[76] at sortBy at <console>:26
scala> newRdd.collect
res46: Array[Int] = Array(0, 1, 4, 4, 5, 6, 8, 9, 9)
// Ascending order, with ascending passed explicitly
scala> val newRdd = rdd.sortBy(x => x, true)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[81] at sortBy at <console>:26
scala> newRdd.collect
res47: Array[Int] = Array(0, 1, 4, 4, 5, 6, 8, 9, 9)
// Descending order
scala> val newRdd = rdd.sortBy(x => x, false)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[86] at sortBy at <console>:26
scala> newRdd.collect
res48: Array[Int] = Array(9, 9, 8, 6, 5, 4, 4, 1, 0)
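The key function does not have to be the identity; for example, strings can be sorted by their length. A small sketch (the relative order of equal-length strings is not guaranteed):
// Sketch: sort strings by length rather than alphabetically
val words = sc.makeRDD(Array("spark", "rdd", "transformation", "map"))
words.sortBy(_.length).collect   // "rdd" and "map" first, then "spark", then "transformation"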
13. pipe(command, [envVars])
- Purpose: Pipes each partition of the RDD through a shell command or script and returns an RDD of the output lines. The command is executed once per partition, so an RDD with a single partition runs it exactly once.
- Example:
2.1 Create a script file (it needs to be executable, e.g. via chmod +x)
#!/bin/sh
echo "hello"
while read line; do
echo ">>>"$line
done
2.2 An RDD with a single partition
scala> val rdd = sc.makeRDD(1 to 5, 1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[91] at makeRDD at <console>:24
scala> val newRdd = rdd.pipe("./")
newRdd: org.apache.spark.rdd.RDD[String] = PipedRDD[92] at pipe at <console>:26
scala> newRdd.collect
res51: Array[String] = Array(hello, >>>1, >>>2, >>>3, >>>4, >>>5)
2.3 An RDD with multiple partitions
scala> val rdd = sc.makeRDD(1 to 5, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[93] at makeRDD at <console>:24
scala> val newRdd = rdd.pipe("./")
newRdd: org.apache.spark.rdd.RDD[String] = PipedRDD[94] at pipe at <console>:26
scala> newRdd.collect
res52: Array[String] = Array(hello, >>>1, >>>2, hello, >>>3, >>>4, >>>5)
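For reference, the command passed to pipe is normally the path to the script; a hypothetical sketch assuming the script above was saved as echo.sh in the working directory and made executable with chmod +x (the file name is an assumption, not from the original):
// Hypothetical sketch: pipe every partition of the RDD through ./echo.sh
val rdd = sc.makeRDD(1 to 5, 2)
val piped = rdd.pipe("./echo.sh")
piped.collect   // expect one "hello" header per partition plus one ">>>" line per element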