Spark SQL percentage conversion

Reposted

jkfox 2024-07-11 07:10:35

Tags: Spark SQL percentage conversion, spark, big data, rdd, transformation operators. Categories: Spark, big data

RDD Transformation Operators: Single-Value Type

Contents

1. map(func)
2. mapPartitions(func)
3. mapPartitionsWithIndex(func)
4. flatMap(func)
5. glom
6. groupBy(func)
7. filter(func)
8. sample(withReplacement, fraction, seed)
9. distinct([numTasks])
10. coalesce(numPartitions, shuffle)
11. repartition(numPartitions)
12. sortBy(func, [ascending], [numTasks])
13. pipe(command, [envVars])

1. map(func)

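Purpose: Returns a new RDD formed by passing each element of the source RDD through the function func. In other words, it transforms the data in the RDD element by element.
Example:

// Create an RDD containing 1 to 10, then multiply each element by 2 to form a new RDD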

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at makeRDD at <console>:24

scala> val newRdd = rdd.map(_*2)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at map at <console>:26

scala> newRdd.collect
res31: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

2. mapPartitions(func)

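Purpose: Similar to map(func), but runs independently on each partition, so func has type Iterator<T> => Iterator<U>. Given N elements and M partitions, the function passed to map is called N times, whereas the function passed to mapPartitions is called M times.
Example:

// Create an RDD containing 1 to 10 in 4 partitions, then multiply each element by 2 to form a new RDD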

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at makeRDD at <console>:24

scala> val newRdd = rdd.mapPartitions(par=>par.map(_*2))
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[43] at mapPartitions at <console>:26

scala> newRdd.collect
res32: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
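
The N-versus-M call count described above can be observed directly. The following is a minimal sketch, assuming a spark-shell session (Spark 2.0 or later) where `sc` is the predefined SparkContext. The accumulator names `mapCalls` and `partCalls` are made up for illustration, and accumulator updates inside transformations can be applied more than once if a task is retried, so the counts are only exact in a run without retries.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 10, 4)

// Hypothetical accumulators that count how often each user function runs.
val mapCalls  = sc.longAccumulator("mapCalls")
val partCalls = sc.longAccumulator("partCalls")

// The function passed to map runs once per element.
rdd.map { x => mapCalls.add(1); x * 2 }.collect()

// The function passed to mapPartitions runs once per partition.
rdd.mapPartitions { it => partCalls.add(1); it.map(_ * 2) }.collect()

// Expected without task retries: 10 calls for map, 4 for mapPartitions.
println(s"map: ${mapCalls.value}, mapPartitions: ${partCalls.value}")
```

Because the function is invoked once per partition, mapPartitions is a natural place for per-partition setup such as opening a connection, at the cost of having to work with the iterator explicitly.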

3. mapPartitionsWithIndex(func)

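Purpose: Similar to mapPartitions(func), but func additionally receives the partition index as an Int, so func has type (Int, Iterator<T>) => Iterator<U>.
Example:

// Create an RDD containing 1 to 10, then build a new RDD of (partition index, element) pairs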

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at makeRDD at <console>:24


scala> val newRdd = rdd.mapPartitionsWithIndex((index, par) => par.map((index, _)))
newRdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[45] at mapPartitionsWithIndex at <console>:26

scala> newRdd.collect
res33: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4), (1,5), (2,6), (2,7), (3,8), (3,9), (3,10))
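
The partition index is also handy for inspecting how data is distributed. As a small sketch, again assuming spark-shell's `sc`, the snippet below emits one (partition index, element count) pair per partition. The name `sizes` is illustrative only.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 10, 4)

// Emit a single (index, size) pair for each partition.
val sizes = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

// With the split shown above: Array((0,2), (1,3), (2,2), (3,3))
sizes.foreach(println)
```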

4. flatMap(func)

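Purpose: Similar to map(func), but each input element can be mapped to zero or more output elements, so func should return a sequence rather than a single element: T => TraversableOnce[U].
Example:

// Create an RDD containing 1 to 10, then build a new RDD made of the square and the cube of each element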

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:24

scala> val newRdd = rdd.flatMap(x => Array(x*x, x*x*x))
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[48] at flatMap at <console>:26

scala> newRdd.collect
res35: Array[Int] = Array(1, 1, 4, 8, 9, 27, 16, 64, 25, 125, 36, 216, 49, 343, 64, 512, 81, 729, 100, 1000)
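
The "zero or more" part is easy to miss, since the example above always returns two values. A short sketch, assuming spark-shell's `sc`, in which odd inputs produce no output at all (the name `evensTwice` is illustrative):

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 6)

// Even numbers yield two outputs; odd numbers yield an empty sequence.
val evensTwice = rdd.flatMap(x => if (x % 2 == 0) Seq(x, x) else Seq.empty[Int])

val result = evensTwice.collect()
// Array(2, 2, 4, 4, 6, 6)
```

Returning an empty sequence lets flatMap filter and transform in a single pass.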

5. glom

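Purpose: Merges the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].
Example:

// Create an RDD with 4 partitions and put the data of each partition into an array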

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at makeRDD at <console>:24

scala> val newRdd= rdd.glom
newRdd: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[50] at glom at <console>:26

scala> newRdd.collect
res36: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8, 9, 10))
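
One common use of glom is computing a per-partition aggregate. The sketch below, assuming spark-shell's `sc`, takes the maximum of each partition. It relies on every partition being non-empty, because `max` on an empty array throws an exception.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 10, 4)

// One array per partition, then the maximum of each array.
// Caveat: _.max fails on an empty partition; here all 4 partitions hold data.
val maxPerPartition = rdd.glom.map(_.max).collect()
// With the split shown above: Array(2, 5, 7, 10)

val sumOfMaxes = maxPerPartition.sum
```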

6. groupBy(func)

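Purpose: Groups elements by the return value of func. The return value serves as the key and the matching elements are collected into an Iterable, so the resulting RDD has type RDD[(K, Iterable[T])]. The order of elements within each group is not guaranteed and may even differ from one call to the next.
Example:

// Create an RDD and group its elements by parity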

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at makeRDD at <console>:24

scala> val newRdd = rdd.groupBy(e => if(e % 2 == 0) "even" else "odd")
newRdd: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[53] at groupBy at <console>:26

scala> newRdd.collect
res37: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8, 10)), (odd,CompactBuffer(1, 3, 5, 7, 9)))
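
Since the order inside each group is not guaranteed, it is safer to rely on order-independent properties of a group, such as its size. A minimal sketch assuming spark-shell's `sc`; grouping by remainder modulo 3 is just an illustrative key choice:

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 10)

// Group by remainder modulo 3, then keep only the size of each group.
val counts = rdd.groupBy(_ % 3).mapValues(_.size).collect().toMap
// Map(0 -> 3, 1 -> 4, 2 -> 3), independent of the order inside each group
```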

7. filter(func)

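Purpose: Filters the RDD. The new RDD consists of the elements for which func returns true.
Example:

// Create an RDD of strings and filter it into a new RDD of the strings containing the substring "xiao"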

scala> val rdd = sc.makeRDD(Array("xiaozhu", "xiaozhang", "zhu", "zhang", "xiaohong"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[54] at makeRDD at <console>:24

scala> val newRdd = rdd.filter(_.contains("xiao"))
newRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[55] at filter at <console>:26

scala> newRdd.collect
res38: Array[String] = Array(xiaozhu, xiaozhang, xiaohong)

8. sample(withReplacement, fraction, seed)

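Purpose:
1.1 Randomly samples a proportion fraction of the data using the given random seed (the expected number of sampled elements is size * fraction). Note that the result is not guaranteed to match this proportion exactly.
1.2 withReplacement specifies whether sampled elements are put back: true samples with replacement, false samples without replacement. With replacement an element may be drawn more than once; without replacement it cannot. If false, fraction must lie in [0, 1]; if true, any value greater than or equal to 0 is allowed.
1.3 seed sets the seed of the random number generator. Usually the default is used, or the current timestamp is passed in.
Example:

// Sampling without replacement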

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at makeRDD at <console>:24

scala> val newRdd = rdd.sample(false, 0.5)
newRdd: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[57] at sample at <console>:26

scala> newRdd.collect
res39: Array[Int] = Array(1, 2, 4, 5, 6, 7, 8, 9)

// Sampling with replacement

scala> val rdd = sc.makeRDD(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at makeRDD at <console>:24

scala> val newRdd = rdd.sample(true, 1.5)
newRdd: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[59] at sample at <console>:26

scala> newRdd.collect
res40: Array[Int] = Array(1, 1, 1, 2, 5, 5, 7, 8, 8, 9, 9, 9, 9, 10)
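
The effect of the seed argument can be checked by sampling twice with the same value. This is a sketch assuming spark-shell's `sc`; the seed 42 and the names `a` and `b` are arbitrary choices for illustration. No sample size is asserted, because the fraction is only an expectation.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 100, 4)

// Same RDD, same partitioning, same seed: the two samples should be identical.
val a = rdd.sample(withReplacement = false, fraction = 0.3, seed = 42L).collect()
val b = rdd.sample(withReplacement = false, fraction = 0.3, seed = 42L).collect()

println(a.sameElements(b))          // true
println(a.distinct.length == a.length) // true: no repeats without replacement
```

Fixing the seed is useful in tests; leaving it at the default gives a different sample on each call.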

9. distinct([numTasks])

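Purpose: Removes duplicate elements from the RDD. The optional argument sets the number of tasks; by default it matches the number of partitions.
Example:

// Remove duplicate elements from an RDD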

scala> val rdd = sc.makeRDD(Array(1, 1, 1, 2, 5, 5, 7, 8, 8, 9, 9, 9, 9, 10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[60] at makeRDD at <console>:24

scala> val newRdd = rdd.distinct
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[63] at distinct at <console>:26

scala> newRdd.collect
res41: Array[Int] = Array(8, 1, 9, 5, 10, 2, 7)
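
The optional argument can be seen in the partition count of the result. A brief sketch assuming spark-shell's `sc`; the result is sorted on the driver because distinct does not guarantee any output order.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(Array(1, 1, 2, 2, 3, 3), 4)

// Ask for 2 partitions in the deduplicated result instead of the default 4.
val deduped = rdd.distinct(2)

println(deduped.partitions.length)             // 2
println(deduped.collect().sorted.mkString(",")) // 1,2,3
```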

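10. coalesce(numPartitions, shuffle)

Purpose: Reduces the number of partitions to the given number; useful for running a small dataset more efficiently after filtering a large one. The second argument controls whether to shuffle. If it is omitted or false, no shuffle is performed, in which case reducing the number of partitions takes effect but increasing it does not.
Example:

// Reduce 4 partitions to 2 partitions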

scala> val rdd = sc.makeRDD(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at makeRDD at <console>:24

scala> rdd.partitions.length
res42: Int = 4

scala> val newRdd = rdd.coalesce(2)
newRdd: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[65] at coalesce at <console>:26

scala> newRdd.partitions.length
res43: Int = 2
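
The rule about increasing the partition count can be verified by asking coalesce for more partitions with and without a shuffle. A minimal sketch assuming spark-shell's `sc`; the names `noShuffle` and `withShuffle` are illustrative.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 10, 2)

// Without a shuffle, a request for more partitions has no effect.
val noShuffle = rdd.coalesce(4)

// With shuffle = true, the partition count can grow.
val withShuffle = rdd.coalesce(4, shuffle = true)

println(noShuffle.partitions.length)   // 2
println(withShuffle.partitions.length) // 4
```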

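11. repartition(numPartitions)

Purpose: Reshuffles all the data into the new number of partitions. This operation always moves data over the network; the new partition count may be larger or smaller than before.
Example:

// Expand 2 partitions to 4 partitions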

scala> val rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:24

scala> rdd.partitions.length
res44: Int = 2

scala> val newRdd = rdd.repartition(4)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[70] at repartition at <console>:26

scala> newRdd.partitions.length
res45: Int = 4
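
repartition works in the other direction too. The sketch below, assuming spark-shell's `sc`, shrinks 4 partitions to 2 and checks that the set of elements is unchanged. The collected result is sorted because a shuffle does not preserve element order.

```scala
// Assumes spark-shell, where `sc` is already defined.
val rdd = sc.makeRDD(1 to 10, 4)

// Shrink from 4 partitions to 2; a full shuffle still takes place.
val fewer = rdd.repartition(2)

println(fewer.partitions.length)              // 2
println(fewer.collect().sorted.mkString(",")) // 1,2,3,4,5,6,7,8,9,10
```

When only reducing the partition count, coalesce without a shuffle is usually the cheaper choice, since it avoids moving all the data.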

12. sortBy(func, [ascending], [numTasks])

Purpose: Applies func to each element first, then sorts by comparing the results. The default order is ascending.
Example:

scala> val rdd = sc.makeRDD(Array(4,6,9,1,4,8,5,9,0))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:24

// Ascending (default)
scala> val newRdd = rdd.sortBy(x => x)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[76] at sortBy at <console>:26

scala> newRdd.collect
res46: Array[Int] = Array(0, 1, 4, 4, 5, 6, 8, 9, 9)

// Ascending (explicit)
scala> val newRdd = rdd.sortBy(x => x, true)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[81] at sortBy at <console>:26

scala> newRdd.collect
res47: Array[Int] = Array(0, 1, 4, 4, 5, 6, 8, 9, 9)

// Descending
scala> val newRdd = rdd.sortBy(x => x, false)
newRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[86] at sortBy at <console>:26

scala> newRdd.collect
res48: Array[Int] = Array(9, 9, 8, 6, 5, 4, 4, 1, 0)
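
The examples above sort by the element itself, so func does no real work. As a sketch of sorting by a derived key, assuming spark-shell's `sc`, the snippet below orders strings by length and breaks ties alphabetically using a tuple key. The sample words are made up for illustration.

```scala
// Assumes spark-shell, where `sc` is already defined.
val words = sc.makeRDD(Seq("spark", "rdd", "scala", "hi"))

// Sort by (length, word): shorter words first, ties broken alphabetically.
val byLength = words.sortBy(w => (w.length, w)).collect()
// Array(hi, rdd, scala, spark)

// The same key in descending order.
val byLengthDesc = words.sortBy(w => (w.length, w), ascending = false).collect()
// Array(spark, scala, rdd, hi)
```

The elements themselves are unchanged; func only supplies the value used for comparison.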

13. pipe(command, [envVars])

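Purpose: Pipes each partition to a shell command or script and returns the output as an RDD. The command runs once per partition, so an RDD with a single partition runs it exactly once.
Example:
2.1 Create a script file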

echo "hello"
while read line; do
  echo ">>>"$line
done

2.2 An RDD with a single partition

scala> val rdd = sc.makeRDD(1 to 5, 1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[91] at makeRDD at <console>:24

scala> val newRdd = rdd.pipe("./")
newRdd: org.apache.spark.rdd.RDD[String] = PipedRDD[92] at pipe at <console>:26

scala> newRdd.collect
res51: Array[String] = Array(hello, >>>1, >>>2, >>>3, >>>4, >>>5)

2.3 An RDD with multiple partitions

scala> val rdd = sc.makeRDD(1 to 5, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[93] at makeRDD at <console>:24

scala> val newRdd = rdd.pipe("./")
newRdd: org.apache.spark.rdd.RDD[String] = PipedRDD[94] at pipe at <console>:26

scala> newRdd.collect
res52: Array[String] = Array(hello, >>>1, >>>2, hello, >>>3, >>>4, >>>5)
