Spark mapValues runs too slowly: Spark mapPartitions

Reposted

mob64ca14040d22 2023-11-07 01:19:29


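Overview

This article covers what map and mapPartitions have in common and how they differ, summarizes the strengths and weaknesses of mapPartitions, and collects usage examples.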

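map and mapPartitions

map	mapPartitions
transformation	transformation
operates on one row at a time	operates on one partition at a time
returns one object for each row processed	called once per partition; returns an iterator over that partition's output
does not hold output in memory	may hold a partition's output in memory if the function collects all rows before returning
no natural place to create a service (reusable object) once per partition	easy to instantiate a service (reusable object) once per partition
no way to tell when to shut the service down (no cleanup method)	the service can be closed before returning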

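Key points of mapPartitions

Advantages of mapPartitions

mapPartitions is a specialized map whose function is called only once per partition.
Because the function runs once per partition, setup objects created inside it are created once per partition rather than once per record, so far fewer of them are created than with map.
When the data volume is large, mapPartitions does not need to load a whole partition into memory, provided the function consumes the input iterator lazily (for example with iter.map). This avoids out-of-memory errors caused by large inputs. Collecting the partition into a List or another collection gives up that benefit.
mapPartitions itself does not decide which data stays in memory and which goes to disk; that is handled by Spark's caching and shuffle layers. What mapPartitions contributes is streaming access to each partition.
The function works through iterators on both sides: its input parameter is an Iterator[T] that exposes the entire contents of a partition as a sequential stream of values.
The custom function must return another Iterator[U]. The resulting per-partition iterators are combined automatically into a new RDD.
The value returned by mapPartitions is an RDD, specifically a MapPartitionsRDD.
Unlike map, mapPartitions works on a whole partition of data, so objects can be created per partition instead of per record. For example, when connecting to an external database such as HBase or MySQL, only one connection object is created per partition.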
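
The per-partition versus per-record difference can be sketched in plain Scala without a cluster. Everything named here (PerPartitionSetup, ExpensiveFormatter, the created counter) is hypothetical and exists only for illustration; the counter is a local demonstration device, since on a real cluster each executor JVM would hold its own copy.

```scala
object PerPartitionSetup {
  // Counts constructions so the difference is observable (local demo only).
  var created: Int = 0

  // Hypothetical stand-in for an expensive resource such as a DB client.
  final class ExpensiveFormatter {
    PerPartitionSetup.created += 1
    def format(x: Int): String = s"item-$x"
  }

  // Shape expected by rdd.mapPartitions: Iterator[T] => Iterator[U].
  def formatPartition(records: Iterator[Int]): Iterator[String] = {
    val formatter = new ExpensiveFormatter // built once per partition
    records.map(x => formatter.format(x))  // lazy: one record at a time
  }

  // Shape expected by rdd.map: T => U.
  def formatRecord(x: Int): String =
    new ExpensiveFormatter().format(x)     // built once per record
}
```

With an RDD[Int] named rdd, these would be passed as rdd.mapPartitions(PerPartitionSetup.formatPartition) and rdd.map(PerPartitionSetup.formatRecord) respectively.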

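Things to watch out for with mapPartitions

mapPartitions processes each partition independently, so a result that must be global (a total, a maximum, a distinct set) cannot be computed inside the function alone. The per-partition results still have to be combined afterwards, for example with reduce.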
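
As an illustration of that caveat, here is a minimal sketch of a global maximum built from per-partition maxima. The names GlobalFromPartitions, partitionMax and globalMax are hypothetical, and globalMax only simulates partitions with local sequences; on a real RDD the equivalent would be along the lines of rdd.mapPartitions(partitionMax).reduce(math.max).

```scala
object GlobalFromPartitions {
  // Per-partition step: emit at most one value, and nothing for an
  // empty partition (calling max on an empty iterator would throw).
  def partitionMax(records: Iterator[Int]): Iterator[Int] =
    if (records.hasNext) Iterator.single(records.max) else Iterator.empty

  // Local stand-in for combining the per-partition results.
  // Returns None when every partition is empty.
  def globalMax(partitions: Seq[Seq[Int]]): Option[Int] =
    partitions.flatMap(p => partitionMax(p.iterator)).reduceOption(_ max _)
}
```

The per-partition function alone only sees its own slice of the data; the final reduce step is what makes the result global.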

mapPartitions in practice

A simple example

The goal is simple: turn an RDD[Int] into an RDD[(Int, Int)], where the second element of each tuple is twice the first.

Implementation with mapPartitions

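// 9 integers split across 3 partitions
val a = sc.parallelize(1 to 9, 3)

// Called once per partition; builds the whole partition's output as a List.
def doubleFunc(iter: Iterator[Int]): Iterator[(Int, Int)] = {
  var res = List[(Int, Int)]()
  while (iter.hasNext) {
    val cur = iter.next()
    // Prepending reverses the order of elements within each partition.
    res = (cur, cur * 2) :: res
  }
  res.iterator
}

val result = a.mapPartitions(doubleFunc)
println(result.collect().mkString)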
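
Note that doubleFunc above gathers the whole partition into a List before returning, and prepending to that List reverses the order within each partition. A possible alternative, sketched here under the hypothetical names LazyDouble and doubleLazy, transforms the iterator directly so that records are produced one at a time and keep their order.

```scala
object LazyDouble {
  // Iterator.map is lazy: nothing is computed until the consumer
  // asks for the next element, so no per-partition List is built.
  def doubleLazy(iter: Iterator[Int]): Iterator[(Int, Int)] =
    iter.map(cur => (cur, cur * 2))
}
```

It can be passed the same way, for example a.mapPartitions(LazyDouble.doubleLazy).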

Implementation with map

val a = sc.parallelize(1 to 9, 3)
def mapDoubleFunc(a : Int) : (Int, Int) = {
    (a,a*2)
}
val mapResult = a.map(mapDoubleFunc)

println(mapResult.collect().mkString)

Observing performance

In the spark-shell terminal, INFO logging can be turned on so that the run time of each job is visible.

sc.setLogLevel("INFO")

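Next, increase the size of the RDD to 1,000,000 elements and compare how long each version takes.
Note: with that much data there is no point in printing the values, so skip the println step. An action is still needed to trigger job submission, so run one of the following simple steps: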

result.take(1)
or
mapResult.take(1)

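In the author's test, with one million elements the mapPartitions version was faster. Keep in mind that take(1) evaluates only as many partitions as it needs to return one element, so an action that touches every record, such as count(), gives a fairer comparison.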
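
Reading timings out of the INFO log works, but a small helper can make the comparison easier to repeat. The following is only a sketch; Timing and timed are hypothetical names, and a single wall-clock measurement is still subject to JVM warm-up and scheduling noise.

```scala
object Timing {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val seconds = (System.nanoTime() - start) / 1e9
    (result, seconds)
  }
}
```

In spark-shell it could be used as val (n, secs) = Timing.timed(result.count()), and likewise for mapResult.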

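Word count on a text file

This example implements word count over a large file.
I prepared a 12 MB file (not really large) and processed it with both map and mapPartitions to count the words in it.

Implementation with mapPartitions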

val dataHDFSPath = "hdfs://hadoop3:7078/user/ubuntu/mldata/txtdata2"
val wordCount = sc.textFile(dataHDFSPath, 3).mapPartitions(lines => {
        lines.flatMap(_.split(" ")).map((_, 1))
  }).
  reduceByKey((total, agg) => total + agg).take(100)

In my test environment, the mapPartitions version took 0.144277 s in total.

Implementation with map

val dataHDFSPath = "hdfs://hadoop3:7078/user/ubuntu/mldata/txtdata2"
val wordCount = sc.textFile(dataHDFSPath, 3).
                    flatMap(line => line.split(" ")).
                    map(word => (word, 1)).
                    reduceByKey { (x, y) => x + y }

wordCount.take(100)

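In my test environment, the map version took 0.176129 s. A single run on a 12 MB file is a small sample, so a gap of about 0.03 s should be read with caution.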

Using mapPartitions with a DataFrame

import spark.implicits._
val dataDF = spark.read.format("json").load("basefile")

// Note: the function receives an Iterator[Row] for each partition
val newDF = dataDF.mapPartitions(iterator => {
  // p is a Row; here each row is replaced with a Seq, in this case Seq(1, 2)
  iterator.map(p => Seq(1, 2))
}).toDF("value")

newDF.write.json("newfile")

mapPartitions usage patterns

Below are some collected examples of using mapPartitions, for later reference.

Usage 1

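# This snippet appears to come from PySpark's RDD.foreachPartition:
# `self` is the RDD and `f` is the user function applied to each partition.
def func(it):
    r = f(it)
    try:
        # f may return an iterable ...
        return iter(r)
    except TypeError:
        # ... or a non-iterable such as None, in which case nothing is emitted
        return iter([])
self.mapPartitions(func).count()  # Force evaluation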

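Usage 2

# This snippet appears to come from PySpark's RDD.treeAggregate:
# `self` is the RDD; zeroValue, seqOp and depth are that method's parameters.
from math import ceil

def aggregatePartition(iterator):
    # Fold one partition into a single accumulator value
    acc = zeroValue
    for obj in iterator:
        acc = seqOp(acc, obj)
    yield acc

partiallyAggregated = self.mapPartitions(aggregatePartition)
numPartitions = partiallyAggregated.getNumPartitions()
# Fan-in per level of the aggregation tree, never less than 2
scale = max(int(ceil(pow(numPartitions, 1.0 / depth))), 2)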

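Usage 3

// n is a hypothetical window size; the original snippet leaves it undefined
val n = 2
val OneDocRDD = sc.textFile("myDoc1.txt", 2)
  .mapPartitions(iter => {
    // Objects that should be created once per partition, not once per
    // record, can be initialized here.
    // textFile yields plain String lines, so each line is paired with its
    // sliding windows of n characters (the original indexed x._1 and x._2,
    // which does not compile for a String).
    iter.map(line => (line, line.sliding(n).toList))
  })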

Usage 4

def onlyEven(numbers: Iterator[Int]) : Iterator[Int] = 
  numbers.filter(_ % 2 == 0)

def partitionSize(numbers: Iterator[Int]) : Iterator[Int] = 
  Iterator.single(numbers.length)

// 4 partitions are requested explicitly so the sizes below are reproducible
val rdd = sc.parallelize(0 to 10, 4)
rdd.mapPartitions(onlyEven).collect()
// Array[Int] = Array(0, 2, 4, 6, 8, 10)

rdd.mapPartitions(partitionSize).collect()
// Array[Int] = Array(2, 3, 3, 3)

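Usage 5

When an external connection has to be initialized, mapPartitions can initialize it once per partition, whereas doing the initialization inside the map function repeats it for every record, as in the following example: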

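val newRd = myRdd.mapPartitions(partition => {
  // Only one database connection is created per partition
  val connection = new DbConnection /*creates a db connection per partition*/

  // Iterate over the records in the partition, calling readMatchingFromDB on each one.
  // toList forces every record to be processed before the connection is closed.
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList

  // Close the database connection
  connection.close()
  // Return an iterator over the new List of results
  newPartition.iterator
})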

Or, more concisely:

rdd.mapPartitions(
  partitionIter => {
    partitionIter.map(
      line => func(line) // func stands for your per-record logic
    ).toList.iterator
  }
)
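
Both variants call toList so that every record is processed before the connection is closed, at the cost of holding the whole partition's output in memory. One possible way around that, sketched below, is to wrap the result iterator so that cleanup runs when it is exhausted. CloseWhenDone and closingIterator are hypothetical names, not part of Spark, and the sketch assumes the consumer reads the iterator to the end. If the consumer stops early or a record throws, the close callback never runs, so real code would probably also register cleanup through TaskContext.addTaskCompletionListener.

```scala
object CloseWhenDone {
  // Wraps `underlying` and runs `close` exactly once, the first time
  // the wrapped iterator reports that it has no more elements.
  def closingIterator[A](underlying: Iterator[A])(close: () => Unit): Iterator[A] =
    new Iterator[A] {
      private var closed = false

      def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !closed) {
          closed = true
          close()
        }
        more
      }

      def next(): A = underlying.next()
    }
}

// Sketch of use with the hypothetical DbConnection and readMatchingFromDB
// from the example above:
//
// myRdd.mapPartitions { partition =>
//   val connection = new DbConnection
//   CloseWhenDone.closingIterator(
//     partition.map(record => readMatchingFromDB(record, connection))
//   )(() => connection.close())
// }
```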

Usage 6

val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
    iter.map(x => index + "," + x)
}
val rdd2 = rdd1.mapPartitionsWithIndex(myfunc)
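
Another common reason to want the partition index is to treat one partition differently from the rest. The sketch below drops a header line, under the assumption that the input is a single text file whose header is the first line of partition 0; DropHeader and dropHeader are hypothetical names.

```scala
object DropHeader {
  // Shape expected by rdd.mapPartitionsWithIndex:
  // (Int, Iterator[T]) => Iterator[U].
  // Only partition 0 loses its first line; other partitions pass through.
  def dropHeader(index: Int, lines: Iterator[String]): Iterator[String] =
    if (index == 0) lines.drop(1) else lines
}
```

With an RDD[String] named lines, it would be applied as lines.mapPartitionsWithIndex(DropHeader.dropHeader).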
