In this case study I want to share the code improvements I made for handling large data volumes, out-of-memory errors, and low efficiency. Suggestions for further improvement are welcome.
Source data
Requirement: extract the subject and teacher information from the records and rank them by number of visits.
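Based on the URL pattern quoted in the code comments below (http://bigdata.51doit.cn/laozhang), the subject is the first label of the host name and the teacher is the path segment. A minimal sketch of the parsing logic shared by all three solutions:

```scala
// Sample line in the format quoted in the code comments below
val line = "http://bigdata.51doit.cn/laozhang"
// Splitting on "/" yields Array("http:", "", "bigdata.51doit.cn", "laozhang")
val st = line.split("/")
val subject = st(2).split("[.]")(0) // first host label: "bigdata"
val teacher = st(3)                 // path segment: "laozhang"
println((subject, teacher))
```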
Solution 1:
package day04

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FavoriteTeacher01 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName)
    val isLocal = args(0).toBoolean
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc: SparkContext = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile(args(1))
    // Parse the url --> http://bigdata.51doit.cn/laozhang
    val subjectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val st: Array[String] = line.split("/")
      val subject: String = st(2).split("[.]")(0)
      val teacher: String = st(3)
      ((subject, teacher), 1)
    })
    // Aggregate
    val reduce: RDD[((String, String), Int)] = subjectAndTeacher.reduceByKey(_ + _)
    val sort: RDD[((String, String), Int)] = reduce.sortBy(_._2, false)
    // Group by subject
    val group: RDD[(String, Iterable[((String, String), Int)])] = sort.groupBy(_._1._1)
    // The third argument sets how many entries to keep per subject
    val topN: Int = args(2).toInt
    val result: RDD[(String, List[((String, String), Int)])] = group.mapValues(_.toList.sortBy(-_._2).take(topN))
    // Trigger the Action and print
    val r: Array[(String, List[((String, String), Int)])] = result.collect()
    println(r.toBuffer)
    sc.stop()
  }
}
Drawback of Solution 1: groupBy pulls all of a subject's data into a single in-memory collection, so in some scenarios, when one group holds too much data, an out-of-memory error can occur.
Improvement: cache the result of the reduceByKey step in memory to avoid re-reading the source data from HDFS and recomputing it, and use a custom partitioner so that each subject maps to its own partition. The code is as follows:
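Both of the following solutions import Util.SubjectPartitioner, whose source is not shown in this post. Below is a hypothetical reconstruction of what it might look like: one partition per subject, with the subject-to-id map built once on the Driver. The real class would extend org.apache.spark.Partitioner and mark the two methods with override; that dependency is omitted here so the sketch stands alone:

```scala
import scala.collection.mutable

// Hypothetical reconstruction of Util.SubjectPartitioner.
// The real class would `extend org.apache.spark.Partitioner` and
// add `override` to the two methods below.
class SubjectPartitioner(subjects: Array[String]) {
  // subject -> partition id, built once on the Driver
  private val rules = new mutable.HashMap[String, Int]()
  private var i = 0
  for (sub <- subjects) {
    rules(sub) = i
    i += 1
  }

  def numPartitions: Int = subjects.length

  // Called on the Executors; the RDD key is a (subject, teacher) tuple
  def getPartition(key: Any): Int = {
    val subject = key.asInstanceOf[(String, String)]._1
    rules.getOrElse(subject, 0)
  }
}

val p = new SubjectPartitioner(Array("bigdata", "javaee", "php"))
println(p.getPartition(("javaee", "laozhang")))
```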
Solution 2
package day04

import Util.SubjectPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FavoriteTeacher02 {
  def main(args: Array[String]): Unit = {
    val isLocal = args(0).toBoolean
    // Create the SparkConf, then the SparkContext
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    // Specify where to read the data from when creating the RDD
    val lines: RDD[String] = sc.textFile(args(1))
    // Split each line
    val subjectTeacherAndOne = lines.map(line => {
      val fields = line.split("/")
      val subject = fields(2).split("[.]")(0)
      val teacher = fields(3)
      ((subject, teacher), 1)
    })
    // Aggregate
    val reduced: RDD[((String, String), Int)] = subjectTeacherAndOne.reduceByKey(_ + _)
    // A global sort is not what we want here -- we want a grouped TopN
    //val sorted = reduced.sortBy(_._2, false)
    // Cache reduced in memory: it is used twice below
    reduced.cache()
    // Compute the distinct subjects and collect them to the Driver
    val subjects: Array[String] = reduced.map(_._1._1).distinct().collect()
    // The partitioner is created (new-ed) on the Driver, but its methods are invoked on the Executors
    val partitioner = new SubjectPartitioner(subjects)
    // Repartition reduced with the custom partitioner
    val partitionedRDD: RDD[((String, String), Int)] = reduced.partitionBy(partitioner)
    // Process each partition: sort it and take the top 2
    val result: RDD[((String, String), Int)] = partitionedRDD.mapPartitions(it => it.toList.sortBy(-_._2).take(2).iterator)
    val r: Array[((String, String), Int)] = result.collect()
    println(r.toBuffer)
    sc.stop()
  }
}
Drawback of Solution 2: it is relatively inefficient -- reduceByKey and partitionBy each trigger a shuffle, and mapPartitions materializes an entire partition with toList, so with large data volumes an out-of-memory error is still possible. If instead we iterate over each partition with foreach and keep only a bounded number of records at a time, this problem can be avoided entirely. The code is as follows:
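The bounded-collection idea behind Solution 3 can be illustrated outside Spark: stream over the records, keep at most topN entries in a TreeSet, and evict the current minimum on overflow, so memory stays proportional to topN rather than to the partition size. A minimal sketch with invented sample counts:

```scala
import scala.collection.mutable

val topN = 2
// Invented sample data: ((subject, teacher), count) pairs for one partition
val records = List(
  (("bigdata", "laozhang"), 15),
  (("bigdata", "laoduan"), 6),
  (("bigdata", "laozhao"), 29),
  (("bigdata", "laoyang"), 9)
)
// Descending by count, with the key as a tie-breaker so that equal counts
// are not treated as duplicates by the TreeSet
implicit val sortRules: Ordering[((String, String), Int)] =
  Ordering[(Int, String, String)].on[((String, String), Int)](t => (-t._2, t._1._1, t._1._2))
val sorter = new mutable.TreeSet[((String, String), Int)]()
records.foreach { t =>
  sorter += t
  if (sorter.size > topN) sorter -= sorter.last // evict the current minimum
}
println(sorter.toList)
```

At no point does the set hold more than topN + 1 entries, which is why this approach survives arbitrarily large partitions.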
package day04

import Util.SubjectPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object FavoriteTeacher03 {
  def main(args: Array[String]): Unit = {
    val isLocal = args(0).toBoolean
    // Create the SparkConf, then the SparkContext
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    // Specify where to read the data from when creating the RDD
    val lines: RDD[String] = sc.textFile(args(1))
    // Split each line
    val subjectTeacherAndOne = lines.map(line => {
      val fields = line.split("/")
      val subject = fields(2).split("[.]")(0)
      val teacher = fields(3)
      ((subject, teacher), 1)
    })
    // Compute the distinct subjects and collect them to the Driver
    val subjects: Array[String] = subjectTeacherAndOne.map(_._1._1).distinct().collect()
    // The partitioner is created (new-ed) on the Driver, but its methods are invoked on the Executors
    val partitioner = new SubjectPartitioner(subjects)
    // Aggregate with the given key and partitioner (saves one shuffle)
    /**
     * def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
     *   combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
     * }
     * The reduceByKey source above shows that it accepts a custom Partitioner, so we
     * can aggregate and partition by subject in a single shuffle, guarding against
     * the overflow an oversized partition would otherwise cause later on.
     */
    val reduced: RDD[((String, String), Int)] = subjectTeacherAndOne.reduceByKey(partitioner, _ + _)
    val topN = args(2).toInt
    val result: RDD[((String, String), Int)] = reduced.mapPartitions(it => {
      // Define the sort rule as an implicit Ordering: descending by count, with the
      // key as a tie-breaker -- without it, the TreeSet would treat two entries with
      // the same count as duplicates and silently drop one
      implicit val sortRules: Ordering[((String, String), Int)] =
        Ordering[(Int, String, String)].on[((String, String), Int)](t => (-t._2, t._1._1, t._1._2))
      // A TreeSet keeps its elements sorted according to the implicit Ordering
      val sorter: mutable.TreeSet[((String, String), Int)] = new mutable.TreeSet[((String, String), Int)]()
      // Stream over the iterator, keeping at most topN entries at any time
      it.foreach(t => {
        sorter += t
        if (sorter.size > topN) {
          // Remove the smallest (last) entry
          val last = sorter.last
          sorter -= last
        }
      })
      sorter.iterator
    })
    val r: Array[((String, String), Int)] = result.collect()
    println(r.toBuffer)
    sc.stop()
  }
}
Supply your own main arguments (isLocal, input path, topN) and the programs will run. Solution 3 is the optimal one. Questions and corrections are welcome.