In this post I only share the code improvements I made in this case study to deal with large data volumes, out-of-memory errors, and poor efficiency. Suggestions for further improvement are welcome.


Sample data

Each record is an access-log URL of the form http://<subject>.51doit.cn/<teacher>, for example http://bigdata.51doit.cn/laozhang: the subject is the first segment of the hostname and the teacher is the path.


Requirement: extract the subject and teacher information from the data and rank teachers by the number of visits (top N per subject).


Solution 1:


package day04

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FavoriteTeacher01 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName)
    val isLocal = args(0).toBoolean
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc: SparkContext = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile(args(1))
    //parse the url  -->http://bigdata.51doit.cn/laozhang
    val subjectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val st: Array[String] = line.split("/")
      val subject: String = st(2).split("[.]")(0)
      val teacher: String = st(3)
      ((subject, teacher), 1)
    })

    //aggregate: total visits per (subject, teacher)
    val reduce: RDD[((String, String), Int)] = subjectAndTeacher.reduceByKey(_ + _)
    //sort globally by visit count (descending)
    val sort: RDD[((String, String), Int)] = reduce.sortBy(_._2, false)
    //group by subject
    val group: RDD[(String, Iterable[((String, String), Int)])] = sort.groupBy(_._1._1)
    //args(2) sets the top-N size
    val topN: Int = args(2).toInt
    val result: RDD[(String, List[((String, String), Int)])] = group.mapValues(_.toList.sortBy(-_._2).take(topN))
    //trigger an Action and print
    val r: Array[(String, List[((String, String), Int)])] = result.collect()
    println(r.toBuffer)
    sc.stop()
  }

}

Drawback of Solution 1: in some scenarios a single group can hold too much data, and because groupBy gathers all of a subject's records into one in-memory collection (and mapValues then calls toList on it), this can cause an out-of-memory error.
Improvement: cache the result of reduceByKey in memory so that the source data in HDFS is not read and recomputed repeatedly, and use a custom partitioner so that each subject maps to its own partition (a sketch of the partitioner is shown after the Solution 2 code). The code is as follows:


Solution 2


package day04

import Util.SubjectPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FavoriteTeacher02 {
  def main(args: Array[String]): Unit = {
    val isLocal = args(0).toBoolean
    //create the SparkConf, then the SparkContext
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    //specify where to read the data from when creating the RDD
    val lines: RDD[String] = sc.textFile(args(1))
    //split each line
    val subjectTeacherAndOne = lines.map(line => {
      val fields = line.split("/")
      val subject = fields(2).split("[.]")(0)
      val teacher = fields(3)
      ((subject, teacher), 1)
    })
    //aggregate
    val reduced: RDD[((String, String), Int)] = subjectTeacherAndOne.reduceByKey(_ + _)
    //a global sort is not what we want; we want a grouped top-N
    //val sorted = reduced.sortBy(_._2, false)
    //cache the reduced RDD in memory
    reduced.cache()
    //compute all the subjects and collect them to the Driver
    val subjects: Array[String] = reduced.map(_._1._1).distinct().collect()

    //the partitioner is instantiated on the Driver, but its methods are called on the Executors
    val partitioner = new SubjectPartitioner(subjects)

    //repartition the reduced data with the custom partitioner
    val partitionedRDD: RDD[((String, String), Int)] = reduced.partitionBy(partitioner)
    //process each partition: take the top 2 (subject, teacher) pairs per subject
    val result: RDD[((String, String), Int)] = partitionedRDD.mapPartitions(it => it.toList.sortBy(-_._2).take(2).iterator)
    val r: Array[((String, String), Int)] = result.collect()
    println(r.toBuffer)
    sc.stop()
  }
}
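
Both Solution 2 and Solution 3 import Util.SubjectPartitioner, but its source is not shown in this post. The sketch below is one way such a partitioner could be written, assuming it does nothing more than assign each subject its own partition id (the class body here is my reconstruction, not the original):

package Util

import org.apache.spark.Partitioner

import scala.collection.mutable

//assumed implementation: one partition per subject
class SubjectPartitioner(subjects: Array[String]) extends Partitioner {

  //build a subject -> partition-id map on the Driver; it is shipped to the Executors together with the partitioner
  private val rules = new mutable.HashMap[String, Int]()
  private var index = 0
  for (subject <- subjects) {
    rules(subject) = index
    index += 1
  }

  //the number of partitions equals the number of subjects
  override def numPartitions: Int = subjects.length

  //the key is a (subject, teacher) tuple; partition by the subject only
  override def getPartition(key: Any): Int = {
    val subject = key.asInstanceOf[(String, String)]._1
    rules(subject)
  }
}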

Drawback of Solution 2: it is relatively inefficient, because reduceByKey and partitionBy each trigger a shuffle, and mapPartitions still pulls an entire partition into memory with toList. With a large enough data volume an out-of-memory error can therefore still occur. If instead we traverse each partition with foreach and keep only a bounded number of records at any time, the problem can be avoided entirely. The code is as follows:


Solution 3

package day04

import Util.SubjectPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object FavoriteTeacher03 {
  def main(args: Array[String]): Unit = {
    val isLocal = args(0).toBoolean
    //create the SparkConf, then the SparkContext
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    //specify where to read the data from when creating the RDD
    val lines: RDD[String] = sc.textFile(args(1))
    //split each line
    val subjectTeacherAndOne = lines.map(line => {
      val fields = line.split("/")
      val subject = fields(2).split("[.]")(0)
      val teacher = fields(3)
      ((subject, teacher), 1)
    })

    //compute all the subjects and collect them to the Driver
    val subjects: Array[String] = subjectTeacherAndOne.map(_._1._1).distinct().collect()
    //the partitioner is instantiated on the Driver, but its methods are called on the Executors
    val partitioner = new SubjectPartitioner(subjects)
    //aggregate by key with the custom partitioner (saves one shuffle compared to Solution 2)
    /**
     * def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
     *   combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
     * }
     * This is the source of reduceByKey: it accepts a custom Partitioner, so we can pass in our own
     * partitioner to prevent a single partition from growing so large that it overflows memory later on.
     */
    val reduced: RDD[((String, String), Int)] = subjectTeacherAndOne.reduceByKey(partitioner, _ + _)
    val topN = args(2).toInt
    val result: RDD[((String, String), Int)] = reduced.mapPartitions(it => {
      //implicit ordering: by visit count descending, with the key as a tie-breaker so that
      //entries with equal counts are not collapsed by the TreeSet
      implicit val sortRules: Ordering[((String, String), Int)] =
        Ordering.by[((String, String), Int), (Int, String, String)](t => (-t._2, t._1._1, t._1._2))
      //a sorted set that is never allowed to grow beyond topN elements
      val sorter = new mutable.TreeSet[((String, String), Int)]()
      //iterate over the records in this partition
      it.foreach(t => {
        sorter += t
        if (sorter.size > topN) {
          //drop the entry with the smallest count (the last element under this ordering)
          sorter -= sorter.last
        }
      })
      sorter.iterator
    })
    val r: Array[((String, String), Int)] = result.collect()
    println(r.toBuffer)
    sc.stop()
  }
}

Supply your own main arguments (args(0): whether to run locally, args(1): the input path, args(2): the top-N size) and the program can be run. Solution 3 is the best of the three solutions. If you spot any problems, please let me know.
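
For reference, a job such as Solution 3 could be submitted roughly as shown below; the master URL, jar name, and input path are placeholders for your own environment, and the three trailing values correspond to args(0), args(1), and args(2) as described above:

spark-submit \
  --master spark://node-1:7077 \
  --class day04.FavoriteTeacher03 \
  favorite-teacher.jar \
  false hdfs://node-1:9000/data/teacher.log 2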