spark分析评论 spark案例分析

转载

代码工匠大师 2023-11-06 18:42:48

文章标签 spark分析评论 spark 大数据 ci 数据 文章分类 Spark 大数据

数据说明

在前面的博客中已经介绍了了 Spark 的基础编程方式，接下来，再看下在实际的工作中如何使用这些 API 实现具体的需求。这些需求是电商网站的真实需求，所以在实现功能前，先将数据准备好。

数据文档链接提取码：xzc6

spark分析评论 spark案例分析_spark

上面的数据图是从数据文件中截取的一部分内容，表示为电商网站的用户行为数据，主要包含用户的4种行为：搜索，点击，下单，支付。数据规则如下：

（1）数据文件中每行数据采用下划线分隔数据

（2）每一行数据表示用户的一次行为，这个行为只能是4 种行为的一种

（3）如果搜索关键字为 null,表示数据不是搜索数据

（4）如果点击的品类 ID 和产品ID 为-1，表示数据不是点击数据

（5）针对于下单行为，一次可以下单多个商品，所以品类 ID 和产品ID 可以是多个，id 之间采用逗号分隔，如果本次不是下单行为，则数据采用 null 表示

（6）支付行为和下单行为类似

详细字段说明：

spark分析评论 spark案例分析_ci_02

spark分析评论 spark案例分析_spark_03

根据上面的字段，自定义样例类：

//用户访问动作表 
case class UserVisitAction( 
	 date: String,//用户点击行为的日期 
	 user_id: Long,//用户的 ID 
	 session_id: String,//Session 的 ID 
	 page_id: Long,//某个页面的 ID 
	 action_time: String,//动作的时间点 
	 search_keyword: String,//用户搜索的关键词 
	 click_category_id: Long,//某一个商品品类的 ID 
	 click_product_id: Long,//某一个商品的 ID 
	 order_category_ids: String,//一次订单中所有品类的 ID 集合 
	 order_product_ids: String,//一次订单中所有商品的 ID 集合 
	 pay_category_ids: String,//一次支付中所有品类的 ID 集合 
	 pay_product_ids: String,//一次支付中所有商品的 ID 集合 
     city_id: Long 
)//城市 id

需求1：Top10 热门品类

需求说明

品类是指产品的分类，大型电商网站品类分多级，咱们的项目中品类只有一级，不同的公司可能对热门的定义不一样。我们按照每个品类的点击、下单、支付的量来统计热门品类。

鞋 	 	点击数 	下单数	 支付数 
衣服 	点击数 	下单数 	 支付数 
电脑 	点击数 	下单数 	 支付数

例如，综合排名 = 点击数20%+下单数30%+支付数*50%

本项目需求优化为：先按照点击数排名，靠前的就排名高；如果点击数相同，再比较下单数；下单数再相同，就比较支付数。

实现方案一

需求分析

分别统计每个品类点击的次数，下单的次数和支付的次数：
（品类，点击总数）（品类，下单总数）（品类，支付总数）

需求实现

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark01_Req1_HotCateforyTop10Analysis {

  def main(args: Array[String]): Unit = {

    //TODO : Top热门品类
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HotCateforyTop10Analysis")
    val sc = new SparkContext(sparkConf)


    //1. 读取原始日志数据
    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")

    //2. 统计品类的点击数量（品类ID，点击数量）
    val clickActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas: Array[String] = action.split("_")
        datas(6) != "-1"
      }
    )

    val clickCountRDD: RDD[(String, Int)] = clickActionRDD.map(
      action => {
        val datas: Array[String] = action.split("_")
        (datas(6), 1)
      }
    ).reduceByKey(_ + _)

    //3. 统计品类的下单数量（品类ID，下单数量）
    val orderActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas: Array[String] = action.split("_")
        datas(8) != "null"
      }
    )

    val orderCountRDD: RDD[(String, Int)] = orderActionRDD.flatMap(
      action => {
        val datas: Array[String] = action.split("_")
        val cid = datas(8)
        val cids = cid.split(",")

        //扁平化
        cids.map(id => (id, 1))

      }
    ).reduceByKey(_ + _)


    //4. 统计品类的支付数量（品类ID，支付数量）
    val payActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas: Array[String] = action.split("_")
        datas(10) != "null"
      }
    )

    val payCountRDD: RDD[(String, Int)] = payActionRDD.flatMap(
      action => {
        val datas: Array[String] = action.split("_")
        val cid = datas(10)
        val cids = cid.split(",")

        //扁平化
        cids.map(id => (id, 1))

      }
    ).reduceByKey(_ + _)

    //5.将品类进行排序，并且取前十名
    // 点击数量排序，下单数量的排序，支付数量排序
    // 元组排序：先比较第一个，在比较第二个，依次类推
    // （品类ID， （点击数量，下单数量，支付数量））
    //join(不可以),zip（不可以）,leftOutJoin（不可以）,cogroup （可以）连接数据

    val cogroupRDD: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] =
      clickCountRDD.cogroup(orderCountRDD, payCountRDD)

    val analysisRDD = cogroupRDD.mapValues{
      case (clickIter, orderIter, payIter) => {

        var clickCnt = 0
        val iter1 = clickIter.iterator
        if (iter1.hasNext){
          clickCnt = iter1.next()
        }
        var orderCnt = 0
        val iter2 = orderIter.iterator
        if (iter2.hasNext){
          orderCnt = iter2.next()
        }
        var payCnt = 0
        val iter3 = payIter.iterator
        if (iter3.hasNext){
          payCnt = iter3.next()
        }
        (clickCnt,orderCnt,payCnt)
      }
    }

    val resultRDD: Array[(String, (Int, Int, Int))] = analysisRDD.sortBy(_._2, false).take(10)

    //6.将结果采集到控制台

    resultRDD.foreach(println)

    sc.stop()

  }

}

实现方案二

需求分析

一次性统计每个品类点击的次数，下单的次数和支付的次数：
（品类，（点击总数，下单总数，支付总数））

需求实现

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark02_Req1_HotCateforyTop10Analysis1 {

  def main(args: Array[String]): Unit = {

    //TODO : Top热门品类
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark02_Req1_HotCateforyTop10Analysis1")
    val sc = new SparkContext(sparkConf)

    //Q： actionRDD重复使用
    //Q：cogr性能可能较低


    //1. 读取原始日志数据
    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")
    actionRDD.cache() //解决重复使用

    //2. 统计品类的点击数量（品类ID，点击数量）
    val clickActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas: Array[String] = action.split("_")
        datas(6) != "-1"
      }
    )

    val clickCountRDD: RDD[(String, Int)] = clickActionRDD.map(
      action => {
        val datas: Array[String] = action.split("_")
        (datas(6), 1)
      }
    ).reduceByKey(_ + _)

    //3. 统计品类的下单数量（品类ID，下单数量）
    val orderActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas: Array[String] = action.split("_")
        datas(8) != "null"
      }
    )

    val orderCountRDD: RDD[(String, Int)] = orderActionRDD.flatMap(
      action => {
        val datas: Array[String] = action.split("_")
        val cid = datas(8)
        val cids = cid.split(",")

        //扁平化
        cids.map(id => (id, 1))

      }
    ).reduceByKey(_ + _)


    //4. 统计品类的支付数量（品类ID，支付数量）
    val payActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas: Array[String] = action.split("_")
        datas(10) != "null"
      }
    )

    val payCountRDD: RDD[(String, Int)] = payActionRDD.flatMap(
      action => {
        val datas: Array[String] = action.split("_")
        val cid = datas(10)
        val cids = cid.split(",")

        //扁平化
        cids.map(id => (id, 1))

      }
    ).reduceByKey(_ + _)


    //5.将品类进行排序，并且取前十名
    // 点击数量排序，下单数量的排序，支付数量排序
    // 元组排序：先比较第一个，在比较第二个，依次类推
    // （品类ID， （点击数量，下单数量，支付数量））
    //join(不可以),zip（不可以）,leftOutJoin（不可以）,cogroup （可以）连接数据

    //cogroup有可能存在shuffle

    //换一个方式实现
     （品类ID， 点击数量）=>（品类ID， （点击数量））=>（品类ID， （点击数量，0，0））
     （品类ID， 下单数量）=>（品类ID， （下单数量））=>（品类ID， （0，下单数量，0））
     （品类ID， 支付数量）=>（品类ID， （支付数量））=>（品类ID， （0，0，支付数量））
    //之后再两两聚合

    val rdd1 = clickCountRDD.map{
      case (cid, cnt) => {
        (cid, (cnt,0,0))
      }
    }

    val rdd2 = orderCountRDD.map{
      case (cid, cnt) => {
        (cid, (0,cnt,0))
      }
    }

    val rdd3 = payCountRDD.map{
      case (cid, cnt) => {
        (cid, (0,0,cnt))
      }
    }

    //将三个数据源合并在一起，统一进行聚合计算
    val sourceRDD: RDD[(String, (Int, Int, Int))] = rdd1.union(rdd2).union(rdd3)

    val analysisRDD: RDD[(String, (Int, Int, Int))] = sourceRDD.reduceByKey(
      (t1, t2) => {
        (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
      }
    )

    val resultRDD: Array[(String, (Int, Int, Int))] = analysisRDD.sortBy(_._2, false).take(10)

    //6.将结果采集到控制台

    resultRDD.foreach(println)

    sc.stop()

  }

}

实现方案三

需求分析

使用累加器的方式聚合数据

需求实现

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object Spark04_Req1_HotCateforyTop10Analysis3 {

  def main(args: Array[String]): Unit = {

    //TODO : Top热门品类
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark03_Req1_HotCateforyTop10Analysis2")
    val sc = new SparkContext(sparkConf)

    //Q：存在shuffle操作（reduceByKey）
    //使用累加器


    //1. 读取原始日志数据
    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")

    val acc = new HotCategoryAccumulator
    sc.register(acc,"HotCategory")
    //2. 将数据转换结构

    actionRDD.foreach(
      action => {
        val datas: Array[String] = action.split("_")

        if (datas(6) != "-1") {
          //点击场合
          acc.add((datas(6),"click"))
        } else if (datas(8) != "null") {
          //下单的场合
          val ids: Array[String] = datas(8).split(",")
          ids.foreach(
            id => {
              acc.add((id,"order"))
            }
          )

        } else if (datas(10) != "null") {
          //支付的场合
          val ids = datas(10).split(",")
          ids.foreach(
            id => {
              acc.add((id,"order"))
            }
          )
        }
      }
    )

    val accVal: mutable.Map[String, HotCategory] = acc.value

    val categories: mutable.Iterable[HotCategory] = accVal.map(_._2)

    val sort: List[HotCategory] = categories.toList.sortWith(
      (left, right) => {
        if (left.clickCnt > right.clickCnt) {
          true
        } else if (left.clickCnt == right.clickCnt) {
          if (left.orderCnt > right.orderCnt) {
            true
          } else if (left.orderCnt == right.orderCnt) {
            if (left.parCnt >= right.parCnt) {
              true
            } else {
              false
            }
          }
          else {
            false
          }
        }
        else {
          false
        }
      }
    )

    //5.将结果采集到控制台

    sort.take(10).foreach(println)

    sc.stop()

  }

  case class HotCategory(cid: String, var clickCnt: Int, var orderCnt: Int, var parCnt: Int)
  /**
   * 自定义累加器
   * 1.继承AccumulatorV2，定义泛型
   *    IN: （品类ID，行为类型）
   *    OUT：mutable.Map[String,HotCategory]
   *
   * 2. 重写方法（6个）
   */

  class HotCategoryAccumulator extends AccumulatorV2[(String,String),mutable.Map[String,HotCategory]]{

    private val hcMap = mutable.Map[String,HotCategory]()

    override def isZero: Boolean = {
      hcMap.isEmpty
    }

    override def copy(): AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] = {
      new HotCategoryAccumulator
    }

    override def reset(): Unit = {
      hcMap.clear()
    }

    override def add(v: (String, String)): Unit = {
      val cid: String = v._1
      val actionType: String = v._2
      val category: HotCategory = hcMap.getOrElse(cid, HotCategory(cid, 0, 0, 0))

      if (actionType == "click"){
        category.clickCnt += 1
      }else if (actionType == "order"){
        category.orderCnt += 1
      }else if (actionType == "pay"){
        category.parCnt += 1
      }

      hcMap.update(cid,category)
    }

    override def merge(other: AccumulatorV2[(String, String), mutable.Map[String, HotCategory]]): Unit = {
      val map1 = this.hcMap
      val map2 = other.value

      map2.foreach{
        case (cid, hc) => {
          val category: HotCategory = map1.getOrElse(cid, HotCategory(cid, 0, 0, 0))
          category.clickCnt += hc.clickCnt
          category.orderCnt += hc.orderCnt
          category.parCnt += hc.parCnt
          map1.update(cid,category)
        }
      }
    }

    override def value: mutable.Map[String, HotCategory] = hcMap
  }
}

需求 2：Top10 热门品类中每个品类的 Top10 活跃Session 统计

需求说明

在需求一的基础上，增加每个品类用户session 的点击统计

需求分析

根据品类ID和Session进行点击量的统计，将统计的结果进行转换
（（品类ID，sessionID ），sum） => （品类ID，（sessionID， sum））

需求实现

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark05_Req2_HotCateforyTop10SessionAnalysis2 {

  def main(args: Array[String]): Unit = {

    //TODO : Top热门品类
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark05_Req2_HotCateforyTop10SessionAnalysis2")
    val sc = new SparkContext(sparkConf)


    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")
    actionRDD.cache()

    val top10Ids: Array[String] = top10Category(actionRDD)


    //1. 过滤原始数据，保留点击和前10品类ID

    val filterActionRDD: RDD[String] = actionRDD.filter(
      action => {
        val datas = action.split("_")
        if (datas(6) != "-1") {
          top10Ids.contains(datas(6))
        } else {
          false
        }
      }
    )

    //2. 根据品类ID和Session进行点击量的统计

    val reduceRDD: RDD[((String, String), Int)] = filterActionRDD.map(
      action => {
        val datas = action.split("_")

        ((datas(6), datas(2)), 1)
      }
    ).reduceByKey(_ + _)

  //3.将统计的结果进行转换
    //（（品类ID，sessionID ），sum） => （品类ID， （sessionID， sum））

    val mapRDD = reduceRDD.map {
      case ((cid, sid),sum) => {
        (cid, (sid,sum))
      }
    }

    //4. 相同的品类进行分组
    val groupRDD: RDD[(String, Iterable[(String, Int)])] = mapRDD.groupByKey()

    //5. 将分组后的数据进行点击量的排序，取前10名
    val resultRDD: RDD[(String, List[(String, Int)])] = groupRDD.mapValues(
      iter => {
        iter.toList.sortBy(_._2)(Ordering.Int.reverse).take(10)
      }
    )

    resultRDD.collect().foreach(println)

    sc.stop()

  }

  def top10Category(actionRDD: RDD[String])  = {
    val flatRDD: RDD[(String, (Int, Int, Int))] = actionRDD.flatMap(
      action => {
        val datas: Array[String] = action.split("_")

        if (datas(6) != "-1") {
          //点击场合
          List((datas(6), (1, 0, 0)))
        } else if (datas(8) != "null") {
          //下单的场合
          val ids: Array[String] = datas(8).split(",")
          ids.map(id => (id, (0, 1, 0)))

        } else if (datas(10) != "null") {
          //支付的场合
          val ids = datas(10).split(",")
          ids.map(id => (id, (0, 0, 1)))
        } else {
          Nil
        }
      }
    )



    val analysisRDD: RDD[(String, (Int, Int, Int))] = flatRDD.reduceByKey(
      (t1, t2) => {
        (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
      }
    )


    analysisRDD.sortBy(_._2, false).take(10).map(_._1)
  }

}

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。