Table of Contents

  • Output the IDs of students who scored 100 in a single subject
  • Merging multiple RDDs with union()
  • Filtering with filter()
  • Deduplicating with distinct()
  • Simple set operations
  • intersection()
  • subtract()
  • cartesian()
  • Task implementation
  • Creating the data RDDs
  • Filtering records with a score of 100 using filter and extracting student IDs with map
  • Merging all IDs with union and deduplicating with distinct
  • Output each student's total score across all subjects
  • Creating a key-value pair RDD
  • The keys and values transformations
  • The reduceByKey() transformation
  • The groupByKey() transformation
  • Task implementation
  • Output each student's average score
  • Joining two RDDs with join()
  • join
  • rightOuterJoin
  • leftOuterJoin
  • fullOuterJoin
  • Combining two RDDs with zip
  • Merging values with the same key using combineByKey
  • Looking up values for a given key with lookup
  • Task implementation


Output the IDs of students who scored 100 in a single subject

Merging multiple RDDs with union()

scala> val rdd1=sc.parallelize(List(('a',1),('b',2),('c',3)))
rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd2=sc.parallelize(List(('a',1),('d',4),('e',5)))
rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd1.union(rdd2).collect
res0: Array[(Char, Int)] = Array((a,1), (b,2), (c,3), (a,1), (d,4), (e,5))

Filtering with filter()

scala> val rdd1=sc.parallelize(List(('a',1),('b',2),('c',3)))
rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd1.filter(_._2>1).collect
res1: Array[(Char, Int)] = Array((b,2), (c,3))

scala> rdd1.filter(x=>x._2>1).collect
res2: Array[(Char, Int)] = Array((b,2), (c,3))

Deduplicating with distinct()

scala> val rdd=sc.makeRDD(List(('a',1),('b',1),('a',1),('c',1)))
rdd: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[6] at makeRDD at <console>:24

scala> rdd.distinct().collect
res3: Array[(Char, Int)] = Array((b,1), (a,1), (c,1))

Simple set operations

intersection()

intersection() returns the intersection of two RDDs, keeping only the elements that appear in both.

scala> val c_rdd1=sc.parallelize(List(('a',1),('b',1),('a',1),('c',1)))
c_rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24

scala> val c_rdd2=sc.parallelize(List(('a',1),('b',1),('d',1)))
c_rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:24

scala> c_rdd1.intersection(c_rdd2).collect
res4: Array[(Char, Int)] = Array((b,1), (a,1))

subtract()

subtract() removes from the first RDD the elements that also appear in the second RDD.

scala> val rdd1=sc.parallelize(List(('a',1),('b',1),('c',1)))
rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> val rdd2=sc.parallelize(List(('d',1),('e',1),('c',1)))
rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[21] at parallelize at <console>:24

scala> rdd1.subtract(rdd2).collect
res5: Array[(Char, Int)] = Array((b,1), (a,1))

scala> rdd2.subtract(rdd1).collect
res6: Array[(Char, Int)] = Array((d,1), (e,1))

cartesian()

cartesian() returns the Cartesian product of two RDDs, pairing every element of the first with every element of the second.

scala> val rdd1=sc.makeRDD(List(1,3,5,7))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at makeRDD at <console>:24

scala> val rdd2=sc.makeRDD(List(2,4,6,8))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at makeRDD at <console>:24

scala> rdd1.cartesian(rdd2).collect
res7: Array[(Int, Int)] = Array((1,2), (1,4), (3,2), (3,4), (1,6), (1,8), (3,6), (3,8), (5,2), (5,4), (7,2), (7,4), (5,6), (5,8), (7,6), (7,8))

scala> rdd2.cartesian(rdd1).collect
res8: Array[(Int, Int)] = Array((2,1), (2,3), (4,1), (4,3), (2,5), (2,7), (4,5), (4,7), (6,1), (6,3), (8,1), (8,3), (6,5), (6,7), (8,5), (8,7))
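The Cartesian product of an RDD with m elements and one with n elements contains m × n pairs, so it can get large quickly. A quick check of the size, as a sketch using the rdd1 and rdd2 defined above:

rdd1.cartesian(rdd2).count   // 4 * 4 = 16 pairs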

Task implementation

To find the IDs of students who scored 100 in a single subject, first filter each of the two RDDs for records with a score of 100, then extract the student IDs with map. Merge the IDs obtained from the two tables into one RDD with union and deduplicate with distinct; the result is the set of IDs of all students who scored 100 in at least one subject. The implementation is as follows.

Creating the data RDDs

scala> val bigdata=sc.textFile("/user/root/result_bigdata.txt").map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
bigdata: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[36] at map at <console>:24

scala> val math=sc.textFile("/user/root/result_math.txt").map{x=>val line=x.split("\t");(line(0),line(1),line(2).toInt)}
math: org.apache.spark.rdd.RDD[(String, String, Int)] = MapPartitionsRDD[39] at map at <console>:24

Filtering records with a score of 100 using filter and extracting student IDs with map

scala> val bigdata_ID=bigdata.filter(x=>x._3==100).map(x=>x._1)
bigdata_ID: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[41] at map at <console>:26

scala> val math_ID=math.filter(x=>x._3==100).map(x=>x._1)
math_ID: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[43] at map at <console>:26

Merging all IDs with union and deduplicating with distinct

scala> val id=bigdata_ID.union(math_ID).distinct()
id: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[47] at distinct at <console>:32

scala> id.collect
res9: Array[String] = Array(1003, 1007, 1004)

As shown above, the returned student IDs are 1003, 1007, and 1004.
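The same pipeline can also be written as a single chained expression. This is a minimal sketch assuming the bigdata and math RDDs created above are still in scope (id_chained is an illustrative name); it should give the same IDs as the step-by-step version.

// Filter for full marks, extract IDs, merge and deduplicate in one chain
val id_chained = bigdata.filter(_._3 == 100).map(_._1)
  .union(math.filter(_._3 == 100).map(_._1))
  .distinct()
id_chained.collect   // expected to contain 1003, 1007 and 1004 (order may vary)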

Output each student's total score across all subjects

Creating a key-value pair RDD

Given lines of English text, take the first word of each line as the key and the whole line as the value to build a pair RDD.

scala> val rdd=sc.parallelize(List("this is a test","how are you","I am fine","can you tell me"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[48] at parallelize at <console>:24

scala> val words=rdd.map(x=>(x.split(" ")(0),x));
words: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[49] at map at <console>:26

scala> words.collect
res10: Array[(String, String)] = Array((this,this is a test), (how,how are you), (I,I am fine), (can,can you tell me))

The keys and values transformations

scala> val key=words.keys
key: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at keys at <console>:28

scala> key.collect
res11: Array[String] = Array(this, how, I, can)

scala> val value=words.values
value: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[51] at values at <console>:28

scala> value.collect
res12: Array[String] = Array(this is a test, how are you, I am fine, can you tell me)

The reduceByKey() transformation

reduceByKey() merges the values of each key using the supplied function.

scala> val r_rdd=sc.parallelize(List(('a',1),('a',2),('b',1),('c',1),('c',1))).map(x=>(x._1,x._2))
r_rdd: org.apache.spark.rdd.RDD[(Char, Int)] = MapPartitionsRDD[53] at map at <console>:24

scala> val re_rdd=r_rdd.reduceByKey((a,b)=>a+b)
re_rdd: org.apache.spark.rdd.RDD[(Char, Int)] = ShuffledRDD[54] at reduceByKey at <console>:26

scala> re_rdd.collect
res13: Array[(Char, Int)] = Array((b,1), (a,3), (c,2))

The groupByKey() transformation

groupByKey() groups all values that share the same key.

scala> val g_rdd=r_rdd.groupByKey()
g_rdd: org.apache.spark.rdd.RDD[(Char, Iterable[Int])] = ShuffledRDD[55] at groupByKey at <console>:26

scala> g_rdd.collect
res14: Array[(Char, Iterable[Int])] = Array((b,CompactBuffer(1)), (a,CompactBuffer(1, 2)), (c,CompactBuffer(1, 1)))

scala> g_rdd.map(x=>(x._1,x._2.size)).collect
res15: Array[(Char, Int)] = Array((b,1), (a,2), (c,2))
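For comparison, the reduceByKey result above can be reproduced from the grouped RDD by summing each group; a sketch assuming the g_rdd defined above. In practice reduceByKey is usually preferred, since it combines values within each partition before the shuffle.

// Sum the values of each group; should match re_rdd.collect above
g_rdd.map(x => (x._1, x._2.sum)).collect   // expected: (b,1), (a,3), (c,2) in some order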

Task implementation

Union the two subject RDDs, map each record to an (ID, score) pair, and sum the scores per student with reduceByKey.

scala> val all_score=bigdata union math
all_score: org.apache.spark.rdd.RDD[(String, String, Int)] = UnionRDD[57] at union at <console>:28

scala> val score=all_score.map(x=>(x._1,x._3)).reduceByKey((a,b)=>a+b)
score: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[59] at reduceByKey at <console>:30

scala> score.collect
res16: Array[(String, Int)] = Array((1005,184), (1012,175), (1001,186), (1009,173), (1002,188), (1006,174), (1010,164), (1003,200), (1007,190), (1008,187), (1011,170), (1004,199))

Output each student's average score

Joining two RDDs with join()

join

join performs an inner join on two RDDs: only keys present in both RDDs are kept.

scala> val rdd1=sc.parallelize(List(('a',1),('b',2),('c',3)))
rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[60] at parallelize at <console>:24

scala> val rdd2=sc.parallelize(List(('a',1),('d',4),('e',5)))
rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[61] at parallelize at <console>:24

scala> val j_rdd=rdd1.join(rdd2)
j_rdd: org.apache.spark.rdd.RDD[(Char, (Int, Int))] = MapPartitionsRDD[64] at join at <console>:28

scala> j_rdd.collect
res17: Array[(Char, (Int, Int))] = Array((a,(1,1)))

rightOuterJoin

rightOuterJoin performs a right outer join: every key of the second RDD is kept, and values missing from the first RDD appear as None.

scala> val right_join=rdd1 rightOuterJoin rdd2
right_join: org.apache.spark.rdd.RDD[(Char, (Option[Int], Int))] = MapPartitionsRDD[67] at rightOuterJoin at <console>:28

scala> right_join.collect
res18: Array[(Char, (Option[Int], Int))] = Array((d,(None,4)), (e,(None,5)), (a,(Some(1),1)))

leftOuterJoin

leftOuterJoin performs a left outer join: every key of the first RDD is kept, and values missing from the second RDD appear as None.

scala> val left_join=rdd1 leftOuterJoin rdd2
left_join: org.apache.spark.rdd.RDD[(Char, (Int, Option[Int]))] = MapPartitionsRDD[70] at leftOuterJoin at <console>:28

scala> left_join.collect
res19: Array[(Char, (Int, Option[Int]))] = Array((b,(2,None)), (a,(1,Some(1))), (c,(3,None)))

fullOuterJoin

fullOuterJoin performs a full outer join on two RDDs: keys from either RDD are kept, with missing values appearing as None.

scala> val full_join=rdd1 fullOuterJoin rdd2
full_join: org.apache.spark.rdd.RDD[(Char, (Option[Int], Option[Int]))] = MapPartitionsRDD[73] at fullOuterJoin at <console>:28

scala> full_join.collect
res20: Array[(Char, (Option[Int], Option[Int]))] = Array((d,(None,Some(4))), (b,(Some(2),None)), (e,(None,Some(5))), (a,(Some(1),Some(1))), (c,(Some(3),None)))
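The Option values returned by the outer joins can be unpacked with getOrElse or pattern matching. A minimal sketch, assuming the left_join RDD above, that substitutes 0 for a missing right-hand value:

// Replace a missing right-hand value with 0 before adding
left_join.map { case (k, (v, opt)) => (k, v + opt.getOrElse(0)) }.collect
// expected: (b,2), (a,2), (c,3) in some order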

Combining two RDDs with zip

zip requires that both RDDs have the same number of partitions and the same number of elements; otherwise an exception is thrown.

scala> var rdd1=sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[74] at makeRDD at <console>:24

scala> var rdd2=sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[75] at makeRDD at <console>:24

scala> rdd1.zip(rdd2).collect
res21: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))

scala> rdd2.zip(rdd1).collect
res22: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))

Merging values with the same key using combineByKey

Compute the average value for each key in a dataset that contains multiple pairs with the same key.

scala> val test=sc.parallelize(List(("panda",1),("panda",8),("pink",4),("pink",8),("pirate",5)))
test: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[79] at parallelize at <console>:24

scala> val cb_test=test.combineByKey(
     | count=>(count,1),
     | (acc:(Int,Int),count)=>(acc._1+count,acc._2+1),
     | (acc1:(Int,Int),acc2:(Int,Int))=>(acc1._1+acc2._1,acc1._2+acc2._2))
cb_test: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[80] at combineByKey at <console>:26

scala> cb_test.map(x=>(x._1,x._2._1.toDouble/x._2._2)).collect
res24: Array[(String, Double)] = Array((panda,4.5), (pink,6.0), (pirate,5.0))
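The three arguments of combineByKey are a createCombiner function applied to the first value seen for a key in a partition, a mergeValue function that folds further values of that key into the existing combiner, and a mergeCombiners function that merges combiners produced on different partitions. The same computation as above with the arguments commented, as a sketch (cb_commented is an illustrative name):

val cb_commented = test.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: start a (sum, count) pair
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the pair
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge per-partition pairs
cb_commented.map(x => (x._1, x._2._1.toDouble / x._2._2)).collect   // same averages as above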

Looking up values for a given key with lookup

scala> test.lookup("panda")
res25: Seq[Int] = WrappedArray(1, 8)

Task implementation

Reload the two score files as (ID, score) pairs, union them, and compute each student's average score with combineByKey.

scala> val bigdata=sc.textFile("/user/root/result_bigdata.txt").map{x=>val line=x.split("\t");(line(0),line(2).toInt)}
bigdata: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[86] at map at <console>:24

scala> val math=sc.textFile("/user/root/result_math.txt").map{x=>val line=x.split("\t");(line(0),line(2).toInt)}
math: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[89] at map at <console>:24

scala> val scores=bigdata.union(math).map(x=>(x._1,x._2))
scores: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[91] at map at <console>:28

scala> val cb_score=scores.combineByKey(
     | count=>(count,1),
     | (acc:(Int,Int),count)=>(acc._1+count,acc._2+1),
     | (acc1:(Int,Int),acc2:(Int,Int))=>(acc1._1+acc2._1,acc1._2+acc2._2))
cb_score: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[96] at combineByKey at <console>:30

scala> val avg_score=cb_score.map(x=>(x._1,x._2._1.toDouble/x._2._2))
avg_score: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[97] at map at <console>:32

scala> avg_score.collect
res30: Array[(String, Double)] = Array((1005,92.0), (1012,87.5), (1001,93.0), (1009,86.5), (1002,94.0), (1006,87.0), (1010,82.0), (1003,100.0), (1007,95.0), (1008,93.5), (1011,85.0), (1004,99.5))
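The same averages could also be computed without combineByKey, by attaching a count of 1 to every score and reducing by key; a sketch assuming the scores RDD defined above (avg2 is an illustrative name).

// Turn every score into a (sum, count) pair, then add the pairs per key
val avg2 = scores.mapValues(s => (s, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues(t => t._1.toDouble / t._2)
avg2.collect   // should match avg_score.collect above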