1. Window (OVER) Functions

row_number() over (partition by XXX order by XXX): generates consecutive sequence numbers within each partition, starting at 1 in every partition; rows with the same ordering value still receive different numbers.
rank() over (partition by XXX order by XXX): generates sequence numbers that are not necessarily consecutive within each partition, starting at 1; rows with the same ordering value receive the same number, and a gap follows the tie.
dense_rank() over (partition by XXX order by XXX): generates consecutive sequence numbers within each partition, starting at 1; rows with the same ordering value receive the same number, and the numbering continues without gaps.
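To make the difference concrete, here is a minimal, self-contained sketch (the `demo` view and its four sample rows are made up purely for illustration) that runs the three functions side by side on a tiny dataset containing a tie:

```scala
import org.apache.spark.sql.SparkSession

object RankDemo {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().master("local").appName("rankDemo").getOrCreate()
    import session.implicits._

    // Tiny illustrative dataset with a tie on pr = 300.0 inside category A
    Seq(("1", "A", 100.0), ("2", "A", 300.0), ("3", "A", 300.0), ("4", "A", 200.0))
      .toDF("dt", "tp", "pr")
      .createTempView("demo")

    // Compare the three ranking functions over the same partition and ordering
    session.sql(
      """
        |select dt, tp, pr,
        |  row_number() over(partition by tp order by pr desc) as rn,  -- 1,2,3,4: ties get different numbers
        |  rank()       over(partition by tp order by pr desc) as rk,  -- 1,1,3,4: ties share a number, gap follows
        |  dense_rank() over(partition by tp order by pr desc) as drk  -- 1,1,2,3: ties share a number, no gap
        |from demo
        |""".stripMargin).show()
  }
}
```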

  1. Data source
1 A 1
1 A 2
1 B 3
1 A 4
1 C 5
1 B 6
1 C 7
1 A 8
1 D 9
1 F 10
2 B 11
2 D 12
2 A 13
2 E 14
2 F 15
2 F 16
2 A 17
2 G 18
2 B 19
2 C 20
3 B 21
3 F 22
3 A 23
3 E 24
3 G 25
3 E 26
3 F 27
3 A 28
3 C 29
3 G 30
4 F 31
4 B 32
4 D 33
4 E 34
4 E 35
4 B 36
4 A 37
4 F 38
4 F 39
4 E 40
5 G 41
5 B 42
5 F 43
5 F 44
5 A 45
5 E 46
5 G 47
5 E 48
5 F 49
5 B 50
6 C 51
6 F 52
6 B 53
6 A 54
6 D 55
6 F 56
6 B 57
6 A 58
6 F 59
6 B 60
7 A 61
7 F 62
7 E 63
7 G 64
7 G 65
7 G 66
7 B 67
7 F 68
7 E 69
7 F 70
8 E 71
8 F 72
8 A 73
8 C 74
8 G 75
8 D 76
8 A 77
8 E 78
8 D 79
8 A 80
9 C 81
9 B 82
9 E 83
9 F 84
9 E 85
9 C 86
9 F 87
9 A 88
9 G 89
9 E 90
1 G 91
2 A 99
3 D 93
4 E 94
5 C 95
6 F 96
7 E 97
8 B 98
9 A 99
  2. Source code
package sparkSql

import org.apache.spark.sql.SparkSession

/**
 * @Author yqq
 * @Date 2021/12/14 16:00
 * @Version 1.0
 * Find the top three prices in each category, together with the corresponding dates.
 * OVER window functions:
 * row_number() over(partition by X1 order by X2) as rank
 *   Groups the rows by X1, sorts each group by X2, and numbers the rows in each group.
 *   The numbers are consecutive, start at 1 in every group, and tied rows get different numbers.
 * rank() over(partition by X1 order by X2) as rank
 *   Groups the rows by X1, sorts each group by X2, and numbers the rows in each group.
 *   The numbers start at 1 in every group; tied rows get the same number, so gaps follow ties.
 * dense_rank() over(partition by X1 order by X2) as rank
 *   Groups the rows by X1, sorts each group by X2, and numbers the rows in each group.
 *   The numbers start at 1 in every group; tied rows get the same number and the numbering stays consecutive.
 *
 * Expected shape of the result (row_number per category, price descending):
 * 1 A 100 ---1
 * 2 A 20  ---2
 * 7 A 18  ---3
 * 9 A 4   ---4
 * 4 B 500 ---1
 * 8 B 300 ---2
 * 2 B 200 ---3
 * 5 B 50  ---4
 **/
case class SaleInfo(dt: String, tp: String, pr: Double)

object SparkSQLOverFun {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().master("local").appName("test03").getOrCreate()
    val context = session.sparkContext
    context.setLogLevel("Error")
    val lines = context.textFile("data/sales")
    import session.implicits._
    // Parse each tab-separated line into (date, type, price) and register it as a temp view
    lines.map(line => {
      val arr = line.split("\t")
      SaleInfo(arr(0), arr(1), arr(2).toDouble)
    }).toDF().createTempView("t")

    /**
     * (A) dense_rank() over(partition by tp order by pr desc) as rank
     */
    session.sql(
      """
        |select
        | dt,tp,pr,dense_rank() over(partition by tp order by pr desc) as rank
        |from t order by tp
        |""".stripMargin).show(10)

    /**
     * (B) rank() over(partition by tp order by pr desc) as rank
     */
    session.sql(
      """
        |select
        | dt,tp,pr,rank() over(partition by tp order by pr desc) as rank
        |from t order by tp
        |""".stripMargin).show(10)

    /**
     * (C) row_number() over(partition by tp order by pr desc) as rank
     * The window expression is computed in a subquery because a window function
     * cannot be referenced directly in a WHERE clause.
     */
    session.sql(
      """
        |select dt,tp,pr from
        |(select
        | dt,tp,pr,row_number() over(partition by tp order by pr desc) as rank
        |from t) temp
        |where temp.rank <= 3
        |""".stripMargin).show(100)
  }
}

(A)

+---+---+----+----+
| dt| tp| pr|rank|
+---+---+----+----+
| 5| A|45.0| 9|
| 8| A|73.0| 5|
| 6| A|54.0| 8|
| 9| A|88.0| 2|
| 8| A|77.0| 4|
| 7| A|61.0| 6|
| 6| A|58.0| 7|
| 2| A|99.0| 1|
| 9| A|99.0| 1|
| 8| A|80.0| 3|
+---+---+----+----+

(B)

+---+---+----+----+
| dt| tp| pr|rank|
+---+---+----+----+
| 5| A|45.0| 10|
| 8| A|73.0| 6|
| 6| A|54.0| 9|
| 9| A|88.0| 3|
| 8| A|77.0| 5|
| 7| A|61.0| 7|
| 6| A|58.0| 8|
| 2| A|99.0| 1|
| 9| A|99.0| 1|
| 8| A|80.0| 4|
+---+---+----+----+
only showing top 10 rows

(C)

+---+---+----+
| dt| tp| pr|
+---+---+----+
| 6| F|96.0|
| 9| F|87.0|
| 9| F|84.0|
| 7| E|97.0|
| 4| E|94.0|
| 9| E|90.0|
| 8| B|98.0|
| 9| B|82.0|
| 7| B|67.0|
| 3| D|93.0|
| 8| D|79.0|
| 8| D|76.0|
| 5| C|95.0|
| 9| C|86.0|
| 9| C|81.0|
| 2| A|99.0|
| 9| A|99.0|
| 9| A|88.0|
| 1| G|91.0|
| 9| G|89.0|
| 8| G|75.0|
+---+---+----+
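The same top-three-per-category result can also be expressed with the DataFrame API instead of an SQL string. The following is a minimal sketch, assuming the SparkSession `session` and the temp view `t` registered in the code above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Equivalent of query (C) using the DataFrame API instead of an SQL string.
val w = Window.partitionBy("tp").orderBy(col("pr").desc)

session.table("t")
  .withColumn("rank", row_number().over(w)) // number rows within each category, highest price first
  .filter(col("rank") <= 3)                 // keep only the top three rows per category
  .select("dt", "tp", "pr")
  .show(100)
```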

2. Using Window Functions on Hive from a Local Program

  1. Because operating on Hive creates the table as files on HDFS, the corresponding two configuration files must be added to the project.
    (screenshot: the two configuration files added to the project)
  2. Source code
package sparkSql

import org.apache.spark.sql.SparkSession

/**
 * @Author yqq
 * @Date 2021/12/14 18:00
 * @Version 1.0
 */
object SparkSQLOverHive {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local")
      .config("hive.metastore.uris", "thrift://192.168.134.103:9083")
      .appName("t1")
      .enableHiveSupport()
      .getOrCreate()
    session.sparkContext.setLogLevel("Error")
    session.sql("use spark")

    // Create the Hive table backing the sales data (tab-separated columns)
    session.sql(
      """
        |create table sales(dt string,tp string,pr double) row format delimited fields terminated by '\t'
        |""".stripMargin)

    // Load the local data file into the Hive table
    session.sql(
      """
        |load data local inpath "data/sales" into table sales
        |""".stripMargin)

    // Number the rows of each category by price, highest first
    session.sql(
      """
        |select dt,tp,pr,row_number() over(partition by tp order by pr desc) as rank
        |from sales
        |""".stripMargin).show()
  }
}
+---+---+----+----+
| dt| tp| pr|rank|
+---+---+----+----+
| 6| F|96.0| 1|
| 9| F|87.0| 2|
| 9| F|84.0| 3|
| 8| F|72.0| 4|
| 7| F|70.0| 5|
| 7| F|68.0| 6|
| 7| F|62.0| 7|
| 6| F|59.0| 8|
| 6| F|56.0| 9|
| 6| F|52.0| 10|
| 5| F|49.0| 11|
| 5| F|44.0| 12|
| 5| F|43.0| 13|
| 4| F|39.0| 14|
| 4| F|38.0| 15|
| 4| F|31.0| 16|
| 3| F|27.0| 17|
| 3| F|22.0| 18|
| 2| F|16.0| 19|
| 2| F|15.0| 20|
+---+---+----+----+
only showing top 20 rows
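If, as in section 1, only the top three prices per category are needed, the same subquery filter can be applied to the Hive table. A minimal sketch, reusing the `session` from the code above:

```scala
// Top three prices per category from the Hive table, mirroring query (C) above.
session.sql(
  """
    |select dt,tp,pr from
    |(select
    | dt,tp,pr,row_number() over(partition by tp order by pr desc) as rank
    |from sales) temp
    |where temp.rank <= 3
    |""".stripMargin).show(100)
```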
  3. Checking the table on the Hive side
```sql
hive> select * from sales;
OK
1 A 1.0
1 A 2.0
1 B 3.0
1 A 4.0
1 C 5.0
1 B 6.0
1 C 7.0
1 A 8.0
1 D 9.0
1 F 10.0
2 B 11.0
2 D 12.0
2 A 13.0
2 E 14.0
2 F 15.0
2 F 16.0
2 A 17.0
2 G 18.0
2 B 19.0
2 C 20.0
3 B 21.0
3 F 22.0
3 A 23.0
3 E 24.0
3 G 25.0
3 E 26.0
3 F 27.0
3 A 28.0
3 C 29.0
3 G 30.0
4 F 31.0
4 B 32.0
4 D 33.0
4 E 34.0
4 E 35.0
4 B 36.0
4 A 37.0
4 F 38.0
4 F 39.0
4 E 40.0
5 G 41.0
5 B 42.0
5 F 43.0
5 F 44.0
5 A 45.0
5 E 46.0
5 G 47.0
5 E 48.0
5 F 49.0
5 B 50.0
6 C 51.0
6 F 52.0
6 B 53.0
6 A 54.0
6 D 55.0
6 F 56.0
6 B 57.0
6 A 58.0
6 F 59.0
6 B 60.0
7 A 61.0
7 F 62.0
7 E 63.0
7 G 64.0
7 G 65.0
7 G 66.0
7 B 67.0
7 F 68.0
7 E 69.0
7 F 70.0
8 E 71.0
8 F 72.0
8 A 73.0
8 C 74.0
8 G 75.0
8 D 76.0
8 A 77.0
8 E 78.0
8 D 79.0
8 A 80.0
9 C 81.0
9 B 82.0
9 E 83.0
9 F 84.0
9 E 85.0
9 C 86.0
9 F 87.0
9 A 88.0
9 G 89.0
9 E 90.0
1 G 91.0
2 A 99.0
3 D 93.0
4 E 94.0
5 C 95.0
6 F 96.0
7 E 97.0
8 B 98.0
9 A 99.0
Time taken: 0.145 seconds, Fetched: 99 row(s)
```