Spark Deep Dive (Part 9): Window Functions and Their APIs
Contents

1. Window Function Types
2. Usage
3. Data and Program
4. Results
1. Window Function Types

- ranking functions
- analytic functions
- aggregate functions
| Function Type | SQL | DataFrame API | Description |
| --- | --- | --- | --- |
| Ranking | rank | rank | rank values may have gaps: tied rows share a rank, then the count skips |
| Ranking | dense_rank | dense_rank | rank values are always consecutive, even after ties |
| Ranking | percent_rank | percent_rank | within a partition: (rank - 1) / (count(score) - 1) |
| Ranking | ntile | ntile | divides the partition's ordered rows into n roughly equal buckets and returns the bucket index, starting at 1 |
| Ranking | row_number | row_number | a plain sequential row number, like the row numbers in Excel |
| Analytic | cume_dist | cume_dist | cumulative distribution: the fraction of partition rows ordered at or before the current row |
| Analytic | first_value | first | the first value in the window frame |
| Analytic | last_value | last | the last value in the window frame |
| Analytic | lag | lag | the value from the row n rows before the current row |
| Analytic | lead | lead | the value from the row n rows after the current row |
| Aggregate | min | min | minimum |
| Aggregate | max | max | maximum |
| Aggregate | sum | sum | sum |
| Aggregate | avg | avg | average |
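The DataFrame API column above maps onto functions in `org.apache.spark.sql.functions` combined with a `Window` spec. A minimal, self-contained sketch (the data is inlined here rather than read from the JSON file used later; the object and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, rank, row_number}

object WindowApiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowApiDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the sample data in section 3.
    val student = Seq(
      ("A", "Math", 100), ("B", "Math", 100),
      ("C", "Math", 99),  ("D", "Math", 98)
    ).toDF("name", "lesson", "score")

    // Partition by lesson, order by score descending, as in the SQL examples.
    val w = Window.partitionBy("lesson").orderBy($"score".desc)

    student
      .withColumn("rank", rank().over(w))
      .withColumn("dense_rank", dense_rank().over(w))
      .withColumn("row_number", row_number().over(w))
      .show()

    spark.stop()
  }
}
```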
2. Usage

The common `over()` patterns in SQL:

```sql
count(...)        over(partition by ... order by ...)  -- row count within the partition
sum(...)          over(partition by ... order by ...)  -- sum within the partition
max(...)          over(partition by ... order by ...)  -- maximum within the partition
min(...)          over(partition by ... order by ...)  -- minimum within the partition
avg(...)          over(partition by ... order by ...)  -- average within the partition
rank()            over(partition by ... order by ...)  -- rank values may have gaps after ties
dense_rank()      over(partition by ... order by ...)  -- rank values are always consecutive
first_value(...)  over(partition by ... order by ...)  -- first value in the window frame
last_value(...)   over(partition by ... order by ...)  -- last value in the window frame
lag(..., n)       over(partition by ... order by ...)  -- value from the row n rows before the current row
lead(..., n)      over(partition by ... order by ...)  -- value from the row n rows after the current row
ratio_to_report(...) over(partition by ...)            -- the argument is the numerator, the group total the denominator
                                                       -- (Oracle-specific; in Spark SQL compute x / sum(x) over(...) instead)
percent_rank()    over(partition by ... order by ...)  -- (rank - 1) / (rows in partition - 1)
```
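One pitfall worth showing: with the default frame (unbounded preceding to the current row), `last_value` simply returns the current row's value, so the frame must be widened explicitly. A sketch against the `stu` view registered in section 3 below (the column aliases are illustrative):

```scala
spark.sql(
  """
    |select
    |  name, lesson, score,
    |  lag(score, 1)  over (partition by lesson order by score desc) as prev_score,
    |  lead(score, 1) over (partition by lesson order by score desc) as next_score,
    |  first_value(score) over (partition by lesson order by score desc) as top_score,
    |  -- without an explicit frame, last_value only sees rows up to the current one,
    |  -- so spell the frame out to get the true last value of the partition
    |  last_value(score) over (partition by lesson order by score desc
    |    rows between unbounded preceding and unbounded following) as bottom_score
    |from stu
    |""".stripMargin).show()
```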
3. Data and Program

Sample data: doc/day01/student.json (student name, lesson, score):

```json
{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
```
Maven dependency:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.3</version>
</dependency>
```
```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
  * @author: 余辉
  * @create: 2019-12-21 15:29
  * @description: runs the ranking window functions over the student data
  **/
object calculate {
  // Silence Spark's INFO/WARN logging so only the query output shows.
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("calculate")
      .master("local[*]")
      .getOrCreate()

    // Load the sample JSON and register it as a temporary view for SQL queries.
    val student: DataFrame = spark.read.json("doc/day01/student.json")
    student.createTempView("stu")

    // Every window below partitions by lesson and orders by score descending.
    spark.sql(
      """
        |select
        |  name, lesson, score,
        |  ntile(2)       over (partition by lesson order by score desc) as ntile_2,
        |  ntile(3)       over (partition by lesson order by score desc) as ntile_3,
        |  row_number()   over (partition by lesson order by score desc) as row_number,
        |  rank()         over (partition by lesson order by score desc) as rank,
        |  dense_rank()   over (partition by lesson order by score desc) as dense_rank,
        |  percent_rank() over (partition by lesson order by score desc) as percent_rank
        |from stu
        |order by lesson, name, score
        |""".stripMargin).show()

    spark.stop()
  }
}
```
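The program only exercises the ranking family. The aggregate variants from section 1 can be run the same way; a sketch that could be added inside `main` above, before `spark.stop()`, reusing the `stu` view (the aliases are illustrative):

```scala
spark.sql(
  """
    |select
    |  name, lesson, score,
    |  min(score)   over (partition by lesson) as min_score,
    |  max(score)   over (partition by lesson) as max_score,
    |  sum(score)   over (partition by lesson) as sum_score,
    |  avg(score)   over (partition by lesson) as avg_score,
    |  count(score) over (partition by lesson) as cnt
    |from stu
    |order by lesson, score desc
    |""".stripMargin).show()
```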
4. Results
```
+----+------+-----+-------+-------+----------+----+----------+------------------+
|name|lesson|score|ntile_2|ntile_3|row_number|rank|dense_rank|      percent_rank|
+----+------+-----+-------+-------+----------+----+----------+------------------+
|   A|     E|  100|      1|      1|         1|   1|         1|               0.0|
|   B|     E|   99|      1|      1|         2|   2|         2|0.3333333333333333|
|   C|     E|   99|      2|      2|         3|   2|         2|0.3333333333333333|
|   D|     E|   98|      2|      3|         4|   4|         3|               1.0|
|   A|  Math|  100|      1|      1|         1|   1|         1|               0.0|
|   B|  Math|  100|      1|      1|         2|   1|         1|               0.0|
|   C|  Math|   99|      2|      2|         3|   3|         2|0.6666666666666666|
|   D|  Math|   98|      2|      3|         4|   4|         3|               1.0|
+----+------+-----+-------+-------+----------+----+----------+------------------+
```

Note how ties play out: in Math, A and B tie at 100, so `rank` assigns 1, 1, 3, 4 (a gap after the tie) while `dense_rank` assigns 1, 1, 2, 3; `percent_rank` is (rank - 1) / (n - 1) with n = 4 rows per lesson.