Table of Contents

I. Types of Window Functions

II. Usage

III. Data and Program

IV. Results

I. Types of Window Functions

  1. ranking functions
  2. analytic functions
  3. aggregate functions

| Function Type | SQL          | DataFrame API | Description                                                                  |
|---------------|--------------|---------------|------------------------------------------------------------------------------|
| Ranking       | rank         | rank          | rank values may have gaps after ties                                         |
| Ranking       | dense_rank   | denseRank     | rank values are always consecutive                                           |
| Ranking       | percent_rank | percentRank   | within a partition: (rank - 1) / (count(score) - 1)                          |
| Ranking       | ntile        | ntile         | divides the partition's ordered rows into n roughly equal, contiguous buckets and returns the bucket index, starting from 1 |
| Ranking       | row_number   | rowNumber     | a plain sequential row number, like a row number in Excel                    |
| Analytic      | cume_dist    | cumeDist      | cumulative distribution: the fraction of partition rows ordered at or before the current row (ties included) |
| Analytic      | first_value  | firstValue    | first value in the partition's ordering (the minimum only when ordered ascending) |
| Analytic      | last_value   | lastValue     | last value in the partition's window frame (see the frame caveat in section II) |
| Analytic      | lag          | lag           | value from the row n rows before the current row                             |
| Analytic      | lead         | lead          | value from the row n rows after the current row                              |
| Aggregate     | min          | min           | minimum                                                                      |
| Aggregate     | max          | max           | maximum                                                                      |
| Aggregate     | sum          | sum           | sum                                                                          |
| Aggregate     | avg          | avg           | average                                                                      |

Note: the camelCase DataFrame API names (denseRank, percentRank, rowNumber, cumeDist, firstValue, lastValue) are the Spark 1.x spellings; since Spark 2.0 the functions object uses the snake_case names shown in the SQL column.
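For the DataFrame API, the same functions are applied through a WindowSpec built with Window.partitionBy(...).orderBy(...). Below is a minimal sketch against the student data introduced in section III, using the Spark 2.x snake_case names from org.apache.spark.sql.functions (the object name WindowApiDemo is illustrative, not from the original program):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, percent_rank, rank, row_number}

object WindowApiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowApiDemo")
      .master("local[*]")
      .getOrCreate()

    // One WindowSpec shared by every ranking function:
    // partition by lesson, order within each partition by score descending.
    val w = Window.partitionBy("lesson").orderBy(col("score").desc)

    spark.read.json("doc/day01/student.json")
      .withColumn("rank", rank().over(w))             // ties share a rank; later ranks skip
      .withColumn("dense_rank", dense_rank().over(w)) // ties share a rank; no gaps afterwards
      .withColumn("percent_rank", percent_rank().over(w))
      .withColumn("row_number", row_number().over(w)) // unique sequential number per row
      .show()

    spark.stop()
  }
}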

II. Usage

count(...) over(partition by ... order by ...) -- count within the partition.
sum(...) over(partition by ... order by ...) -- sum within the partition.
max(...) over(partition by ... order by ...) -- maximum within the partition.
min(...) over(partition by ... order by ...) -- minimum within the partition.
avg(...) over(partition by ... order by ...) -- average within the partition.
rank() over(partition by ... order by ...) -- rank values may have gaps after ties.
dense_rank() over(partition by ... order by ...) -- rank values are consecutive.
first_value(...) over(partition by ... order by ...) -- first value within the partition.
last_value(...) over(partition by ... order by ...) -- last value within the partition's frame.
lag(..., n) over(partition by ... order by ...) -- value from n rows before the current row.
lead(..., n) over(partition by ... order by ...) -- value from n rows after the current row.
ratio_to_report(...) over(partition by ...) -- the argument is the numerator, the sum over the over() partition is the denominator (an Oracle function, not available in Spark SQL).
percent_rank() over(partition by ... order by ...) -- (rank - 1) / (partition row count - 1).
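One caveat worth demonstrating: when the over() clause has an order by, the default window frame runs from the start of the partition only up to the current row, so last_value returns the current row's value rather than the partition's last value. An explicit frame fixes this. A sketch, assuming the stu temp view registered in section III below:

spark.sql(
  """
    |select
    |  name, lesson, score,
    |  lag(score, 1)  over (partition by lesson order by score desc) as prev_score,
    |  lead(score, 1) over (partition by lesson order by score desc) as next_score,
    |  first_value(score) over (partition by lesson order by score desc) as first_score,
    |  -- without this explicit frame, last_value would just echo the current row
    |  last_value(score) over (partition by lesson order by score desc
    |    rows between unbounded preceding and unbounded following) as last_score
    |from stu
    |""".stripMargin).show()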

III. Data and Program

Sample data: doc/day01/student.json — student name, lesson, and score:

{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}

Maven dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.3</version>
</dependency>

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
  * @author: 余辉
  * @create: 2019-12-21 15:29
  * @description: ranking window functions over the student data
  **/
object calculate {

  // Keep the console output readable: only ERROR-level Spark logs.
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("calculate")
      .master("local[*]")
      .getOrCreate()

    // Load the sample data and register it as a temp view for SQL.
    val student: DataFrame = spark.read.json("doc/day01/student.json")
    student.createTempView("stu")

    spark.sql(
      """
        |select
        |  name, lesson, score,
        |  ntile(2) over (partition by lesson order by score desc) as ntile_2,
        |  ntile(3) over (partition by lesson order by score desc) as ntile_3,
        |  row_number() over (partition by lesson order by score desc) as row_number,
        |  rank() over (partition by lesson order by score desc) as rank,
        |  dense_rank() over (partition by lesson order by score desc) as dense_rank,
        |  percent_rank() over (partition by lesson order by score desc) as percent_rank
        |from stu
        |order by lesson, name, score
        |""".stripMargin).show()

    spark.stop()
  }
}
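The program above only exercises the ranking functions. The aggregate window functions from the table in section I can be run against the same view; a sketch (when the over() clause has no order by, the frame is the whole partition, so every row sees the partition-wide aggregate):

spark.sql(
  """
    |select
    |  name, lesson, score,
    |  max(score)   over (partition by lesson) as max_score,
    |  min(score)   over (partition by lesson) as min_score,
    |  sum(score)   over (partition by lesson) as sum_score,
    |  avg(score)   over (partition by lesson) as avg_score,
    |  count(score) over (partition by lesson) as cnt
    |from stu
    |order by lesson, score desc
    |""".stripMargin).show()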

IV. Results

+----+------+-----+-------+-------+----------+----+----------+------------------+
|name|lesson|score|ntile_2|ntile_3|row_number|rank|dense_rank|      percent_rank|
+----+------+-----+-------+-------+----------+----+----------+------------------+
|   A|     E|  100|      1|      1|         1|   1|         1|               0.0|
|   B|     E|   99|      1|      1|         2|   2|         2|0.3333333333333333|
|   C|     E|   99|      2|      2|         3|   2|         2|0.3333333333333333|
|   D|     E|   98|      2|      3|         4|   4|         3|               1.0|
|   A|  Math|  100|      1|      1|         1|   1|         1|               0.0|
|   B|  Math|  100|      1|      1|         2|   1|         1|               0.0|
|   C|  Math|   99|      2|      2|         3|   3|         2|0.6666666666666666|
|   D|  Math|   98|      2|      3|         4|   4|         3|               1.0|
+----+------+-----+-------+-------+----------+----+----------+------------------+
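The percent_rank column can be checked by hand against the formula from section I: for lesson Math, C holds rank 3 in a 4-row partition, so percent_rank = (3 - 1) / (4 - 1) = 2/3 ≈ 0.6667, and D holds rank 4, giving (4 - 1) / (4 - 1) = 1.0, both matching the output. Likewise, ntile(3) splits each 4-row lesson into contiguous buckets of sizes 2, 1, 1, which is why the third row of each lesson already lands in bucket 2.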

