窗口函数(window functions)与普通聚合函数(aggregation functions)的区别
普通聚合函数是将多行具备相同属性的字段聚合为一行,处理的结果中不会对处理前的明细呈现出来。
而窗口函数可以认为是sql 增强函数,可以既展示集合后的数据结果,又可以展示聚合前的数据明细。
下面一个简单的例子对比下:
//表结构+-----------+------------+----------+| col_name | data_type | comment |+-----------+------------+----------+| expodate | string | || addcode | string | || sbtid | string | || rscode | int | || rsname | string | || scid | int | |+-----------+------------+----------+
select addcode,count(distinct scid) scid_num from tb_hive_window group by addcode;+----------+-----------+| addcode | scid_num |+----------+-----------+| 0002 | 2 || 000201 | 1 || 000202 | 1 || 000205 | 3 || 000206 | 2 || 000208 | 2 |+----------+-----------+
select expodate,addcode,scid,count(distinct scid) over(partition by addcode)as scid_num from tb_hive_window;+----------------------+----------+--------+-----------+| expodate | addcode | scid | scid_num |+----------------------+----------+--------+-----------+| 2019-12-29 18:02:02 | 0002 | 36253 | 2 || 2019-12-29 18:02:01 | 0002 | 36251 | 2 || 2019-12-29 18:02:01 | 000201 | 36251 | 1 || 2019-12-29 18:02:04 | 000202 | 36258 | 1 || 2019-12-29 18:02:04 | 000205 | 37253 | 3 || 2019-12-29 18:02:04 | 000205 | 35253 | 3 || 2019-12-29 18:02:04 | 000205 | 36252 | 3 || 2019-12-29 18:02:04 | 000206 | 36258 | 2 || 2019-12-29 18:02:04 | 000206 | 38257 | 2 || 2019-12-29 18:02:04 | 000208 | 36257 | 2 || 2019-12-29 18:02:04 | 000208 | 36253 | 2 |+----------------------+----------+--------+-----------+
通过对比上面的例子可以看出,通过窗口查询到了更加详细的明细数据,因此,根据不同的应用场景需求我们需要选择不同的函数处理方案。
sql 执行顺序
一条完整的select sql 语句时的执行顺序如下:
from... where...group by... having.... select ... over()... order by... limit
可以看出,窗口函数是在查询字段select后,order by 之前执行
普通聚合函数+over()
我们可以通过聚合函数sum()/count()/avg()/max()/min() 实现简单的计算
select addcode,rscode,count(*) over(partition by rscode order by rscode desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";select addcode,rscode,scid,avg(scid) over(partition by rscode order by rscode desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";select addcode,rscode,scid,sum(scid) over(partition by rscode order by rscode desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29"+----------+---------+--------+---------------+| addcode | rscode | scid | sum_window_0 |+----------+---------+--------+---------------+| 000205 | 395 | 35253 | 35253 || 0002 | 352 | 36253 | 36253 || 000208 | 351 | 36257 | 36257 || 000206 | 65 | 36258 | 36258 || 000206 | 40 | 38257 | 38257 || 000202 | 35 | 36258 | 182267 || 000208 | 35 | 36253 | 182267 || 000205 | 35 | 37253 | 182267 || 000205 | 35 | 36252 | 182267 || 000201 | 35 | 36251 | 182267 || 0002 | 34 | 36251 | 36251 |+----------+---------+--------+---------------+ select addcode,rscode,scid,sum(scid) over(partition by rscode order by scid desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29"+----------+---------+--------+---------------+| addcode | rscode | scid | sum_window_0 |+----------+---------+--------+---------------+| 0002 | 34 | 36251 | 36251 || 000205 | 35 | 37253 | 37253 || 000202 | 35 | 36258 | 73511 || 000208 | 35 | 36253 | 109764 || 000205 | 35 | 36252 | 146016 || 000201 | 35 | 36251 | 182267 || 000206 | 40 | 38257 | 38257 || 000206 | 65 | 36258 | 36258 || 000208 | 351 | 36257 | 36257 || 0002 | 352 | 36253 | 36253 || 000205 | 395 | 35253 | 35253 |+----------+---------+--------+---------------+
需要特别注意下,对于over() 中的order by 问题,如果没有order by 排序或则是对partition 字段排序,那么sum 得到的结果是整合partition 中的此列所有值之和,如果是order by 非partition by字段,那么sum得到的是此partition中当前行此列与上一行此列值之和。
order by 非partition by字段
Analytics functions (分析函数)
row_number():分组内记录的顺序,从1开始,记录不重复
select addcode,rscode,scid,row_number() over(partition by rscode order by scid desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+----------------------+| addcode | rscode | scid | row_number_window_0 |+----------+---------+--------+----------------------+| 0002 | 34 | 36251 | 1 || 000205 | 35 | 37253 | 1 || 000202 | 35 | 36258 | 2 || 000208 | 35 | 36253 | 3 || 000205 | 35 | 36252 | 4 || 000201 | 35 | 36251 | 5 || 000206 | 40 | 38257 | 1 || 000206 | 65 | 36258 | 1 || 000208 | 351 | 36257 | 1 || 0002 | 352 | 36253 | 1 || 000205 | 395 | 35253 | 1 |+----------+---------+--------+----------------------+
rank():分组中的排名,排名相等的将在下一个排名位置空出
dense_rank():分组中的排名,排名相等的将在下一个排名位置不会空出,即上图中排名为 1 2 2 3 4
cume_dist():分组中,小于或等于当前值的行数 / 分组中总行数
select addcode,rscode,scid,cume_dist() over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+---------------------+| addcode | rscode | scid | cume_dist_window_0 |+----------+---------+--------+---------------------+| 0002 | 34 | 36251 | 1.0 || 000201 | 35 | 36251 | 0.2 || 000205 | 35 | 36252 | 0.4 || 000208 | 35 | 36253 | 0.6 || 000202 | 35 | 36258 | 0.8 || 000205 | 35 | 37253 | 1.0 || 000206 | 40 | 38257 | 1.0 || 000206 | 65 | 36258 | 1.0 || 000208 | 351 | 36257 | 1.0 || 0002 | 352 | 36253 | 1.0 || 000205 | 395 | 35253 | 1.0 |+----------+---------+--------+---------------------+
percent_rank():分组内当前行的RANK值-1/分组内总行数-1
select addcode,rscode,scid,percent_rank() over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+------------------------+| addcode | rscode | scid | percent_rank_window_0 |+----------+---------+--------+------------------------+| 0002 | 34 | 36251 | 0.0 || 000201 | 35 | 36251 | 0.0 || 000205 | 35 | 36252 | 0.25 || 000208 | 35 | 36253 | 0.5 || 000202 | 35 | 36258 | 0.75 || 000205 | 35 | 37253 | 1.0 || 000206 | 40 | 38257 | 0.0 || 000206 | 65 | 36258 | 0.0 || 000208 | 351 | 36257 | 0.0 || 0002 | 352 | 36253 | 0.0 || 000205 | 395 | 35253 | 0.0 |+----------+---------+--------+------------------------+
lead()/lag() 函数
lag(field,num) :滞后,即分组中向上第num 行field 字段的值
select addcode,rscode,scid,lag(scid,1) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+---------------+| addcode | rscode | scid | lag_window_0 |+----------+---------+--------+---------------+| 0002 | 34 | 36251 | NULL || 000201 | 35 | 36251 | NULL || 000205 | 35 | 36252 | 36251 || 000208 | 35 | 36253 | 36252 || 000202 | 35 | 36258 | 36253 || 000205 | 35 | 37253 | 36258 || 000206 | 40 | 38257 | NULL || 000206 | 65 | 36258 | NULL || 000208 | 351 | 36257 | NULL || 0002 | 352 | 36253 | NULL || 000205 | 395 | 35253 | NULL |+----------+---------+--------+---------------+
lead(field,num) :分组中向下第num 行field 字段的值
select addcode,rscode,scid,lead(scid,1) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+----------------+| addcode | rscode | scid | lead_window_0 |+----------+---------+--------+----------------+| 0002 | 34 | 36251 | NULL || 000201 | 35 | 36251 | 36252 || 000205 | 35 | 36252 | 36253 || 000208 | 35 | 36253 | 36258 || 000202 | 35 | 36258 | 37253 || 000205 | 35 | 37253 | NULL || 000206 | 40 | 38257 | NULL || 000206 | 65 | 36258 | NULL || 000208 | 351 | 36257 | NULL || 0002 | 352 | 36253 | NULL || 000205 | 395 | 35253 | NULL |+----------+---------+--------+----------------+
first_value(): 分组中截止到当前行的第一个值
select addcode,rscode,scid,first_value(scid) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+-----------------------+| addcode | rscode | scid | first_value_window_0 |+----------+---------+--------+-----------------------+| 0002 | 34 | 36251 | 36251 || 000201 | 35 | 36251 | 36251 || 000205 | 35 | 36252 | 36251 || 000208 | 35 | 36253 | 36251 || 000202 | 35 | 36258 | 36251 || 000205 | 35 | 37253 | 36251 || 000206 | 40 | 38257 | 38257 || 000206 | 65 | 36258 | 36258 || 000208 | 351 | 36257 | 36257 || 0002 | 352 | 36253 | 36253 || 000205 | 395 | 35253 | 35253 |+----------+---------+--------+-----------------------+
last_value():分组中截止到当前行的最后一个值
select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+----------------------+| addcode | rscode | scid | last_value_window_0 |+----------+---------+--------+----------------------+| 0002 | 34 | 36251 | 36251 || 000201 | 35 | 36251 | 36251 || 000205 | 35 | 36252 | 36252 || 000208 | 35 | 36253 | 36253 || 000202 | 35 | 36258 | 36258 || 000205 | 35 | 37253 | 37253 || 000206 | 40 | 38257 | 38257 || 000206 | 65 | 36258 | 36258 || 000208 | 351 | 36257 | 36257 || 0002 | 352 | 36253 | 36253 || 000205 | 395 | 35253 | 35253 |+----------+---------+--------+----------------------+
window 子句(控制窗口的子集)
preceding: 往前,例如 2 preceding 往前两行
current row:当前行
following:往后,例如 3 following 往后3行
unbounded preceding: 窗口最前面的行
unbounded following: 窗口最后一行
例如:
//前两行到当前行之间select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid rows between 2 preceding and current row) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";//分组起点到当前行(如果不使用windwow子句,默认是从起点到当前行)select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid rows between unbounded preceding and current row) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";//当前行当分组最后一行select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid rows between current row and unbounded following) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";