hive from values语法 hive show functions_hive


窗口函数(window functions)与普通聚合函数(aggregation functions)的区别

普通聚合函数是将多行具备相同属性的字段聚合为一行,处理的结果中不会对处理前的明细呈现出来。

而窗口函数可以认为是sql 增强函数,可以既展示集合后的数据结果,又可以展示聚合前的数据明细。

下面一个简单的例子对比下:

//表结构+-----------+------------+----------+| col_name  | data_type  | comment  |+-----------+------------+----------+| expodate  | string     |          || addcode   | string     |          || sbtid     | string     |          || rscode    | int        |          || rsname    | string     |          || scid      | int        |          |+-----------+------------+----------+
select addcode,count(distinct scid) scid_num from tb_hive_window group by addcode;+----------+-----------+| addcode  | scid_num  |+----------+-----------+| 0002     | 2         || 000201   | 1         || 000202   | 1         || 000205   | 3         || 000206   | 2         || 000208   | 2         |+----------+-----------+
select expodate,addcode,scid,count(distinct scid) over(partition by addcode)as scid_num from tb_hive_window;+----------------------+----------+--------+-----------+|       expodate       | addcode  |  scid  | scid_num  |+----------------------+----------+--------+-----------+| 2019-12-29 18:02:02  | 0002     | 36253  | 2         || 2019-12-29 18:02:01  | 0002     | 36251  | 2         || 2019-12-29 18:02:01  | 000201   | 36251  | 1         || 2019-12-29 18:02:04  | 000202   | 36258  | 1         || 2019-12-29 18:02:04  | 000205   | 37253  | 3         || 2019-12-29 18:02:04  | 000205   | 35253  | 3         || 2019-12-29 18:02:04  | 000205   | 36252  | 3         || 2019-12-29 18:02:04  | 000206   | 36258  | 2         || 2019-12-29 18:02:04  | 000206   | 38257  | 2         || 2019-12-29 18:02:04  | 000208   | 36257  | 2         || 2019-12-29 18:02:04  | 000208   | 36253  | 2         |+----------------------+----------+--------+-----------+

通过对比上面的例子可以看出,通过窗口查询到了更加详细的明细数据,因此,根据不同的应用场景需求我们需要选择不同的函数处理方案。

sql 执行顺序

一条完整的select sql 语句时的执行顺序如下:

from... where...group by... having.... select ... over()... order by... limit

可以看出,窗口函数是在查询字段select后,order by 之前执行

普通聚合函数+over()

我们可以通过聚合函数sum()/count()/avg()/max()/min() 实现简单的计算

select addcode,rscode,count(*) over(partition by rscode order by rscode desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";select addcode,rscode,scid,avg(scid) over(partition by rscode order by rscode desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";select addcode,rscode,scid,sum(scid) over(partition by rscode order by rscode desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29"+----------+---------+--------+---------------+| addcode  | rscode  |  scid  | sum_window_0  |+----------+---------+--------+---------------+| 000205   | 395     | 35253  | 35253         || 0002     | 352     | 36253  | 36253         || 000208   | 351     | 36257  | 36257         || 000206   | 65      | 36258  | 36258         || 000206   | 40      | 38257  | 38257         || 000202   | 35      | 36258  | 182267        || 000208   | 35      | 36253  | 182267        || 000205   | 35      | 37253  | 182267        || 000205   | 35      | 36252  | 182267        || 000201   | 35      | 36251  | 182267        || 0002     | 34      | 36251  | 36251         |+----------+---------+--------+---------------+ select addcode,rscode,scid,sum(scid) over(partition by rscode order by scid desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29"+----------+---------+--------+---------------+| addcode  | rscode  |  scid  | sum_window_0  |+----------+---------+--------+---------------+| 0002     | 34      | 36251  | 36251         || 000205   | 35      | 37253  | 37253         || 000202   | 35      | 36258  | 73511         || 000208   | 35      | 36253  | 109764        || 000205   | 35      | 36252  | 146016        || 000201   | 35      | 36251  | 182267        || 000206   | 40      | 38257  | 38257         || 000206   | 65      | 36258  | 36258         || 000208   | 351     | 36257  | 36257         || 0002     | 352     | 36253  | 36253         || 000205   | 395     | 35253  | 35253         |+----------+---------+--------+---------------+

需要特别注意下,对于over() 中的order by 问题,如果没有order by 排序或则是对partition 字段排序,那么sum 得到的结果是整合partition 中的此列所有值之和,如果是order by 非partition by字段,那么sum得到的是此partition中当前行此列与上一行此列值之和。


hive from values语法 hive show functions_ci_02

order by 非partition by字段


Analytics functions (分析函数)

row_number():分组内记录的顺序,从1开始,记录不重复

select addcode,rscode,scid,row_number() over(partition by rscode order by scid desc) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+----------------------+| addcode  | rscode  |  scid  | row_number_window_0  |+----------+---------+--------+----------------------+| 0002     | 34      | 36251  | 1                    || 000205   | 35      | 37253  | 1                    || 000202   | 35      | 36258  | 2                    || 000208   | 35      | 36253  | 3                    || 000205   | 35      | 36252  | 4                    || 000201   | 35      | 36251  | 5                    || 000206   | 40      | 38257  | 1                    || 000206   | 65      | 36258  | 1                    || 000208   | 351     | 36257  | 1                    || 0002     | 352     | 36253  | 1                    || 000205   | 395     | 35253  | 1                    |+----------+---------+--------+----------------------+

rank():分组中的排名,排名相等的将在下一个排名位置空出


hive from values语法 hive show functions_hive from values语法_03


dense_rank():分组中的排名,排名相等的将在下一个排名位置不会空出,即上图中排名为 1 2 2 3 4

cume_dist():分组中,小于或等于当前值的行数 / 分组中总行数

select addcode,rscode,scid,cume_dist() over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+---------------------+| addcode  | rscode  |  scid  | cume_dist_window_0  |+----------+---------+--------+---------------------+| 0002     | 34      | 36251  | 1.0                 || 000201   | 35      | 36251  | 0.2                 || 000205   | 35      | 36252  | 0.4                 || 000208   | 35      | 36253  | 0.6                 || 000202   | 35      | 36258  | 0.8                 || 000205   | 35      | 37253  | 1.0                 || 000206   | 40      | 38257  | 1.0                 || 000206   | 65      | 36258  | 1.0                 || 000208   | 351     | 36257  | 1.0                 || 0002     | 352     | 36253  | 1.0                 || 000205   | 395     | 35253  | 1.0                 |+----------+---------+--------+---------------------+

percent_rank():分组内当前行的RANK值-1/分组内总行数-1

select addcode,rscode,scid,percent_rank() over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+------------------------+| addcode  | rscode  |  scid  | percent_rank_window_0  |+----------+---------+--------+------------------------+| 0002     | 34      | 36251  | 0.0                    || 000201   | 35      | 36251  | 0.0                    || 000205   | 35      | 36252  | 0.25                   || 000208   | 35      | 36253  | 0.5                    || 000202   | 35      | 36258  | 0.75                   || 000205   | 35      | 37253  | 1.0                    || 000206   | 40      | 38257  | 0.0                    || 000206   | 65      | 36258  | 0.0                    || 000208   | 351     | 36257  | 0.0                    || 0002     | 352     | 36253  | 0.0                    || 000205   | 395     | 35253  | 0.0                    |+----------+---------+--------+------------------------+

lead()/lag() 函数

lag(field,num) :滞后,即分组中向上第num 行field 字段的值

select addcode,rscode,scid,lag(scid,1) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+---------------+| addcode  | rscode  |  scid  | lag_window_0  |+----------+---------+--------+---------------+| 0002     | 34      | 36251  | NULL          || 000201   | 35      | 36251  | NULL          || 000205   | 35      | 36252  | 36251         || 000208   | 35      | 36253  | 36252         || 000202   | 35      | 36258  | 36253         || 000205   | 35      | 37253  | 36258         || 000206   | 40      | 38257  | NULL          || 000206   | 65      | 36258  | NULL          || 000208   | 351     | 36257  | NULL          || 0002     | 352     | 36253  | NULL          || 000205   | 395     | 35253  | NULL          |+----------+---------+--------+---------------+

lead(field,num) :分组中向下第num 行field 字段的值

select addcode,rscode,scid,lead(scid,1) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+----------------+| addcode  | rscode  |  scid  | lead_window_0  |+----------+---------+--------+----------------+| 0002     | 34      | 36251  | NULL           || 000201   | 35      | 36251  | 36252          || 000205   | 35      | 36252  | 36253          || 000208   | 35      | 36253  | 36258          || 000202   | 35      | 36258  | 37253          || 000205   | 35      | 37253  | NULL           || 000206   | 40      | 38257  | NULL           || 000206   | 65      | 36258  | NULL           || 000208   | 351     | 36257  | NULL           || 0002     | 352     | 36253  | NULL           || 000205   | 395     | 35253  | NULL           |+----------+---------+--------+----------------+

first_value(): 分组中截止到当前行的第一个值

select addcode,rscode,scid,first_value(scid) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+-----------------------+| addcode  | rscode  |  scid  | first_value_window_0  |+----------+---------+--------+-----------------------+| 0002     | 34      | 36251  | 36251                 || 000201   | 35      | 36251  | 36251                 || 000205   | 35      | 36252  | 36251                 || 000208   | 35      | 36253  | 36251                 || 000202   | 35      | 36258  | 36251                 || 000205   | 35      | 37253  | 36251                 || 000206   | 40      | 38257  | 38257                 || 000206   | 65      | 36258  | 36258                 || 000208   | 351     | 36257  | 36257                 || 0002     | 352     | 36253  | 36253                 || 000205   | 395     | 35253  | 35253                 |+----------+---------+--------+-----------------------+

last_value():分组中截止到当前行的最后一个值

select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid ) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";+----------+---------+--------+----------------------+| addcode  | rscode  |  scid  | last_value_window_0  |+----------+---------+--------+----------------------+| 0002     | 34      | 36251  | 36251                || 000201   | 35      | 36251  | 36251                || 000205   | 35      | 36252  | 36252                || 000208   | 35      | 36253  | 36253                || 000202   | 35      | 36258  | 36258                || 000205   | 35      | 37253  | 37253                || 000206   | 40      | 38257  | 38257                || 000206   | 65      | 36258  | 36258                || 000208   | 351     | 36257  | 36257                || 0002     | 352     | 36253  | 36253                || 000205   | 395     | 35253  | 35253                |+----------+---------+--------+----------------------+

window 子句(控制窗口的子集)

preceding: 往前,例如 2 preceding 往前两行

current row:当前行

following:往后,例如 3 following 往后3行

unbounded preceding: 窗口最前面的行

unbounded following: 窗口最后一行

例如:

//前两行到当前行之间select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid rows between 2 preceding and current row) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";//分组起点到当前行(如果不使用windwow子句,默认是从起点到当前行)select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid rows between unbounded preceding and current row) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";//当前行当分组最后一行select addcode,rscode,scid,last_value(scid) over(partition by rscode order by scid rows between current row and unbounded following) from tb_hive_window where substring(expodate,1,10) = "2019-12-29";

Hive Sql 窗口函数暂时就记这么多了,后面再说说Hive Sql 的常用函数和group by 分组处理