Hive 窗口函数和分析函数- Windowing and Analytics Functions

1.前言

使用 Hive 或 Impala 做数据查询或探索的时候,经常会使用到窗口函数/分析函数,做数据处理的时候也会经常用到窗口函数来做分组计算或者数据去重。Hive 的窗口函数、分析函数有很多,可能日常工作中习惯了使用某一类窗口函数,其他的功能有所忽略,那么今天我们便来一起回顾和学习 Hive 的窗口函数/分析函数吧~

2.窗口函数和分析函数分类

hive 窗口函数ROLLUP hive 窗口函数执行顺序_分析函数

3.数据说明

为了让示例更加通俗易懂,选取了 2019-2020 赛季3个欧洲足球联赛(英超、西甲、意甲) 球队积分榜前5名词的球队数据,总共15行,一目了然。数据如下:

SELECT * FROM league_scores ORDER BY league, scores DESC;

SELECT * FROM league_scores ORDER BY league, scores DESC;

hive 窗口函数ROLLUP hive 窗口函数执行顺序_窗口函数_02

下面我们便逐一使用窗口函数和分析函数来一探究竟吧~

4 窗口函数之计算功能

聚合函数可以配合窗口语句 OVER() 来使用,达到分组统计的效果。不同点是聚合函数只返回每组的一行聚合计算结果,而窗口函数是返回所有的行。

OVER(PARTITION BY [columns] ORDER BY [columns])

OVER 窗口子句(WINDOW clauses):

  • PRECEDING:往前
  • FOLLOWING:往后
  • CURRENT ROW:当前行
  • UNBOUNDED:起点 结合PRECEDING,FOLLOWING使用
  • UNBOUNDED PRECEDING 表示该窗口最前面的行(起点)
  • UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)

当配置了 ORDER BY,之后的窗口子句, 默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:表示从起点到当前行;

当 ORDER BY 没有配置,且无窗口子句,则默认的是  ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:表示从起点到终点。

我们以求和 sum() 函数为例,来看看不同的变化, 三大联赛按积分从高至低排序来做计算:

SELECT league, club, scores,
       -- 截至当前行的积分总和
       SUM(scores) OVER(ORDER BY scores desc) as scores_total,
       -- 从起点至当前行, 结果同 scores_total, 动态的累加和
       SUM(scores) OVER(ORDER BY scores desc
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS scores_total_dynamic,
       -- 从起点到终点, 静态的总和
       SUM(scores) OVER(ORDER BY scores desc
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS scores_total_static,
       -- 往前2行 + 当前行 + 往后1行的积分总和
       SUM(scores) OVER(ORDER BY scores desc
                        ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS scores_total_2
FROM league_scores

SELECT league, club, scores,
       -- 截至当前行的积分总和
       SUM(scores) OVER(ORDER BY scores desc) as scores_total,
       -- 从起点至当前行, 结果同 scores_total, 动态的累加和
       SUM(scores) OVER(ORDER BY scores desc
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS scores_total_dynamic,
       -- 从起点到终点, 静态的总和
       SUM(scores) OVER(ORDER BY scores desc
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS scores_total_static,
       -- 往前2行 + 当前行 + 往后1行的积分总和
       SUM(scores) OVER(ORDER BY scores desc
                        ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS scores_total_2
FROM league_scores

截取前5行如下:可看到不加窗口子句的 score_total 和 scores_total_dynamic 都是动态的累加和,值是相同的,scores_total_static 则是静态累计和,是所有球队积分求和的一个固定值。而score_total_2 是我们自定义的累加和,ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING 代表:往前2行 + 当前行 + 往后1行的积分总和,以第3行为例,尤文的前两名,利物浦 99 + 皇马 87,再加尤文后面的国米 82,再加上自己 83,总积分和为 351,没错!

hive 窗口函数ROLLUP hive 窗口函数执行顺序_窗口函数_03

5 窗口函数之取值功能

5.1 FIRST_VALUE()

first_value(expr) OVER([partition_by_clause] order_by_clause [window_clause]) 

返回窗口中的第一个值,跟 ORDER BY 的条件有关,我们来求每个联赛分数从高至低排序出现的第一个值:

SELECT league, club, scores,
       FIRST_VALUE(scores) OVER(PARTITION BY league ORDER BY scores DESC) AS first_value
FROM league_scores
ORDER BY league, scores DESC;

SELECT league, club, scores,
       FIRST_VALUE(scores) OVER(PARTITION BY league ORDER BY scores DESC) AS first_value
FROM league_scores
ORDER BY league, scores DESC;

hive 窗口函数ROLLUP hive 窗口函数执行顺序_hive 按指定顺序排序_04

5.2 LAST_VALUE()

last_value(expr) OVER([partition_by_clause] order_by_clause [window_clause]) last_value() 

求窗口范围内满足条件的最后一个值,跟 first_value() 不同的是, 不指定窗口子句,last_value() 返回的是窗口内动态的最后一个值,我们来求每个联赛积分从少到多顺序排序的最后一个值

SELECT league, club, scores,
       LAST_VALUE(scores) OVER(PARTITION BY league ORDER BY scores) AS last_value_1
FROM league_scores;

SELECT league, club, scores,
       LAST_VALUE(scores) OVER(PARTITION BY league ORDER BY scores) AS last_value_1
FROM league_scores;

可以看到 last_value_1 的值和 score 是一样的,是表示动态的截至当前最后一个值

hive 窗口函数ROLLUP hive 窗口函数执行顺序_分析函数_05

如果我需要的是一个窗口内最后一个固定的值来做相关的后续计算呢?由窗口子句的表达式可知道,我们需要再 ORDER BY 后加 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 语句来保持静态

-- 静态 last_value()
SELECT league, club, scores,
       LAST_VALUE(scores) OVER(PARTITION BY league ORDER BY scores
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_value
FROM league_scores;

-- 静态 last_value()
SELECT league, club, scores,
       LAST_VALUE(scores) OVER(PARTITION BY league ORDER BY scores
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_value
FROM league_scores;

可以看到这次的 last_value 是固定的积分最多的一个值了。

hive 窗口函数ROLLUP hive 窗口函数执行顺序_hive 按指定顺序排序_06

6窗口函数之排序功能

窗口函数来做排序应该是我们日常用得最多的一种功能,在给定的窗口内,按给定条件排序,返回排序后的序号,通常用做分组排序和数据去重使用。

6.1 ROW_NUMBER()

返回不重复的从1开始的行号,每个联赛按积分从高至低排序,积分相同也会不会出现行号的跳跃。

SELECT league, club, scores, ROW_NUMBER() OVER(PARTITION BY league ORDER BY scores desc) AS _rnk
FROM league_scores
ORDER BY league;

SELECT league, club, scores, ROW_NUMBER() OVER(PARTITION BY league ORDER BY scores desc) AS _rnk
FROM league_scores
ORDER BY league;

hive 窗口函数ROLLUP hive 窗口函数执行顺序_窗口函数_07

6.2 RANK()

返回每个窗口范围内按给定条件排序的名次,会出现并列的名次,比如曼联和切尔西同积66分并列联赛第3名,那么下一个62分的莱斯特城便会当成第5名,所以排序号是稀疏的。

SELECT league, club, scores, RANK() OVER(PARTITION BY league ORDER BY scores desc) AS _rnk
FROM league_scores;

SELECT league, club, scores, RANK() OVER(PARTITION BY league ORDER BY scores desc) AS _rnk
FROM league_scores;

hive 窗口函数ROLLUP hive 窗口函数执行顺序_hive 按指定顺序排序_08

6.3 DENSE_RANK()

假如说我希望是连续的排名,比如上面莱斯特城我希望在曼联和切尔西并列第3名之后,可以从第4名开始算排名,那么这时候便可以用 dense_rank(),密集窗口排序函数来解决

SELECT league, club, scores, DENSE_RANK() OVER(PARTITION BY league ORDER BY scores desc) AS _dense_rnk
FROM league_scores;

SELECT league, club, scores, DENSE_RANK() OVER(PARTITION BY league ORDER BY scores desc) AS _dense_rnk
FROM league_scores;

hive 窗口函数ROLLUP hive 窗口函数执行顺序_窗口函数_09

6.4 NTILE() 数据分桶切片函数

ntile() 可以做数据分片使用,平时自己用的不多,但在一些场景下还是很方便,比如一批数据需要按某一个度量来分成N片,取第M片。举个例子,三大联赛按积分从高至低排序分成3片,来区分三大联赛第一梯队的球队有哪些,某个球队落在第几梯队。

SELECT league, club, scores,
       NTILE(3) OVER(ORDER BY scores desc) AS rnk_3
FROM league_scores
ORDER BY rnk_3, scores desc;

SELECT league, club, scores,
       NTILE(3) OVER(ORDER BY scores desc) AS rnk_3
FROM league_scores
ORDER BY rnk_3, scores desc;

可以看到,按积分排名,排名前三分之一的第一梯队由英超利物浦一骑绝尘,西甲皇马巴萨和意甲尤文国米分占两席。同样选择 rnk=2 或 rnk=3 的可以选择查看中间及最后三分之一的球队。

hive 窗口函数ROLLUP hive 窗口函数执行顺序_分析函数_10

7 窗口函数之序功能

7.1 PERCENT_RANK() 排名百分比函数

percent_rank(T expr) OVER ([partition_by_clause] order_by_clause)

percent_rank 的计算逻辑为 (rank - 1) / (rows_in_group - 1) 我们按三大联赛积分从高至低排序,用 percent_rank 可以查询到每支球队的积分排名的百分比

SELECT a.league, club, scores, _rnk,
       (a._rnk - 1) / (b.rows_in_group - 1) AS percent_rank_calculate, 
       PERCENT_RANK() OVER(ORDER BY a.scores desc) AS percent_rank_use_func
FROM (
    SELECT league, club, scores,
           RANK() OVER(ORDER BY scores DESC) AS _rnk
    FROM league_scores
    ) a,
    (
    SELECT COUNT(*) AS rows_in_group FROM league_scores
    ) b;

SELECT a.league, club, scores, _rnk,
       (a._rnk - 1) / (b.rows_in_group - 1) AS percent_rank_calculate, 
       PERCENT_RANK() OVER(ORDER BY a.scores desc) AS percent_rank_use_func
FROM (
    SELECT league, club, scores,
           RANK() OVER(ORDER BY scores DESC) AS _rnk
    FROM league_scores
    ) a,
    (
    SELECT COUNT(*) AS rows_in_group FROM league_scores
    ) b;

_rnk 为积分排名,percent_rank_calculate 为我们自己按照公式计算的百分比,percent_rank_use_func 为函数计算的百分比,可看到两者值的相同的。

具体值的意义,比如国米和巴萨同积82分,是在三大联赛数据里排名前 21.43%。以上是自己想到的场景,可能具体数据分析的场景有特定的使用方式。

hive 窗口函数ROLLUP hive 窗口函数执行顺序_分析函数_11

总结

Hive 的窗口函数及分析函数基本介绍完了,还有一些比较冷门的函数就不再一一介绍。代码看一百遍不如自己实践一遍,希望大家都可以运用好窗口函数,可以提高一些效率或者提供不一样的分析问题的思路~