hive sql 根据条件分类连续计数

转载

梦断蓝桥魂 2024-09-18 19:36:38

文章标签 hive sql 根据条件分类连续计数数据配置参数表名 文章分类 Hive 大数据

数据分析笔试中累加问题是非常常见的考题，今天我们用一个函数来搞定它

sum over(partition by 分组列 order by 排序列 rows between 开始位置 preceding and 结束位置 following)

其中'开始位置'和'结束位置'可配置参数：数据、current、UNBOUNDED

问题1

数据集有三列：userid，month，count，统计每个用户截止到当月为止的最大单月访问次数和累计到该月的总访问次数

select 
  Userid
  ,month
  ,max(count) over (partition by userid,month) as max_cnt
  ,sum(count) over (partition by userid,month) as sum_cnt
from 
  Table

问题2

有这样一张表，记录了每个月的营业额，数据如下：

表名：test
字段：month｜money

统计截止到每个月份的营业总额

-- 方法1

select sum(money) from test where month='201901'
union all
select sum(money) from test where month<='201902'
union all 
select sum(money) from test where month<='201903'

优点：这是最容易想到的解决方法，实现起来又方便
缺点：此方法虽能解决问题，可略显笨重，且同一张表扫描次数过多，执行速度缓慢，不可取。

-- 方法2

select
    sum(case when month=201901 then money else null end) as sum_money_01,
    sum(case when month<=201902 then money else null end) as sum_money_02,
    sum(case when month<=201903 then money else null end) as sum_money_03,
    sum(case when month<=201904 then money else null end) as sum_money_04,
    sum(case when month<=201905 then money else null end) as sum_money_05,
    sum(case when month<=201906 then money else null end) as sum_money_06,
    sum(case when month<=201907 then money else null end) as sum_money_07,
    sum(case when month<=201908 then money else null end) as sum_money_08,
    sum(case when month<=201909 then money else null end) as sum_money_09,
    sum(case when month<=201910 then money else null end) as sum_money_10,
    sum(case when month<=201911 then money else null end) as sum_money_11,
    sum(case when month<=201912 then money else null end) as sum_money_12
from 
    test

"""
优点：对test表只进行了一次扫描，判断month是属于哪个范围，符合写定范围则进行聚合运算，否则置为null利用sum函数不计算null值的特性，得到我们想要的答案。从某种程度上来说，case when 在此处发挥的是where的功能。
缺点：写法还是有些复杂，这是12个月我们就需要写12个语句，若是需求更加的细化，比如计算一年中截止到每一天的营业额，我们就需要写365个语句，显然是不可能的，可拓展性不高
"""

-- 方法3

select 
    t0.month,
    sum(t1.money)
from
    (select month,money from test)t0
join
    (select month,money from test)t1
on t0.month>=t1.month
group by t0.month

"""
优点：此种写法拓展性比较高，不用关心计算粒度多大，都能实现需求
缺点：实际上此处进行的是一种不完全的笛卡尔积，增加了数据量，若数据量比较大的情况下可能会导致执行速度下降。
"""

--方法4

select 
    month,
    sum(money) over(order by month rows between unbounded preceding and current row) as sum_money
from 
    test

"""
preceding：往前
following：往后
current row：当前行
unbounded：无界限（起点或终点）
unbounded preceding：表示从前面的起点
unbounded following：表示到后面的终点

当order by后面缺少窗口从句条件，窗口规范默认是
row between unbounded preceding and current row. （从起点到当前行）
当order by和窗口从句都缺失, 窗口规范默认是
row between unbounded preceding and unbounded following.
（从起点到终点）
"""

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。