Hive窗口函数

原创

mb58bd586c30411 2021-07-15 10:57:13 ©著作权

文章标签 Hive 经验分享 文章分类 Hadoop 大数据

©著作权归作者所有：来自51CTO博客作者mb58bd586c30411的原创作品，请联系作者获取转载授权，否则将追究法律责任

1. 准备数据

2. 函数

2.1 sum(), min(),max(),avg()等聚合函数 ——求解窗口期内的数据的总和

2.2 新增加序号列Ntile, Row_Number(), Rank(), Dense_Rank()

2.3 lag, lead, first_value, last_value 错位

2.4 grouping set, cube, roll up

1. grouping set(key1, key2)

3. cube

4. roll up

1. 准备数据

create table lxy (cookieid int, create_time string, pv int) row format delimited fields terminated by ',';
load data inpath '/user/lxy.csv';

2. 函数

2.1 sum(), min(),max(),avg()等聚合函数 ——求解窗口期内的数据的总和

对一定窗口期内的数据进行聚合

select *,
sum(a.pv) over (partition by cookieid order by create_time rows between 3 preceding and current row) as pv1,
sum(a.pv) over (partition by cookieid order by create_time rows between 2 preceding and 1 following) as pv2
from lxy as a;

Hive窗口函数_经验分享

在这里根据cookieid进行分组，然后按照create_time进行分组，选择不同的窗口进行一定函数的聚合运算。基本的语法是rows between 一个时间点 and 一个时间点。时间点分别可以是以当前行作为参考系，前面几行n preceding或者是后面几行c following，也可以是当前行current row。总之可以想象有一个滑动窗口，我们可以规定一个滑动窗口的中心位置和大小，然后每次画过一个步长，计算一次窗口内的值。

2.2 新增加序号列Ntile, Row_Number(), Rank(), Dense_Rank()

数据源

select *,
   ntile(3) over (partition by cookid2 order by pv) as n1,
   row_number() over (partition by cookid2 order by pv) as n2,
   rank() over (partition by cookid2 order by pv) as n3,
   dense_rank() over (partition by cookid2 order by pv) as n4
from lxy3;

Hive窗口函数_Hive_02

对于Ntile函数，传入的参数N是指要切分成多少份，返回对应的序号，
Row_Number()则是生成一列连续的序号，
Rank()与Row_Number()类似，只是对于数值相同的这一项会同时为相同的序号，下一个序号跳过，比如倒数第二列当中有出现4，4，6没有5；
Dense_Rank()则相反，会紧跟着下一个是紧接着的序号，比如4，4，5。

2.3 lag, lead, first_value, last_value 错位

lag是迟滞的意思，也就是对某一列进行往后错行；
lead是lag的反义词，也就是对某一列进行提前几行；
first_value是对该列到目前为止的首个值，
last_value是到目前行为止的最后一个值。

select *,
   lag(pv, 2) over(partition by cookid2 order by log_date) as lag1,
   lead(pv, 2, 0) over(partition by cookid2 order by log_date) as lead1,
   first_value() over(partition by cookid2 order by log_date) as first_pv,
   first_value() over(partition by cookid2 order by log_date) as last_pv,
   last_value() over(partition by cookid2 order by log_date) as current_last_pv
from lxy3;

Hive窗口函数_经验分享_03

返回结果:LAG和LEAD里面都是传入三个参数，分别是排序的列名，滞后/往前的行数，以及默认填充值。因为我们在这里的LEAD()里面设置默认填充值为0，所以对于cookid后面两行缺失值填充为0。如果我们要返回每个分组下排序后的最后一个数，可以对该组进行DESC的操作，注意ORDER BY对返回的结果很有影响。

2.4 grouping set, cube, roll up

准备数据：

create external table lxw1234 (
   month string,
   day string,
   cookieid string
) row format delimited fields terminated by ',' location '/user/chenlinlin2156233/lxy2/';

1. grouping set(key1, key2)

相当于是对不同字段进行group操作以后，再进行union all的操作。

select month,day,count(distinct cookieid) as count_id,grouping__id
from lxw1234
group by month, day
grouping sets(month, day)
order by grouping__id;

Hive窗口函数_Hive_05

1. grouping_id是自动生成的，是进行了grouping_set()的操作之后。
2. 下划线有两个：grouping__id
3. 需要先做group by操作再传入grouping sets ，等价于先group再union all的做法

3. cube

就是比以上的grouping sets多了一个两列的整合，也就是笛卡尔乘积。

grouping set:

select month,null,count(distinct cookieid) as uv,1 as grouping__id from lxw1234 group by month
union all
select null,day,count(distinct cookieid) as uv,2 as grouping__id from lxw1234 group by day
union all
select month,day,count(distinct cookieid) as uv,3 as grouping__id from lxw1234 group by month,day

Hive窗口函数_经验分享_06

cube:

select month,day,count(distinct cookieid) as count_id,grouping__id
from lxw1234
group by month, day
with cube
order by grouping__id;

4. roll up

select month,day,count(distinct cookieid) as count_id,grouping__id
from lxw1234
group by month, day
with rollup
order by grouping__id;

Hive窗口函数_经验分享_08

rollup返回的结果:可以看到，这个时候就不会返回以右边为关键字的聚合结果，只是返回左边的键以及笛卡尔乘积的结果。

我们如果换一下聚合的关键字month和day的顺序呢？

select month,day,count(distinct cookieid) as count_id,grouping__id
from lxw1234
group by day, month
with rollup
order by grouping__id;

交换关键字以后的返回结果:从上面结果可以看到，关键字的顺序对rollup的结果也是很有影响的。
Hive窗口函数_经验分享_09

如果您看到了这，请点个赞

上一篇：十大经典排序算法

下一篇：教室调度问题—贪婪算法—要求尽可能多的将课程安排在某间教室，如何安排？——结束时间最早的课,动态规划——背包问题——贪心算法

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯