hive验证数据格式 hive query

转载

mob64ca13f8eecb 2023-09-20 04:37:24

文章标签 hive验证数据格式窗口函数数据升序 文章分类 Hive 大数据

HiveQL窗口函数

一、窗口函数语法
二、窗口函数类型

数据准备
１、聚合函数

1) sum()
2) avg()
3) count()
4) max()
5) min()

２、序列函数

1) ntile(n)

３、排序函数

1) row_number()
2) rank()
3) dense_rank()
4) cume_dist()
5) percent_rank()

４、窗口函数

1) lead(col,n,DEFAULT)
2) lag(col,n,DEFAULT)

５、极值函数

1) first_value()
2) last_value()

一、窗口函数语法

<窗口函数> over (partition by <列清单> [order by <排序用列清单>] [窗口从句])

窗口从句支持以下格式规范：

从当前行之前某行开始选择:
(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following))
从当前行开始选择:
(rows | range) between current row and (current row | ([num] | unbounded) following)
从当前行之后某行开始选择:
(rows | range) between [num] following and ([num] | unbounded) following

注：

当order by后面缺少窗口从句条件时，窗口从句默认为range between unbounded preceding and current row;
当order by和窗口从句同时缺少时，窗口从句默认为range between unbounded preceding and unbounded following

二、窗口函数类型

数据准备

建立一个测试表student_tmp，数据如下：

-- 窗口函数测试临时表
create table if not exists student_tmp(
student_id varchar(10) comment '学号'
,subject varchar(30) comment '课程'
,score decimal(5,1) comment '分数'
,es_date varchar(10) comment '成绩录入日期'
)
comment '学生成绩表'
insert into table student_tmp values('123','语文',80.5,'2019-09-01');
insert into table student_tmp values('123','数学',85.0,'2019-09-01');
insert into table student_tmp values('123','英语',90.0,'2019-09-02');
insert into table student_tmp values('135','语文',82.5,'2019-09-03');
insert into table student_tmp values('135','历史',80.5,'2019-09-01');
insert into table student_tmp values('246','语文',92.5,'2019-09-02');
insert into table student_tmp values('246','计算机',83.5,'2019-09-02');

hive验证数据格式 hive query_升序

１、聚合函数

1) sum()

累加

2) avg()

求均值

3) count()

计数

4) max()

求最大值

5) min()

求最小值

-- 聚合函数
select *
,sum(score) over(partition by student_id) as s
,sum(score) over(partition by student_id order by score rows between unbounded preceding and unbounded following) as s_cu
,avg(score) over(partition by student_id) as av
,count() over(partition by student_id) as num
,max(score) over(partition by student_id) as max
,min(score) over(partition by student_id) as min
from student_tmp;

执行结果：

hive验证数据格式 hive query_窗口函数_02

２、序列函数

1) ntile(n)

按顺序将数据尽可能平均地分成n组，返回当前分组号

-- 序列函数
select *
,ntile(2) over(partition by student_id) as se_nu
from student_tmp where se_nu = 1;

执行结果：

hive验证数据格式 hive query_数据_03

３、排序函数

1) row_number()

从1开始，按照数据先后顺序生成该条数据在分组内对应的序列数，row_number()的值不会存在重复，当排序的值相同时，按照表中记录的顺序进行排序。
注： row_number()是按照顺序生成对应的序列，而不是按照排序生成序列，所以一般需要与order by结合使用。

2) rank()

从1开始，按照顺序生成对应的排名，排名相等会在名次中留下空位。（比如：1,2,3,3,5…）

3) dense_rank()

从1开始，按照顺序生成对应的排名，排名相等不会在名次中留下空位。（比如：1,2,3,3,4…）

4) cume_dist()

生成分组内排序的比例值。
其值的计算方式是：小于等于当前值的行数/分组内总行数

5) percent_rank()

生成分组内排序的比例值。
其值的计算方式是：（分组内当前行的rank值-1）/（分组内最大的rank值-1）

-- 排序函数
select *
,row_number() over(partition by student_id) as no_rn
,rank() over(partition by student_id order by es_date) as no_r
,dense_rank() over(partition by student_id order by es_date) as no_dr
,cume_dist() over(partition by student_id order by es_date) as rate_cd
,percent_rank() over(partition by student_id order by es_date) as rate_pr
from student_tmp order by student_id;

执行结果：

hive验证数据格式 hive query_升序_04

４、窗口函数

1) lead(col,n,DEFAULT)

用于统计窗口内往下第n行的值。
其中第一个参数为列名，第二个参数为下行数（可选，默认为1），第三个参数为默认值（当往下第n行为NULL时，取默认值，如不指定，则为NULL）

2) lag(col,n,DEFAULT)

与lead()相反。
用于统计窗口内往上第n行的值。
其中第一个参数为列名，第二个参数为上行数（可选，默认为1），第三个参数为默认值（当往下第n行为NULL时，取默认值，如不指定，则为NULL）

-- 窗口函数
select *
,lead(es_date,1,'2999-12-31') over(partition by student_id) as date_next
,lag(es_date,1,'1111-01-01') over(partition by student_id) as date_last
from student_tmp order by student_id;

执行结果：

hive验证数据格式 hive query_升序_05

５、极值函数

1) first_value()

取分组内排序后，截止到当前行的第一个值。
如果指定为升序，则此值为分组内的最小值；如果指定为降序，则为最大值。（默认为升序）

2) last_value()

与first_value()相反。
取分组内排序后，截止到当前行的最后一个值。
如果指定为升序，则此值为分组内的最大值；如果指定为降序，则为最小值。（默认为升序）

-- 极值函数
select *
,first_value(score) over(partition by student_id) as min_p
,last_value(score) over(partition by student_id) as max_p
from student_tmp order by student_id;

执行结果：

hive验证数据格式 hive query_窗口函数_06

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：hive适合使用的场景 hive适合什么环境

下一篇：hive格式化 hive格式化报错未找到命令

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯