hive 窗口函数lag hive 窗口函数求占比

转载

mob64ca1415f0ab 2023-09-01 19:47:17

文章标签 hive 窗口函数lag hive 配置文件窗口函数 文章分类 Hive 大数据

文章目录

一、窗口函数
二、自定义函数
三、案例

一、窗口函数

如果要查询详细记录和聚合函数，必须要经过两次查询，比较麻烦，这个时候使用窗口函数，会方便很多

hive 窗口函数lag hive 窗口函数求占比_hive 窗口函数lag

hive 窗口函数lag hive 窗口函数求占比_hive_02

（1）窗口函数是针对每一行数据的

（2）如果over中没有指定参数，则默认窗口大小为全部结果集

注意：开窗函数不能写在where里面

--求01课程的平均成绩
select
*,
avg(sc.s_score) over()
from mydb2.score sc
where sc.c_id='01'

--求每门课的平均成绩
select 
*,
avg(sc.s_score) over(distribute by sc.c_id) avg_score
from mydb2.score 
;

--求每门功课在学生的总成绩中的占比情况
selct
*,
sc.s_score/sum(sc.s_score) over(distribute by sc.s_id)  sum_score
from mgdb2.score sc

排名函数
第一种函数：row_number从1开始，按照顺序，生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列

第二种函数：RANK()生成数据项在分组中的排名，排名相等会在名次中留下空位

hive 窗口函数lag hive 窗口函数求占比_窗口函数_03

第三种函数：DENSE_RANK()生成数据项在分组中的排名，排名相等会在名次中不会留下空位

hive 窗口函数lag hive 窗口函数求占比_hive 窗口函数lag_04

--返回每位同学的每门功课的排名情况
select
*,
row_number() over(distribute by sc.c_id sort by sc.s_score desc) rn,
rank() over(distribute by sc.c_id sort by sc.s_score desc) r,
dense_rank() over(distribute by sc.c_id sort by sc.s_score desc) sr
from mydb2.score sc

--求每门功课的前三名
select
*
from
(
select
*,
row_number() over(distribute by sc.c_id sort by sc.s_score desc) rn
from mydb2.score sc
)t
where t.rn<4;

--求每门功课与最高分之间的差值
select
*,
max(sc.s_score) over(distribute by sc.c_id) - sc.s_score  diff_score
from mydb2.score sc

distribute by子句
在over窗口中进行分组，对某一字段进行分组统计，窗口的大小就是一个组的所有记录

序列函数

（1）fist_value、last_value
first_value：第一个版本
last_value：最后一个版本

select
*,
first_value(sc.s_score) over (patition by sc.c_id order by sc.s_score desc) max_score,
last_value(sc.s_score) over (patition by sc.c_id order by sc.s_score desc) min_score,

hive 窗口函数lag hive 窗口函数求占比_窗口函数_05

应当修改成下面这种

select
*,
first_value(sc.s_score) over (patition by sc.c_id order by sc.s_score desc rows between unbounded preceding and unbounded following) max_score,
last_value(sc.s_score) over (patition by sc.c_id order by sc.s_score desc rows between unbounded preceding and unbounded following) min_score,

hive 窗口函数lag hive 窗口函数求占比_hive 窗口函数lag_06

默认窗口

hive 窗口函数lag hive 窗口函数求占比_hive_07

window子句
给窗口设定窗口大小
默认的窗口大小都是从第一行到当前行
窗口的带下可以设置成以下：
（1）preceding：往前
（2）following：往后
（3）current row：当前行
（4）unbounded：边界
（5）unbounded preceding：表示起点
（6）unbounded following：表示终点
窗口大小的写法：
rows between and
lag、lead
lag返回当前数据行的前第n行的数据
lead返回当前数据行的后第n行的数据
用法：
lag(columnName,offset,default_value)
lead(columnName,offset,default_value)

--求出学生每次考试与前一次考试的成绩差
select
*,
lag(sc.s_score,1,sc.s_score) over(distribute by sc.s_id sort by sc.c_id asc) as up_score  --default_value:表示如果上一次考试为空，则给一个默认值，也就是sc.s_score
from mydb2.score sc
;

ntile
ntile 是Hive很强大的一个分析函数。可以看成是：它把有序的数据集合平均分配到指定的数量（num）个桶中, 将桶号分配给每一行。如果不能平均分配，则优先分配较小编号的桶，并且各个桶中能放的行数最多相差1。

--求每门功课成绩最好的前20%的学生
分析：每门功课分成5份
select
*,
ntile(5) over(distribute by sc.c_id sort by sc.s_score desc) ranks
from mydb2.score sc
;

distribute by和partition by的区别
distribute by对应MapReduce中的自定义分组比较器
partition by是自定义分区

sort by和order by的区别

cume_dist
percentile
percent_rank

窗口函数：所有的聚合函数都可以是窗口函数
窗口大小：默认是从第一行到最后一行，但是max

什么情况下使用窗口函数？
既有明细数据，又有返回汇总函数

二、自定义函数

自定义函数应当要遵循实现单一的功能原则

为什么要自定义函数？
hive的内置函数不能满足所有的业务需求

hive提供了很多模块都可以自定义功能：
比如自定义函数、serde（序列化反序列化函数）、输入输出格式等
UDF(User-Defined-Function) ----- 一进一出

UDAF（User- Defined Aggregation Funcation） ----- 多进一出（聚合函数MR）

UDTF(User-Defined Table-Generating Functions) ----- 一进多出（生成多行结果）

编写UDF
方式1.继承UDF，重写evaluate()，允许重载
方式2.继承genericUDF，重写initlizer()、getdisplay()、evaluate()

-- 小写转大写

hive 窗口函数lag hive 窗口函数求占比_窗口函数_08

UDF使用方式
第一种：本session有效
1.将自定义的jar上传到服务器，并将jar包添加到hive的class path中

hive>add jar jar包名；

或者配置hive.aux.jars.path
将文件放到这个路径中

2.添加一个临时函数名

create temporary function toupcase as

hive 窗口函数lag hive 窗口函数求占比_hive 窗口函数lag_09

3.测试自定义函数的方法

（1）

hive 窗口函数lag hive 窗口函数求占比_配置文件_10

（2）带from的测试

create table if not exists dual(url string);
insert into table dual values(''); --保证这个表有一条记录即可

hive 窗口函数lag hive 窗口函数求占比_窗口函数_11

4.用完之后删除函数

临时的session也可以不删除直接退出hive>drop function if exists toupcase;

会报错：

hive 窗口函数lag hive 窗口函数求占比_窗口函数_12

解决方式

hive>drop temporary function if exists toupcase;第二种方式：

1.上传jar包

2.创建一个配置文件，将上面的创建临时函数的语句写入到配置文件中

hive在启动的时候，可以指定加载这个配置文件

将下面的内容添加进配置文件hive-init文件

add jar jar包名

create temporary function toupcase as

hive 窗口函数lag hive 窗口函数求占比_hive 窗口函数lag_13

加载方式为hive -i /root/hive-init

第三种方式：
hive的安装目录的bin目录下创建一个文件，文件名叫.hiverc
在配置文件中将上面的创建临时函数的语句写入到配置文件
直接启动hive，不需要加载配置文件

hive 窗口函数lag hive 窗口函数求占比_配置文件_14

查看是否添加成功：show functions;

第四种方式：编译源码

三、案例

--生日转换成年龄
输入：1990-12-01
输入：29

hive 窗口函数lag hive 窗口函数求占比_hive_15

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：ruby 镜像 ruby界面

下一篇：hive 结构体类型如何转成字符串 hive desc表结构

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯