一、相关函数说明
- over():指定聚合函数工作的数据窗口的大小,这个数据窗口大小会随着行的变而变化。over跟在聚合函数后面,只对聚合函数生效。
- current row:当前行
- n preceding:往前n行数据
- n following:往后n行数据
- unbounded:起点:unbounded preceding 表示从前面的起点开始,unbounded following表示到后面的终点结束
- lag(col,n):往前第n行数据
- lead(col,n):往后第n行数据
- ntile(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,ntile返回此行所属的组的编号。注意:n必须为int类型。
二、实列说明
应用表如下:
1、创建hive表并导入数据
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;
2、按需求查询数据
(1)查询在2017年4月份购买过的顾客及总人数
select
distinct name,
count( distinct name) over()
from
business
where
substring(orderdate,1,7)='2017-04';
或者:
select
name,
count(*) over()
from
business
where
substring(orderdate,1,7)='2017-04'
group by name;
查询结果:
(2)查询顾客的购买明细及月购买总额
select
*,
sum(cost) over(partition by month(orderdate))
from
business;
查询结果:
tips:month函数说明:SELECT MONTH('2008-02-03'); # 2
(3)上述的场景, 将每个顾客的cost按照日期进行累加
select
*,
sum(cost) over(partition by name order by orderdate asc
rows between unbounded preceding and current row)
from
business;
查询结果:
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;
查询结果:
**tips:**rows必须跟在Order by 子句之后,对排序的结果进行限制,使用固定的行数来限制分区中的数据行数量
(4)查看顾客上次的购买时间
select
*,
lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) last_orderdate
from
business;
查询结果:
(5)查看顾客后面一次的购买时间
select
*,
lead(orderdate,1,"2050-01-01") over(partition by name order by orderdate) next_orderdate
from
business;
查询结果:
(6)查询前20%时间的订单信息
select
*,
ntile(5) over(order by orderdate)
from
business;
查询结果:
总结:
1.什么时候用开窗函数?
开窗函数常结合聚合函数使用,一般来讲聚合后的行数要少于聚合前的行数,但是有时我们既想显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数
2.窗口函数的语法:
UDAF() over (PARTITION By col1,col2 order by col3 窗口子句(rows between .. and ..)) AS 列别名
注意:PARTITION By后可跟多个字段,order By只跟一个字段。
partition by子句:
一旦指定了partition by子句,聚合函数的作用范围就是分区之后的数据,这一点和group by 有些类似
order by子句:
order by子句对字段进行排序,如果order by子句后面没有跟rows between ** and ** 则表示起点到当前行
的聚合。order by后的rows子句近一步限制聚合函数的作用范围。
注意:
(1)order by必须跟在partition by后;
(2)Rows必须跟在Order by子;
(3)(partition by .. order by)可替换为(distribute by .. sort by ..)