一、相关函数说明

  • over():指定聚合函数工作的数据窗口的大小,这个数据窗口大小会随着行的变而变化。over跟在聚合函数后面,只对聚合函数生效。
  • current row:当前行
  • n preceding:往前n行数据
  • n following:往后n行数据
  • unbounded:起点:unbounded preceding 表示从前面的起点开始,unbounded following表示到后面的终点结束
  • lag(col,n):往前第n行数据
  • lead(col,n):往后第n行数据
  • ntile(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,ntile返回此行所属的组的编号。注意:n必须为int类型。

二、实列说明

应用表如下:

hive中的lead和lag hive lead over_大数据


1、创建hive表并导入数据

create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/datas/business.txt" into table business;

2、按需求查询数据
(1)查询在2017年4月份购买过的顾客及总人数

select 
    distinct name,
    count( distinct name) over()
from
    business   
where 
    substring(orderdate,1,7)='2017-04';

或者:

select
    name,
    count(*) over()
from 
    business
where
    substring(orderdate,1,7)='2017-04'
group by name;

查询结果:

hive中的lead和lag hive lead over_hive_02

(2)查询顾客的购买明细及月购买总额

select 
    *,
    sum(cost) over(partition by month(orderdate))
from
    business;

查询结果:

hive中的lead和lag hive lead over_hive中的lead和lag_03

tips:month函数说明:SELECT MONTH('2008-02-03'); # 2

(3)上述的场景, 将每个顾客的cost按照日期进行累加

select 
    *,
    sum(cost) over(partition by name order by orderdate asc
    rows between unbounded preceding and current row)
from 
    business;

查询结果:

hive中的lead和lag hive lead over_大数据_04

select name,orderdate,cost, 
sum(cost) over() as sample1,--所有行相加 
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 
from business;

查询结果:

hive中的lead和lag hive lead over_hive中的lead和lag_05


**tips:**rows必须跟在Order by 子句之后,对排序的结果进行限制,使用固定的行数来限制分区中的数据行数量

(4)查看顾客上次的购买时间

select 
    *,
    lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) last_orderdate
from 
    business;

查询结果:

hive中的lead和lag hive lead over_sql_06


(5)查看顾客后面一次的购买时间

select  
    *,
    lead(orderdate,1,"2050-01-01") over(partition by name order by orderdate) next_orderdate
from 
    business;

查询结果:

hive中的lead和lag hive lead over_大数据_07


(6)查询前20%时间的订单信息

select 
    *,
    ntile(5) over(order by orderdate)
 from 
    business;

查询结果:

hive中的lead和lag hive lead over_sql_08

总结:

1.什么时候用开窗函数?
开窗函数常结合聚合函数使用,一般来讲聚合后的行数要少于聚合前的行数,但是有时我们既想显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数
2.窗口函数的语法:
UDAF() over (PARTITION By col1,col2 order by col3 窗口子句(rows between .. and ..)) AS 列别名
注意:PARTITION By后可跟多个字段,order By只跟一个字段。

partition by子句:
一旦指定了partition by子句,聚合函数的作用范围就是分区之后的数据,这一点和group by 有些类似
 
order by子句:
order by子句对字段进行排序,如果order by子句后面没有跟rows between ** and ** 则表示起点到当前行
的聚合。order by后的rows子句近一步限制聚合函数的作用范围。
  
注意:
(1)order by必须跟在partition by后;
(2)Rows必须跟在Order by子;
(3)(partition by .. order by)可替换为(distribute by .. sort by ..)