窗口函数 spark sql 窗口函数和group by的区别

转载

技术笔耕者 2023-09-23 15:13:07

文章标签 窗口函数 spark sql 数据分析窗口函数字段类目 文章分类 Spark 大数据

一.SQL中的进阶函数

1.窗口函数

窗口函数和普通聚合函数的区别：

①聚合函数是将多条记录聚合为一条；窗口函数是每条记录都会执行，有几条记录执行完还是几条。

②聚合函数也可以用于窗口函数。

原因就在于窗口函数的执行顺序（逻辑上的）是在FROM，JOIN，WHERE，GROUP　BY，HAVING之后，在ORDER　BY，LIMIT，SELECT　DISTINCT之前。它执行时GROUP　BY的聚合过程已经完成了，所以不会再产生数据聚合。

2.窗口函数的基本用法：

函数名 OVER 子句

over关键字用来指定函数执行的窗口范围,若后面括号中什么都不写,则意味着窗口包含满足WHERE条件的所有行，窗口函数基于所有行进行计算；如果不为空,则支持以下4中语法来设置窗口。

①window_name:给窗口指定一个别名。如果SQL中涉及的窗口较多,采用别名可以看起来更清晰易读

②partition by子句：窗口按照哪些字段进行分组,窗口函数在不同的分组上分别执行

③order by子句：按照哪些字段进行排序,窗口函数将按照排序后的记录顺序进行编号

④frame子句：frame是当前分区的一个子集,子句用来定义子集的规则,通常用来作为滑动窗口使用

3.序号函数:row_number(),rank(),dense_rank()

ROW_NUMBER():顺序排序——1、2、3

RANK():并列排序，跳过重复序号——1、1、3

DENSE_RANK():并列排序，不跳过重复序号——1、1、2

4.分布函数:percent_rank(),cume_dist()

percent_rank():

每行按照公式(rank-1) / (rows-1)进行计算。其中，rank为RANK()函数产生的序号，rows为当前窗口的记录总行数

--给窗口指定别名：WINDOW w AS (PARTITION BY stu_id ORDER BY score) rows = 5
mysql> SELECT
    -> RANK() OVER w AS rk,
    -> PERCENT_RANK() OVER w AS prk,
    -> stu_id, lesson_id, score
    -> FROM t_score
    -> WHERE stu_id = 1
    -> WINDOW w AS (PARTITION BY stu_id ORDER BY score)
    -> ;
+----+------+--------+-----------+-------+
| rk | prk  | stu_id | lesson_id | score |
+----+------+--------+-----------+-------+
|  1 |    0 |      1 | L003      |    79 |
|  2 | 0.25 |      1 | L002      |    86 |
|  3 |  0.5 |      1 | L004      |    88 |
|  4 | 0.75 |      1 | L005      |    98 |
|  4 | 0.75 |      1 | L001      |    98 |
+----+------+--------+-----------+-------+

cume_dist():

分组内小于、等于当前rank值的行数 / 分组内总行数 eg:查询小于等于当前成绩（score）的比例

--cd1：没有分区，则所有数据均为一组，总行数为8
--cd2：按照lesson_id分成了两组，行数各为4
mysql> SELECT stu_id, lesson_id, score,
    -> CUME_DIST() OVER (ORDER BY score) AS cd1,
    -> CUME_DIST() OVER (PARTITION BY lesson_id ORDER BY score) AS cd2
    -> FROM t_score
    -> WHERE lesson_id IN ('L001','L002')
    -> ;
+--------+-----------+-------+-------+------+
| stu_id | lesson_id | score | cd1   | cd2  |
+--------+-----------+-------+-------+------+
|      2 | L001      |    84 | 0.125 | 0.25 |
|      1 | L001      |    98 |  0.75 |  0.5 |
|      4 | L001      |    99 | 0.875 | 0.75 |
|      3 | L001      |   100 |     1 |    1 |
|      1 | L002      |    86 |  0.25 | 0.25 |
|      4 | L002      |    88 | 0.375 |  0.5 |
|      2 | L002      |    90 |   0.5 | 0.75 |
|      3 | L002      |    91 | 0.625 |    1 |
+--------+-----------+-------+-------+------+

5.前后函数:lag(expr,n),lead(expr,n)

用途：返回位于当前行的前n行（LAG(expr,n)）或后n行（LEAD(expr,n)）的expr的值
应用场景：查询前1名同学的成绩和当前同学成绩的差值

mysql> SELECT stu_id, lesson_id, score, pre_score,
    -> score-pre_score AS diff
    -> FROM(
    ->     SELECT stu_id, lesson_id, score,
    ->     LAG(score,1) OVER w AS pre_score
    ->     FROM t_score
    ->     WHERE lesson_id IN ('L001','L002')
    ->     WINDOW w AS (PARTITION BY lesson_id ORDER BY score)) t
    -> ;
+--------+-----------+-------+-----------+------+
| stu_id | lesson_id | score | pre_score | diff |
+--------+-----------+-------+-----------+------+
|      2 | L001      |    84 |      NULL | NULL |
|      1 | L001      |    98 |        84 |   14 |
|      4 | L001      |    99 |        98 |    1 |
|      3 | L001      |   100 |        99 |    1 |
|      1 | L002      |    86 |      NULL | NULL |
|      4 | L002      |    88 |        86 |    2 |
|      2 | L002      |    90 |        88 |    2 |
|      3 | L002      |    91 |        90 |    1 |
+--------+-----------+-------+-----------+------+

6.头尾函数：FIRST_VALUE(expr),LAST_VALUE(expr)

用途：返回第一个（FIRST_VALUE(expr)）或最后一个（LAST_VALUE(expr)）expr的值
应用场景：截止到当前成绩，按照日期排序查询第1个和最后1个同学的分数

mysql> SELECT stu_id, lesson_id, score, create_time,
    -> FIRST_VALUE(score) OVER w AS first_score,
    -> LAST_VALUE(score) OVER w AS last_score
    -> FROM t_score
    -> WHERE lesson_id IN ('L001','L002')
    -> WINDOW w AS (PARTITION BY lesson_id ORDER BY create_time)
    -> ;
+--------+-----------+-------+-------------+-------------+------------+
| stu_id | lesson_id | score | create_time | first_score | last_score |
+--------+-----------+-------+-------------+-------------+------------+
|      3 | L001      |   100 | 2018-08-07  |         100 |        100 |
|      1 | L001      |    98 | 2018-08-08  |         100 |         98 |
|      2 | L001      |    84 | 2018-08-09  |         100 |         99 |
|      4 | L001      |    99 | 2018-08-09  |         100 |         99 |
|      3 | L002      |    91 | 2018-08-07  |          91 |         91 |
|      1 | L002      |    86 | 2018-08-08  |          91 |         86 |
|      2 | L002      |    90 | 2018-08-09  |          91 |         90 |
|      4 | L002      |    88 | 2018-08-10  |          91 |         88 |
+--------+-----------+-------+-------------+-------------+------------+

二.case专题-商品订单数据

数据表:

订单表orders,大概字段有（order_id'订单号,'user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’,'商品一级类目commodity_level_I','商品二级类目commodity_level_2'）

1.求最近7天内每一个一级类目下成交总额排名前3的二级类目:

select commodity_level_1 as '商品一级类目',
			 commodity_level_2 as '商品二级类目',
			 total_pay as '成交总额'
from
(
  select commodity_level_1,commodity_level_2,total_pay
  			 row_number()over(partition by commodity_level_1 order by a.total_pay desc) as rank
  from
  (
    select commodity_level_1,
    			 commodity_level_2,
    			 sum(order_pay) as total_pay
    from orders
    where datediff(now(),order_time) <= 7
    group by commodity_level_1,commodity_level_2,
  ) a
) b
where rank <= 3

2.提取8.1-8.10每一天消费金额排名在101-195的user_id

select order_date as '订单日期',
			 user_id,
			 total_pay as '消费金额'
from
(
  select order_date,user_id,total_pay
  			 row_number()over(partition by order_date order by a.total_pay desc) as rank
  from
  (
    select convert(order_time,date) as order_date,
    			 user_id,
    			 sum(order_pay) as total_pay
    from orders
    where convert(order_time,date) between '20190801' and '20190810'
    group by convert(order_time,date),user_id
  ) a
) b
where rank between 101 and 195

三.case专题-活动运营数据分析

数据表

表1——订单表orders,大概字段有（user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’)

表2——活动报名表act_apply,大概字段有（act_id‘活动编号’, user_id‘报名用户’,act_time‘报名时间’）

1.活动运营数据分析-统计每个活动对应所有用户在报名后产生的总订单金额，总订单数

select act_id as '活动编号', COUNT(*) as '订单数', SUM(order_pay) as '总金额'
from orders LEFT JOIN act_apply on orders.user_id=act_apply.user_id 
where order_time >= act_time
GROUP BY act_id

2.统计每个活动从开始后到当天（考试日）平均每天产生的订单数，活动开始时间定义为最早有用户报名的时间。（涉及到时间的数据类型均为：datetime）

解题思路:

(1)找到活动最早用户报名的时间用窗口函数min()over()

(2)计算最早报名时间到当前时间的天数差,datediff(now(),act_start)

select act_id as '活动编号',count(*)/datediff(now(),act_start)
from orders a
left join
(
  select act_id,user_id,act_time,min()over(partition by act_id order act_time) as act_start
  from act_apply
)b
on a.user_id = b.user_id
where order_time >= act_time
group by act_id,act_start

四.case专题-用户行为路径分析

表1——用户行为表tracking_log，大概字段有（user_id‘用户编号’,opr_id‘操作编号’,log_time‘操作时间’）

1.统计每天符合以下条件的用户数：A操作之后是B操作，AB操作必须相邻

解题思路:运用窗口函数在每行数据生成一列数据:下次操作编号

注意要把datetime格式转换成date格式来表示天

select Date,count(*) as 'A-B路径用户计数'
from
(
  select distinct user_id,opr_id as 'cur_opr',convert(log_time,date) as 'Date',
         lead(opr_id,1)over(partition by user_id,convert(log_time,date) order by log_time) 					 as 'next_opr'
  from tracking_log
) t
where t.cur_opr = t.next_opr
group by Date

2.统计用户行为序列为A-B-D的用户数

其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)

select count(user_id) as result
from
(
	select user_id,group_concat(opr_id order by log_time) as user_behavior_path
	from tracking_log
	group by user_id
	having (user_behavior_path like '%A%B%D%') 
  			 and (user_behavior_path not like '%A%B%C%D%')
) t

五.case专题-用户留存分析

表1——用户登陆表user_log，大概字段有（user_id‘用户编号’，log_time‘登陆时间’)

1.求每天新增用户数，以及他们第2天、30天的留存率

select a.date as '日期',a.new as '新增用户',
			 b.2_back / a.new as '第2天回访率',
			 c.30_back / a.new as '第3天回访率',
from
(
  select convert(aa.first_time,date) as 'date',count(distinct user_id) as 'new'
  from
  (
    --找到每个用户第一次登陆时间,按第一次登陆日期聚合,统计user_id,即每天新增用户量
    select user_id,min(log_time)over(partition by user_id) as first_time,log_time
    from user_log
  )aa
  group by convert(aa.first_time,date)
)a
left join
(
  --计算第二天回访的用户量,计算当前时间与第一次登陆的时间差
  select convert(bb.first_time,date) as 'date',count(distinct bb.user_id) as '2_back'
  from
  (
    select user_id,min(log_time)over(partition by user_id) as first_time,log_time
    from user_log
    where datediff(log_time,first_time) = 1
  )bb
  group by convert(bb.first_time,date)
)b on a.date = b.date
left join
(
  --计算第二天回访的用户量,计算当前时间与第一次登陆的时间差
  select convert(cc.first_time,date) as 'date',count(distinct cc.user_id) as '30_back'
  from
  (
    select user_id,min(log_time)over(partition by user_id) as first_time,log_time
    from user_log
    where datediff(log_time,first_time) = 29
  )cc
  group by convert(bb.first_time,date)
)c on a.date = c.date

2.找近90天,30天,7天的登录人数

select count(distinct case when datediff(now(),log_time)<=90 then user_id else null end) as 90_log_users,
	   count(distinct case when datediff(now(),log_time)<=30 then user_id else null end) as 30_log_users,
	   count(distinct case when datediff(now(),log_time)<=7 then user_id else null end) as 7_log_users
from user_log

3.求用户近一个月平均登录时间间隔（按天）

select user_id,avg(diff)
from
(
	select user_id,lead(log_time)over(partition user_id order by log_time) - log_time as diff
	from user_log
	where datediff(now(),log_time)<=30
)
group by user_id

六.case专题-统计特征(中位数,众数,四分位数)

字段：订单号(order_id),店铺id(shop_id),销量(sale),商品id(commodity_id)

1.求每个店铺订单销量排名的中位数

select avg(sale)
from
(
  select sale,
  			 count(*)over(partition by shop_id) as total, --shop对应数据行数
   			 --用来计算偶数中位数
  			 cast(count(*)over(partition by shop_id) as decimal) / 2 as even_mid,
				 --用来计算奇数中位数
  			 ceiling(cast(count(*)over(partition by shop_id) as decimal)/2)as odd_mid,
  			 row_number()over(partition by shop_id order by sale) as sale_rank
  from orders
) t
where ( total%2=0 and sale_rank in (even_mid,even_mid+1)) or ( total%2=1 and sale_rank = mid)
group by shop_id

2.求每个店铺订购商品的众数

预备知识:

all(子查询) 表示子查询中的所有数据 eg: >= all(子查询) 表示大于等于子查询中所有数据

any(子查询)表示子查询中的任一数据 eg: >= any(子查询) 表示只有大于等于子查询中任意一个数据就返回true

select shop_id,commodity_id,order_num
from
(
	select shop_id,commodity_id,order_num,max(order_num)over(partition by shop_id) as max_order_num
	(
		select shop_id,commodity_id,count(*) as order_num
		from orders
		group by shop_id,commodity_id
	) t1
) t2
where order_num = max_order_num

3.求四分位数

三个表

T1：good_id,cate_id(分类)

T2:mall_id(店铺), good_id

T3:mall_id, credit_score(信用分)

问，在不同分类中,店铺的信用分前100和求top25%

select cate_id,mall_id,credit_score
from
(
	select cate_id,mall_id,
		   credit_score,
		   row_number()over(partition by cate_id order by credit_score desc) as score_rn
		   count(*)over(partition by cate_id) as mall_total
	from
	(	--选择商品分类id,店铺id,商品信用分数
		select T1.cate_id,T2.mall_id,credit_score
		from T1 left join (T2 left join T3 on T2.mall_id = T3.mall_id) on T1.good_id = T2.good_id
		group by T1.cate_id,T2.mall_id
	)t
)
where score_rn <= mall_total*0.25

七.case专题-GMV周同比统计

字段:时间(sale_date),店铺类别(cate_id),店铺数量(mall_num),gmv

1.拼多多618前后一周内各店铺类别gmv的日均提升幅度和比例

select cate_id,avg(gmv_diff),avg(gmv_rate)
from
(
	select sale_date,cate_id,mall_num,gmv,
  			 (lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) as gmv_diff,
  			 (lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) / gmv as gmv_rate
	from T
	where sale_date between '20190611' and '20190624'
)t
where sale_date between '20190611' and '20190617'
group by cate_id,mall_num,gmv

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。