一.SQL中的进阶函数
1.窗口函数
窗口函数和普通聚合函数的区别:
①聚合函数是将多条记录聚合为一条;窗口函数是每条记录都会执行,有几条记录执行完还是几条。
②聚合函数也可以用于窗口函数。
原因就在于窗口函数的执行顺序(逻辑上的)是在FROM,JOIN,WHERE,GROUP BY,HAVING之后,在ORDER BY,LIMIT,SELECT DISTINCT之前。它执行时GROUP BY的聚合过程已经完成了,所以不会再产生数据聚合。
2.窗口函数的基本用法:
函数名 OVER 子句
over关键字用来指定函数执行的窗口范围,若后面括号中什么都不写,则意味着窗口包含满足WHERE条件的所有行,窗口函数基于所有行进行计算;如果不为空,则支持以下4中语法来设置窗口。
①window_name:给窗口指定一个别名。如果SQL中涉及的窗口较多,采用别名可以看起来更清晰易读
②partition by子句:窗口按照哪些字段进行分组,窗口函数在不同的分组上分别执行
③order by子句:按照哪些字段进行排序,窗口函数将按照排序后的记录顺序进行编号
④frame子句:frame是当前分区的一个子集,子句用来定义子集的规则,通常用来作为滑动窗口使用
3.序号函数:row_number(),rank(),dense_rank()
ROW_NUMBER():顺序排序——1、2、3
RANK():并列排序,跳过重复序号——1、1、3
DENSE_RANK():并列排序,不跳过重复序号——1、1、2
4.分布函数:percent_rank(),cume_dist()
percent_rank():
每行按照公式(rank-1) / (rows-1)
进行计算。其中,rank
为RANK()函数
产生的序号,rows
为当前窗口的记录总行数
--给窗口指定别名:WINDOW w AS (PARTITION BY stu_id ORDER BY score) rows = 5
mysql> SELECT
-> RANK() OVER w AS rk,
-> PERCENT_RANK() OVER w AS prk,
-> stu_id, lesson_id, score
-> FROM t_score
-> WHERE stu_id = 1
-> WINDOW w AS (PARTITION BY stu_id ORDER BY score)
-> ;
+----+------+--------+-----------+-------+
| rk | prk | stu_id | lesson_id | score |
+----+------+--------+-----------+-------+
| 1 | 0 | 1 | L003 | 79 |
| 2 | 0.25 | 1 | L002 | 86 |
| 3 | 0.5 | 1 | L004 | 88 |
| 4 | 0.75 | 1 | L005 | 98 |
| 4 | 0.75 | 1 | L001 | 98 |
+----+------+--------+-----------+-------+
cume_dist():
分组内小于、等于当前rank值的行数 / 分组内总行数 eg:查询小于等于当前成绩(score)的比例
--cd1:没有分区,则所有数据均为一组,总行数为8
--cd2:按照lesson_id分成了两组,行数各为4
mysql> SELECT stu_id, lesson_id, score,
-> CUME_DIST() OVER (ORDER BY score) AS cd1,
-> CUME_DIST() OVER (PARTITION BY lesson_id ORDER BY score) AS cd2
-> FROM t_score
-> WHERE lesson_id IN ('L001','L002')
-> ;
+--------+-----------+-------+-------+------+
| stu_id | lesson_id | score | cd1 | cd2 |
+--------+-----------+-------+-------+------+
| 2 | L001 | 84 | 0.125 | 0.25 |
| 1 | L001 | 98 | 0.75 | 0.5 |
| 4 | L001 | 99 | 0.875 | 0.75 |
| 3 | L001 | 100 | 1 | 1 |
| 1 | L002 | 86 | 0.25 | 0.25 |
| 4 | L002 | 88 | 0.375 | 0.5 |
| 2 | L002 | 90 | 0.5 | 0.75 |
| 3 | L002 | 91 | 0.625 | 1 |
+--------+-----------+-------+-------+------+
5.前后函数:lag(expr,n),lead(expr,n)
- 用途:返回位于当前行的前n行(
LAG(expr,n)
)或后n行(LEAD(expr,n)
)的expr的值 - 应用场景:查询前1名同学的成绩和当前同学成绩的差值
mysql> SELECT stu_id, lesson_id, score, pre_score,
-> score-pre_score AS diff
-> FROM(
-> SELECT stu_id, lesson_id, score,
-> LAG(score,1) OVER w AS pre_score
-> FROM t_score
-> WHERE lesson_id IN ('L001','L002')
-> WINDOW w AS (PARTITION BY lesson_id ORDER BY score)) t
-> ;
+--------+-----------+-------+-----------+------+
| stu_id | lesson_id | score | pre_score | diff |
+--------+-----------+-------+-----------+------+
| 2 | L001 | 84 | NULL | NULL |
| 1 | L001 | 98 | 84 | 14 |
| 4 | L001 | 99 | 98 | 1 |
| 3 | L001 | 100 | 99 | 1 |
| 1 | L002 | 86 | NULL | NULL |
| 4 | L002 | 88 | 86 | 2 |
| 2 | L002 | 90 | 88 | 2 |
| 3 | L002 | 91 | 90 | 1 |
+--------+-----------+-------+-----------+------+
6.头尾函数:FIRST_VALUE(expr),LAST_VALUE(expr)
- 用途:返回第一个(
FIRST_VALUE(expr)
)或最后一个(LAST_VALUE(expr)
)expr的值 - 应用场景:截止到当前成绩,按照日期排序查询第1个和最后1个同学的分数
mysql> SELECT stu_id, lesson_id, score, create_time,
-> FIRST_VALUE(score) OVER w AS first_score,
-> LAST_VALUE(score) OVER w AS last_score
-> FROM t_score
-> WHERE lesson_id IN ('L001','L002')
-> WINDOW w AS (PARTITION BY lesson_id ORDER BY create_time)
-> ;
+--------+-----------+-------+-------------+-------------+------------+
| stu_id | lesson_id | score | create_time | first_score | last_score |
+--------+-----------+-------+-------------+-------------+------------+
| 3 | L001 | 100 | 2018-08-07 | 100 | 100 |
| 1 | L001 | 98 | 2018-08-08 | 100 | 98 |
| 2 | L001 | 84 | 2018-08-09 | 100 | 99 |
| 4 | L001 | 99 | 2018-08-09 | 100 | 99 |
| 3 | L002 | 91 | 2018-08-07 | 91 | 91 |
| 1 | L002 | 86 | 2018-08-08 | 91 | 86 |
| 2 | L002 | 90 | 2018-08-09 | 91 | 90 |
| 4 | L002 | 88 | 2018-08-10 | 91 | 88 |
+--------+-----------+-------+-------------+-------------+------------+
二.case专题-商品订单数据
数据表:
订单表orders,大概字段有(order_id'订单号,'user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’,'商品一级类目commodity_level_I','商品二级类目commodity_level_2')
1.求最近7天内每一个一级类目下成交总额排名前3的二级类目:
select commodity_level_1 as '商品一级类目',
commodity_level_2 as '商品二级类目',
total_pay as '成交总额'
from
(
select commodity_level_1,commodity_level_2,total_pay
row_number()over(partition by commodity_level_1 order by a.total_pay desc) as rank
from
(
select commodity_level_1,
commodity_level_2,
sum(order_pay) as total_pay
from orders
where datediff(now(),order_time) <= 7
group by commodity_level_1,commodity_level_2,
) a
) b
where rank <= 3
2.提取8.1-8.10每一天消费金额排名在101-195的user_id
select order_date as '订单日期',
user_id,
total_pay as '消费金额'
from
(
select order_date,user_id,total_pay
row_number()over(partition by order_date order by a.total_pay desc) as rank
from
(
select convert(order_time,date) as order_date,
user_id,
sum(order_pay) as total_pay
from orders
where convert(order_time,date) between '20190801' and '20190810'
group by convert(order_time,date),user_id
) a
) b
where rank between 101 and 195
三.case专题-活动运营数据分析
数据表
表1——订单表orders,大概字段有(user_id‘用户编号’, order_pay‘订单金额’ , order_time‘下单时间’)
表2——活动报名表act_apply,大概字段有(act_id‘活动编号’, user_id‘报名用户’,act_time‘报名时间’)
1.活动运营数据分析-统计每个活动对应所有用户在报名后产生的总订单金额,总订单数
select act_id as '活动编号', COUNT(*) as '订单数', SUM(order_pay) as '总金额'
from orders LEFT JOIN act_apply on orders.user_id=act_apply.user_id
where order_time >= act_time
GROUP BY act_id
2.统计每个活动从开始后到当天(考试日)平均每天产生的订单数,活动开始时间定义为最早有用户报名的时间。(涉及到时间的数据类型均为:datetime)
解题思路:
(1)找到活动最早用户报名的时间 用窗口函数min()over()
(2)计算最早报名时间到当前时间的天数差,datediff(now(),act_start)
select act_id as '活动编号',count(*)/datediff(now(),act_start)
from orders a
left join
(
select act_id,user_id,act_time,min()over(partition by act_id order act_time) as act_start
from act_apply
)b
on a.user_id = b.user_id
where order_time >= act_time
group by act_id,act_start
四.case专题-用户行为路径分析
表1——用户行为表tracking_log,大概字段有(user_id‘用户编号’,opr_id‘操作编号’,log_time‘操作时间’)
1.统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻
解题思路:运用窗口函数在每行数据生成一列数据:下次操作编号
注意要把datetime格式转换成date格式来表示天
select Date,count(*) as 'A-B路径用户计数'
from
(
select distinct user_id,opr_id as 'cur_opr',convert(log_time,date) as 'Date',
lead(opr_id,1)over(partition by user_id,convert(log_time,date) order by log_time) as 'next_opr'
from tracking_log
) t
where t.cur_opr = t.next_opr
group by Date
2.统计用户行为序列为A-B-D的用户数
其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)
select count(user_id) as result
from
(
select user_id,group_concat(opr_id order by log_time) as user_behavior_path
from tracking_log
group by user_id
having (user_behavior_path like '%A%B%D%')
and (user_behavior_path not like '%A%B%C%D%')
) t
五.case专题-用户留存分析
表1——用户登陆表user_log,大概字段有(user_id‘用户编号’,log_time‘登陆时间’)
1.求每天新增用户数,以及他们第2天、30天的留存率
select a.date as '日期',a.new as '新增用户',
b.2_back / a.new as '第2天回访率',
c.30_back / a.new as '第3天回访率',
from
(
select convert(aa.first_time,date) as 'date',count(distinct user_id) as 'new'
from
(
--找到每个用户第一次登陆时间,按第一次登陆日期聚合,统计user_id,即每天新增用户量
select user_id,min(log_time)over(partition by user_id) as first_time,log_time
from user_log
)aa
group by convert(aa.first_time,date)
)a
left join
(
--计算第二天回访的用户量,计算当前时间与第一次登陆的时间差
select convert(bb.first_time,date) as 'date',count(distinct bb.user_id) as '2_back'
from
(
select user_id,min(log_time)over(partition by user_id) as first_time,log_time
from user_log
where datediff(log_time,first_time) = 1
)bb
group by convert(bb.first_time,date)
)b on a.date = b.date
left join
(
--计算第二天回访的用户量,计算当前时间与第一次登陆的时间差
select convert(cc.first_time,date) as 'date',count(distinct cc.user_id) as '30_back'
from
(
select user_id,min(log_time)over(partition by user_id) as first_time,log_time
from user_log
where datediff(log_time,first_time) = 29
)cc
group by convert(bb.first_time,date)
)c on a.date = c.date
2.找近90天,30天,7天的登录人数
select count(distinct case when datediff(now(),log_time)<=90 then user_id else null end) as 90_log_users,
count(distinct case when datediff(now(),log_time)<=30 then user_id else null end) as 30_log_users,
count(distinct case when datediff(now(),log_time)<=7 then user_id else null end) as 7_log_users
from user_log
3.求用户近一个月平均登录时间间隔(按天)
select user_id,avg(diff)
from
(
select user_id,lead(log_time)over(partition user_id order by log_time) - log_time as diff
from user_log
where datediff(now(),log_time)<=30
)
group by user_id
六.case专题-统计特征(中位数,众数,四分位数)
字段:订单号(order_id),店铺id(shop_id),销量(sale),商品id(commodity_id)
1.求每个店铺订单销量排名的中位数
select avg(sale)
from
(
select sale,
count(*)over(partition by shop_id) as total, --shop对应数据行数
--用来计算偶数中位数
cast(count(*)over(partition by shop_id) as decimal) / 2 as even_mid,
--用来计算奇数中位数
ceiling(cast(count(*)over(partition by shop_id) as decimal)/2)as odd_mid,
row_number()over(partition by shop_id order by sale) as sale_rank
from orders
) t
where ( total%2=0 and sale_rank in (even_mid,even_mid+1)) or ( total%2=1 and sale_rank = mid)
group by shop_id
2.求每个店铺订购商品的众数
预备知识:
all(子查询) 表示子查询中的所有数据 eg: >= all(子查询) 表示大于等于子查询中所有数据
any(子查询)表示子查询中的任一数据 eg: >= any(子查询) 表示只有大于等于子查询中任意一个数据 就返回true
select shop_id,commodity_id,order_num
from
(
select shop_id,commodity_id,order_num,max(order_num)over(partition by shop_id) as max_order_num
(
select shop_id,commodity_id,count(*) as order_num
from orders
group by shop_id,commodity_id
) t1
) t2
where order_num = max_order_num
3.求四分位数
三个表
T1:good_id,cate_id(分类)
T2:mall_id(店铺), good_id
T3:mall_id, credit_score(信用分)
问,在不同分类中,店铺的信用分前100和求top25%
select cate_id,mall_id,credit_score
from
(
select cate_id,mall_id,
credit_score,
row_number()over(partition by cate_id order by credit_score desc) as score_rn
count(*)over(partition by cate_id) as mall_total
from
( --选择商品分类id,店铺id,商品信用分数
select T1.cate_id,T2.mall_id,credit_score
from T1 left join (T2 left join T3 on T2.mall_id = T3.mall_id) on T1.good_id = T2.good_id
group by T1.cate_id,T2.mall_id
)t
)
where score_rn <= mall_total*0.25
七.case专题-GMV周同比统计
字段:时间(sale_date),店铺类别(cate_id),店铺数量(mall_num),gmv
1.拼多多618前后一周内各店铺类别gmv的日均提升幅度和比例
select cate_id,avg(gmv_diff),avg(gmv_rate)
from
(
select sale_date,cate_id,mall_num,gmv,
(lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) as gmv_diff,
(lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) / gmv as gmv_rate
from T
where sale_date between '20190611' and '20190624'
)t
where sale_date between '20190611' and '20190617'
group by cate_id,mall_num,gmv