题目一:
用户-视频互动表tb_user_video_log
id | uid | video_id | start_time | end_time | if_follow | if_like | if_retweet | comment_id |
1 | 101 | 2001 | 2021-09-01 10:00:00 | 2021-09-01 10:00:20 | 0 | 1 | 1 | NULL |
2 | 105 | 2002 | 2021-09-10 11:00:00 | 2021-09-10 11:00:30 | 1 | 0 | 1 | NULL |
3 | 101 | 2001 | 2021-10-01 10:00:00 | 2021-10-01 10:00:20 | 1 | 1 | 1 | NULL |
4 | 102 | 2001 | 2021-10-01 10:00:00 | 2021-10-01 10:00:15 | 0 | 0 | 1 | NULL |
5 | 103 | 2001 | 2021-10-01 11:00:50 | 2021-10-01 11:01:15 | 1 | 1 | 0 | 1732526 |
6 | 106 | 2002 | 2021-10-01 10:59:05 | 2021-10-01 11:00:05 | 2 | 0 | 0 | NULL |
(uid-用户ID, video_id-视频ID, start_time-开始观看时间, end_time-结束观看时间, if_follow-是否关注, if_like-是否点赞, if_retweet-是否转发, comment_id-评论ID)
短视频信息表tb_video_info
id | video_id | author | tag | duration | release_time |
1 | 2001 | 901 | 影视 | 30 | 2021-01-01 07:00:00 |
2 | 2002 | 901 | 美食 | 60 | 2021-01-01 07:00:00 |
3 | 2003 | 902 | 旅游 | 90 | 2020-01-01 07:00:00 |
4 | 2004 | 902 | 美女 | 90 | 2020-01-01 08:00:00 |
(video_id-视频ID, author-创作者ID, tag-类别标签, duration-视频时长, release_time-发布时间)
问题:计算2021年里每个创作者每月的涨粉率及截止当月的总粉丝量
注:
- 涨粉率=(加粉量 - 掉粉量) / 播放量。结果按创作者ID、总粉丝量升序排序。
- if_follow-是否关注为1表示用户观看视频中关注了视频创作者,为0表示此次互动前后关注状态未发生变化,为2表示本次观看过程中取消了关注。
输出示例:
示例数据的输出结果如下
author | month | fans_growth_rate | total_fans |
901 | 2021-09 | 0.500 | 1 |
901 | 2021-10 | 0.250 | 2 |
解释:
示例数据中表tb_user_video_log里只有视频2001和2002的播放记录,都来自创作者901,播放时间在2021年9月和10月;其中9月里加粉量为1,掉粉量为0,播放量为2,因此涨粉率为0.500(保留3位小数);其中10月里加粉量为2,掉份量为1,播放量为4,因此涨粉率为0.250,截止当前总粉丝数为2。
思路:
第一步:计算每人每月的粉丝量,播放量(按照人和月份分组)
粉丝量:1对应的减去-1对应的粉丝和
播放量:count(*)或者count(1),
涨粉率=粉丝量/播放量
第二步:对每月的粉丝量累计sum得到截至粉丝量(计算每个人的粉丝量,按照月累计即可所以partition by人order by月)
select author,month,round(fans_growth_number/counts,3) as fans_growth_rate,
sum(fans_growth_number) over (partition by author order by month) as total_fans
from(select author,DATE_FORMAT(start_time,'%Y-%m') as month,
sum(case when if_follow = 1 then 1
when if_follow = 2 then -1
else 0 end) as fans_growth_number,
count(*) as counts
from tb_user_video_log as t1
inner join tb_video_info as t2
on t1.video_id = t2.video_id
where year(start_time)=2021
group by author,month) t
order by author,total_fans
题目二:
现有试卷作答记录表exam_record(uid用户ID, exam_id试卷ID, start_time开始作答时间, submit_time交卷时间, score得分):
id | uid | exam_id | start_time | submit_time | score |
1 | 1001 | 9001 | 2020-01-01 09:01:01 | 2020-01-01 09:21:59 | 90 |
2 | 1002 | 9001 | 2020-01-20 10:01:01 | 2020-01-20 10:10:01 | 89 |
3 | 1002 | 9001 | 2020-02-01 12:11:01 | 2020-02-01 12:31:01 | 83 |
4 | 1003 | 9001 | 2020-03-01 19:01:01 | 2020-03-01 19:30:01 | 75 |
5 | 1004 | 9001 | 2020-03-01 12:01:01 | 2020-03-01 12:11:01 | 60 |
6 | 1003 | 9001 | 2020-03-01 12:01:01 | 2020-03-01 12:41:01 | 90 |
7 | 1002 | 9001 | 2020-05-02 19:01:01 | 2020-05-02 19:32:00 | 90 |
8 | 1001 | 9002 | 2020-01-02 19:01:01 | 2020-01-02 19:59:01 | 69 |
9 | 1004 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:20:01 | 99 |
10 | 1003 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:31:01 | 68 |
11 | 1001 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:43:01 | 81 |
12 | 1001 | 9002 | 2020-03-02 12:11:01 | (NULL) | (NULL) |
请输出每份试卷每月作答数和截止当月的作答总数。
由示例数据结果输出如下:
exam_id | start_month | month_cnt | cum_exam_cnt |
9001 | 202001 | 2 | 2 |
9001 | 202002 | 1 | 3 |
9001 | 202003 | 3 | 6 |
9001 | 202005 | 1 | 7 |
9002 | 202001 | 1 | 1 |
9002 | 202002 | 3 | 4 |
9002 | 202003 | 1 | 5 |
解释:试卷9001在202001、202002、202003、202005共4个月有被作答记录,每个月被作答数分别为2、1、3、1,截止当月累积作答总数为2、3、6、7。
思路:
select exam_id,
start_month,
month_cnt,
sum(month_cnt) over(partition by exam_id order by start_month) as cum_exam_cnt
from (
select
exam_id,
date_format(start_time,'%Y%m') as start_month,
count(*) as month_cnt
from exam_record
group by exam_id,date_format(start_time,'%Y%m')
)t
步骤:
1.根据试卷和日期进行分组,并且进行计数
2.对计数后的进行求和,按月累计就是order by 月,每份试卷就是partition by 试卷
题目三:
现有试卷作答记录表exam_record(uid用户ID, exam_id试卷ID, start_time开始作答时间, submit_time交卷时间, score得分):
id | uid | exam_id | start_time | submit_time | score |
1 | 1001 | 9001 | 2020-01-01 09:01:01 | 2020-01-01 09:21:59 | 90 |
2 | 1002 | 9001 | 2020-01-20 10:01:01 | 2020-01-20 10:10:01 | 89 |
3 | 1002 | 9001 | 2020-02-01 12:11:01 | 2020-02-01 12:31:01 | 83 |
4 | 1003 | 9001 | 2020-03-01 19:01:01 | 2020-03-01 19:30:01 | 75 |
5 | 1004 | 9001 | 2020-03-01 12:01:01 | 2020-03-01 12:11:01 | 60 |
6 | 1003 | 9001 | 2020-03-01 12:01:01 | 2020-03-01 12:41:01 | 90 |
7 | 1002 | 9001 | 2020-05-02 19:01:01 | 2020-05-02 19:32:00 | 90 |
8 | 1001 | 9002 | 2020-01-02 19:01:01 | 2020-01-02 19:59:01 | 69 |
9 | 1004 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:20:01 | 99 |
10 | 1003 | 9002 | 2020-02-02 12:01:01 | 2020-02-02 12:31:01 | 68 |
11 | 1001 | 9002 | 2020-01-02 19:01:01 | 2020-02-02 12:43:01 | 81 |
12 | 1001 | 9002 | 2020-03-02 12:11:01 | (NULL) | (NULL) |
请输出自从有用户作答记录以来,每月的试卷作答记录中月活用户数、新增用户数、截止当月的单月最大新增用户数、截止当月的累积用户数。结果按月份升序输出。
由示例数据结果输出如下:
start_month | mau | month_add_uv | max_month_add_uv | cum_sum_uv |
202001 | 2 | 2 | 2 | 2 |
202002 | 4 | 2 | 2 | 4 |
202003 | 3 | 0 | 2 | 4 |
202005 | 1 | 0 | 2 | 4 |
month | 1001 | 1002 | 1003 | 1004 |
202001 | 1 | 1 |
|
|
202002 | 1 | 1 | 1 | 1 |
202003 | 1 |
| 1 | 1 |
202005 |
| 1 |
|
|
由上述矩阵可以看出,2020年1月有2个用户活跃(mau=2),当月新增用户数为2;
2020年2月有4个用户活跃,当月新增用户数为2,最大单月新增用户数为2,当前累积用户数为4。
思路1:
第一步选出活跃用户:
select date_format(start_time,'%Y%m') as month,count(distinct uid) as 月活用户数
from exam_record
group by month
第二步选出新增用户:
select date_format(start_time,'%Y%m') dt,count(distinct uid) as 新增 from exam_record
where (date_format(start_time,'%Y%m'),uid) in
(
select min(date_format(start_time,'%Y%m')) as dt,uid
from exam_record
group by uid
)
group by dt
第三步:使用月份左连接
select * from
(
select date_format(start_time,'%Y%m') as month,count(distinct uid) as 月活用户数
from exam_record
group by month
) a
left join
(select date_format(start_time,'%Y%m') dt,count(distinct uid) as 新增 from exam_record
where (date_format(start_time,'%Y%m'),uid) in
(
select min(date_format(start_time,'%Y%m')) as dt,uid
from exam_record
group by uid
)
group by dt
) b
on a.month = b.dt
第四步:计算最大和累计
select a.month,a.月活用户数,ifnull(b.新增,0),
max(ifnull(b.新增,0)) over(order by a.month),
sum(ifnull(b.新增,0)) over (order by a.month)
from
(
select date_format(start_time,'%Y%m') as month,count(distinct uid) as 月活用户数
from exam_record
group by month
) a
left join
(select date_format(start_time,'%Y%m') dt,count(distinct uid) as 新增 from exam_record
where (date_format(start_time,'%Y%m'),uid) in
(
select min(date_format(start_time,'%Y%m')) as dt,uid
from exam_record
group by uid
)
group by dt
) b
on a.month = b.dt
思路二:
1.首先用窗口函数构造辅助列new_day(当月登录是否为最早登陆月):
2.计算每个月的new-day就是新增数,对新增计算max,sum分别为最大和累计
SELECT start_month ,#每个月
COUNT(DISTINCT uid) mau, #月活用户数
count(new_day) month_add_uv, #新增用户
MAX(count(new_day))OVER(ORDER BY start_month) max_month_add_uv, #截止当月的单月最大新增用户数
SUM(count(new_day))OVER(ORDER BY start_month) cum_sum_uv
FROM (
SELECT *,DATE_FORMAT(start_time,'%Y%m') start_month,
IF(start_time=MIN(start_time)OVER(PARTITION BY uid),1,null) new_day
FROM exam_record
)t1
GROUP BY start_month;
思路三:
1.对每位用户的月份进行排序,排名为1的月份就是新增月份
select
uid, date_format(start_time, '%Y%m') as start_month
, row_number() over(partition by uid order by start_time) as rn
from exam_record
注意:这里使用row_number,dense_rank,rank均可
2.对每个月的1计数得到的就是每个月的新增
select start_month,
count(distinct uid) as mau
, count(if(rn=1, uid, null)) as month_add_uv
from (
select
uid, date_format(start_time, '%Y%m') as start_month
, row_number() over(partition by uid order by start_time) as rn
from exam_record
) t
group by start_month
3.然后加入max,sum
select
start_month
, count(distinct uid) as mau
, count(if(rn=1, uid, null)) as month_add_uv
, max(count(if(rn=1, uid, null))) over(order by start_month) as max_month_add_uv
, sum(count(if(rn=1, uid, null))) over(order by start_month) as cum_sum_uv
from (
select
uid, date_format(start_time, '%Y%m') as start_month
, row_number() over(partition by uid order by start_time) as rn
from exam_record
) t
group by start_month