用户行为数据:埋点
业务交互数据:业务流程产生的登陆 订单 用户 商品 支付 等有关的数据 通常存储在DB中
0.创建gmall数据库
1.创建ODS层
- 原始数据层:外部表,ods_start_log
- 时间日志表:ods_event_log
创建输入数据是LZO,输出是text,支持json解析的分区表
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE `ods_start_log`(`line` string)
PARTITIONED BY (`dt` string)
STORED AS
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
因为我们的HDFS上元数据的存储格式就是Lzo,所以hive表的输入lzo
因为元数据是一个json字符串需要进行进一步的解析,所以输出用text
ODS层数据加载脚本编写,企业开发中脚本执行时间一般在每日凌晨30分至1点
注意:
[ -n 变量值 ] 判断变量的值,是否为空
-- 变量的值,非空,返回true
-- 变量的值,为空,返回false
2.创建DWD层
DWD层数据解析:对ODS层数据进行清洗(去除空值,脏数据,超过极限范围的数据,行式存储改为列存储,改压缩格式)
2.1创建基础明细表(总共两张表start/event)
明细表用于存储ODS层原始表转换过来的明细数据
基础明细表分析
启动/事件日志基础明细表数据:
drop table if exists dwd_base_start_log;
CREATE EXTERNAL TABLE `dwd_base_start_log`(
`mid_id` string,
`user_id` string,
`version_code` string,
`version_name` string,
`lang` string,
`source` string,
`os` string,
`area` string,
`model` string,
`brand` string,
`sdk_version` string,
`gmail` string,
`height_width` string,
`app_time` string,
`network` string,
`lng` string,
`lat` string,
`event_name` string,
`event_json` string,
`server_time` string)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_start_log/';
存储格式为parquet,其中event_name和event_json用来对应事件名和整个事件
这个地方将原始日志1对1的形式拆分出来,操作的时候我们将原始日志展平,需要用到UDF和UDTF
自定义UDF函数(一进一出):
用于解析公共字段:BaseFieldUDF extends UDF
:将公共字段解析为以“\t”为分隔符的字符串
自定义UDTF(一进多出):
用于将et字段的Json串解析成event_name 和event_jsonEventJsonUDTF extends GenericUDTF
:见简书 https://www.jianshu.com/p/49a11951eb30
解析启动日志基础明细表:set hive.exec.dynamic.partition.mode=nonstrict;
设置动态分区为非严格模式
insert overwrite table dwd_base_start_log
PARTITION (dt)
select
mid_id,
user_id,
version_code,
version_name,
lang,
source ,
os ,
area ,
model ,
brand ,
sdk_version ,
gmail ,
height_width ,
app_time ,
network ,
lng ,
lat ,
event_name ,
event_json ,
server_time ,
dt
from
(
select
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0] as mid_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1] as user_id,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2] as version_code,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3] as version_name,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4] as lang,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5] as source,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6] as os,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7] as area,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8] as model,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9] as brand,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time,
dt
from ods_start_log where dt='2019-02-10' and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log lateral view flat_analizer(ops) tmp_k as event_name, event_json;
字段如下:
DWD层明细表:
get_json_object(event_json,'$.kv.action') action
get_json_object(event_json,'$.kv.newsid') newsid
get_json_object(event_json,'$.kv.place') place
get_json_object(event_json,'$.kv.extend1') extend1
1. 商品点击表
2. 商品详情页表
3. 商品列表页
4. 广告表
5. 消息通知表
6.用户前台活跃(event_name='active_foreground')
7. 用户后台活跃(event_name='active_background')
8. 评论表(event_name='comment')
9. 收藏表
10. 点赞表
11. 启动日志表
12. 错误日志表 dwd_error_log
DWD层明细表脚本编写,脚本执行时间一般在凌晨0点30分至1点
业务知识储备:
业务术语: https://www.jianshu.com/p/9dccbaa8e42c
系统函数:
把同一个分组的不同行的数据聚合成一个集合collect_set;
hive (gmall)> select * from stud;
stud.name stud.area stud.course stud.score
zhang3 bj math 88
li4 bj math 99
wang5 sh chinese 92
zhao6 sh chinese 54
tian7 bj chinese 91
hive(gmall)>select course,collect_set(area),avg(score)
from stud
group by course;
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
日期处理函数:
(1)date_format函数(根据格式整理日期)
>select date_format('2019-02-10','yyyy-MM');
2019-02
(2)date_add函数(加减日期)
select date_add('2019-02-10',-1);
20019-02-09
(3)next_day函数
取当前天的下周一
>select next_day('2019-02-12','MO');
2019-02-18
取当前周的周一
>select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11
(4)last_day函数(求当月最后一天日期)
>select last_day('2019-02-10');
2019-02-28
用户需求:用户活跃主题
DWS层
每日活跃设备明细:dws_uv_detail_day
每周活跃用户层:dws_uv_detail_wk
(mid monday_date sunday_date)
insert overwrite table dws_uv_datail_wk partition (wk_dt)
每月设备活跃明细:dws_uv_detail_mn
DWS层数据加载脚本编写,执行一般在每日凌晨30分至1点
ADS层
目标:当日 当周 当月活跃设备数量ads_uv_count
建表语句
drop table if exists ads_uv_count;
create external table ads_uv_count(
`dt` string COMMENT '统计日期',
`day_count` bigint COMMENT '当日用户数量',
`wk_count` bigint COMMENT '当周用户数量',
`mn_count` bigint COMMENT '当月用户数量',
`is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
`is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '每日活跃用户数量'
stored as parquet
location '/warehouse/gmall/ads/ads_uv_count_day/';
用户需求:用户新增主题
DWS层(每日新增设备明细)
每日新增设备明细表 dws_new_mid_day
(主要字段:mid_id create_date)
导入数据:
insert into table dws_new_mid_day
select
ud.mid_id,
ud.user_id ,
ud.version_code ,
ud.version_name ,
ud.lang ,
ud.source,
ud.os,
ud.area,
ud.model,
ud.brand,
ud.sdk_version,
ud.gmail,
ud.height_width,
ud.app_time,
ud.network,
ud.lng,
ud.lat,
'2019-02-10'
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-02-10' and nm.mid_id is null;
ADS层(每日新增设备表) :
drop table if exists `ads_new_mid_count`;
create table `ads_new_mid_count`(
create_date String comment '创建时间',
new_mid_count BigInt comment '新增设备数量'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
导入数据:
insert into table ads_new_mid_count
select create_date,count(*)
from dws_new_mid_day
where create_date='2019-02-10'
group by create_date
用户需求:用户留存
用户留存概念:某段时间内的新增用户,经过一段时间后,又继续使用的用户
DWS(每日留存用户明细表dws_user_retention_day):
按天进行分区保存,每天计算一次前n天的留存明细
drop table if exists `dws_user_retention_day`;
create table `dws_user_retention_day`
(
`mid_id` string COMMENT '设备唯一标识',
`user_id` string COMMENT '用户标识',
`version_code` string COMMENT '程序版本号',
`version_name` string COMMENT '程序版本名',
`lang` string COMMENT '系统语言',
`source` string COMMENT '渠道号',
`os` string COMMENT '安卓系统版本',
`area` string COMMENT '区域',
`model` string COMMENT '手机型号',
`brand` string COMMENT '手机品牌',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT '屏幕宽高',
`app_time` string COMMENT '客户端日志产生时的时间',
`network` string COMMENT '网络模式',
`lng` string COMMENT '经度',
`lat` string COMMENT '纬度',
`create_date` string comment '设备新增时间',
`retention_day` int comment '截止当前日期留存天数'
) COMMENT '每日用户留存情况'
PARTITIONED BY ( `dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/';
导入数据:
2019-02-11当天计算前一天的用户留存人明细计算(count(*)即可算出留存人数):
insert overwrite table dws_user_retention_day partition(dt = "2019-02-10")
select
nm.mid_id,
nm.user_id ,
nm.version_code ,
nm.version_name ,
nm.lang ,
nm.source,
nm.os,
nm.area,
nm.model,
nm.brand,
nm.sdk_version,
nm.gmail,
nm.height_width,
nm.app_time,
nm.network,
nm.lng,
nm.lat,
nm.create_date,
1 retention_day
from dws_uv_detail_day ud
join dws_new_mid_day nm
on ud.mid_id=nm.mid_id
where ud.dt='2019-02-11' #2-11仍然在活跃状态的用户
and nm.create_date=date_add('2019-02-11',-1); #2-10号创建的账户
依次可以算出前2天 前三天,前四天....一直到前n天的用户留存明细表,最后union all并insert into
每日留存用户明细表中即可
ADS层用户留存数:
创建表(每日用户留存表ads_user_retention_day_count)
drop table if exists `ads_user_retention_day_count`;
create table `ads_user_retention_day_count`
(
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
retention_count bigint comment '留存数量'
) COMMENT '每日用户留存情况'
stored as parquet
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
导入数据:
insert into table ads_user_retention_day_count
select
create_date,
retention_day,
count(*) retention_count
from dws_user_retention_day
where dt='2019-02-11'
group by create_date,retention_day;
留存用户比率(建表ads_user_retention_day_rate):
create table `ads_user_retention_day_rate`
(
`stat_date` string comment '统计日期',
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
`retention_count` bigint comment '留存数量',
`new_mid_count` string comment '当日设备新增数量',
`retention_ratio` decimal(10,2) comment '留存率'
) COMMENT '每日用户留存情况'
stored as parquet
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
导入数据:
insert into table ads_user_retention_day_rate
select
'2019-02-11' ,
ur.create_date,
ur.retention_day,
ur.retention_count ,
nc.new_mid_count,
ur.retention_count/nc.new_mid_count*100
from
(
select
create_date,
retention_day,
count(*) retention_count
from `dws_user_retention_day`
where dt='2019-02-11'
group by create_date,retention_day
) ur join ads_new_mid_count nc on nc.create_date=ur.create_date;
在线教育项目(李国龙老师计算七日留存率):
create table tmp.seven_days_retained_analysis_${day}(
register_day INT,
zero_interval_retained_rate DOUBLE,
one_interval_retained_rate DOUBLE,
two_interval_retained_rate DOUBLE,
three_interval_retained_rate DOUBLE,
four_interval_retained_rate DOUBLE,
five_interval_retained_rate DOUBLE,
six_interval_retained_rate DOUBLE,
dt INT
)row format delimited fields terminated by '\t';
SQL实现:
//获取近7天全部用户的注册信息
select
uid,
dt as register_day,
event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
and event_key = "registerAccount"
//获取近7天每日活跃的用户列表
select
uid,
dt as active_day,
max(event_time) as event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
group by uid,dt
//两者整合,生成uid register_day active_day,interval(活跃时距离注册日期几天)
select
t1.uid,
t1.register_day,
t2.active_day,
datediff(from_unixtime(t2.event_time,"yyyy-MM-dd"),from_unixtime(t1.event_time,"yyyy-MM-dd")) as day_interval
from(
select uid,dt as register_day,event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
and event_key = "registerAccount") t1
left join(
select uid,dt as active_day,max(event_time) as event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
group by uid,dt) t2
on t1.uid = t2.uid
//根据以上的表生成留存用户数临时表
drop table if exists tmp.user_retained_${startDay}_${endDay};
create table if not exists tmp.user_retained_${startDay}_${endDay}
as
select register_day,day_interval,count(1) as retained
from (
select
t1.uid,t1.register_day,t2.active_day,
datediff(from_unixtime(t2.event_time,"yyyy-MM-dd"),from_unixtime(t1.event_time,"yyyy-MM-dd")) as day_interval
from(
select uid,dt as register_day,event_time from dwd.user_behavior where dt between ${startDay} and ${endDay} and event_key = "registerAccount") t1
left join(
select uid,dt as active_day,max(event_time) as event_time
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
group by uid,dt) t2
on t1.uid = t2.uid) tmp
group by register_day,day_interval
结果:
20190402 0 27000
20190402 1 19393
20190402 2 14681
20190402 3 9712
20190402 4 5089
20190402 5 1767
20190402 6 1775
//计算7日留存率
drop table if exists tmp.user_retained_rate_${startDay}_${endDay};
create table if not exists tmp.user_retained_rate_${startDay}_${endDay}
as
select register_day,day_interval,
round(retained / register_cnt,4) as retained_rate,
current_dt from (
select t1.register_day,t1.day_interval,t1.retained,t2.register_cnt,${endDay} as current_dt
from
(select register_day,day_interval,retained
from tmp.user_retained_${startDay}_${endDay}) t1
left join
(select dt,count(1) as register_cnt
from dwd.user_behavior
where dt between ${startDay} and ${endDay}
and event_key = "registerAccount"
group by dt) t2
on t1.register_day = t2.dt
group by t1.register_day,t1.day_interval ,t1.retained,t2.register_cnt) tmp2
//列转行
insert overwrite table tmp.seven_days_retained_analysis_${day}
select
register_day,
max(case when day_interval = 0 then retained_rate else 0 end) as zero_interval_retained_rate,
max(case when day_interval = 1 then retained_rate else 0 end) as one_interval_retained_rate,
max(case when day_interval = 2 then retained_rate else 0 end) as two_interval_retained_rate,
max(case when day_interval = 3 then retained_rate else 0 end) as three_interval_retained_rate,
max(case when day_interval = 4 then retained_rate else 0 end) as four_interval_retained_rate,
max(case when day_interval = 5 then retained_rate else 0 end) as five_interval_retained_rate,
max(case when day_interval = 6 then retained_rate else 0 end) as six_interval_retained_rate,
current_dt
from tmp.user_retained_rate_${startDay}_${endDay}
group by register_day,current_dt;