I. Project Requirements

Compute the following metrics from the given sample data:
1. Page views (PV) per product
2. Unique visitors (UV) per product
3. Page views (PV) per product category
4. Unique visitors (UV) per product category
5. Session counts for product and category pages
6. Unique visitors (UV) per external referral channel

II. Data Description (the access.log file)

access.log holds one day of traffic from an e-commerce site; fields are separated by tab characters ("\t").
The fields are:
Column 1, time: when the user accessed the site, i.e. the timestamp of this log entry
Column 2, IP address: the client's IP address
Column 3, method: the HTTP method, GET or POST only
Column 4, URL: the URL the user is currently visiting
Column 5, HTTP protocol
Column 6, user ID: uniquely identifies a user
Column 7, referer URL: the URL the user visited previously
Column 8, status code: the response status, e.g. 200 for success
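
For illustration, a record in this layout might look like the following (every value here is hypothetical, and the whitespace between fields stands for tab characters):

2020-06-04 08:15:32	123.45.67.89	GET	https://item.jd.com/10024.html	HTTP/1.1	a1b2c3d4e5f6g7	https://www.baidu.com/s?wd=phone	200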

III. Defining the Data Structures

Start the Hadoop cluster by running start-dfs.sh and then start-yarn.sh.
Once the cluster is up, open the Hive CLI and execute the statements below.
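
Before launching Hive, it can help to confirm the daemons are actually running (a quick sanity check; the exact process list varies with your deployment):

jps
# expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager among the output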

1. Create the log table for the raw log data

create table log_data(
localtime string,     -- column 1: access time
ip string,            -- column 2: client IP
method string,        -- column 3: GET/POST
url string,           -- column 4: requested URL
http_protocol string, -- column 5: HTTP protocol
uuid string,          -- column 6: user ID
referer_url string,   -- column 7: referring URL
status_code string    -- column 8: response status
)
PARTITIONED BY (dt int)
row format delimited fields terminated by '\t';

The column order must match the tab-separated field layout described in section II.

2. Create the result tables for the required metrics

Create the fact_goods table:

create table fact_goods(
goods_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);

Create the fact_category table:

create table fact_category(
category_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);

Create the fact_source table:

create table fact_source(
source_name string,
uv int
)
PARTITIONED BY (dt int);

3. Load the user-behavior log into the log table

1) Upload access.log to /usr/local/qst/logs on the virtual machine.
2) Run the following statement to load the log into the log table:

load data local inpath '/usr/local/qst/logs/access.log' overwrite into table log_data partition(dt=20200604);
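
To confirm the load worked, a couple of quick checks (adjust the partition value if you loaded a different date):

show partitions log_data;
select count(*) from log_data where dt = 20200604;
select * from log_data where dt = 20200604 limit 3;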

IV. Implementation

1. Data analysis

Requirements 1 through 4 ask for product and category metrics. Inspecting the log shows that only two kinds of URL appear:

https://item.jd.com/<digits>.html
https://list.jd.com/list.html?cat=<digits>,<digits>,<digits>

Opening these URLs in a browser reveals the pattern:
URLs of the form https://item.jd.com/<digits>.html open product pages.
URLs of the form https://list.jd.com/list.html?cat=<digits>,<digits>,<digits> open product category pages.
We can therefore infer that the digits in the first form are the product ID, and the comma-separated digits in the second form are the category ID.
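
The queries below extract these IDs with Hive's parse_url and regexp_replace built-ins. Here is how the expressions behave on hypothetical URLs (the IDs are made up for illustration):

select parse_url('https://item.jd.com/10024.html', 'PATH');
-- returns /10024.html
select regexp_replace(regexp_replace(parse_url('https://item.jd.com/10024.html', 'PATH'), '.html', ''), '/', '');
-- returns 10024, the product ID (note '.html' is a regex; '\\.html' would be stricter)
select parse_url('https://list.jd.com/list.html?cat=9987,653,655', 'QUERY', 'cat');
-- returns 9987,653,655, the category ID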

2. Computing the product and category metrics

1) Product metrics

insert overwrite table fact_goods partition(dt=20200604)
select
regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '') as goods_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num    -- filled in by the optimized version below
from log_data
where dt = 20200604
and status_code = '200'                  -- successful requests only
and length(uuid) >= 10                   -- drop malformed user IDs
and url like 'https://item.jd.com/%'     -- product pages only
group by regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '');
2) Category metrics

insert overwrite table fact_category partition(dt=20200604)
select
parse_url(url, 'QUERY', 'cat') as category_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10
and url like 'https://list.jd.com/list.html?cat=%'    -- category pages only
group by parse_url(url, 'QUERY', 'cat');
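
A quick way to sanity-check the results is to look at the most-viewed entries:

select * from fact_goods where dt = 20200604 order by pv desc limit 10;
select * from fact_category where dt = 20200604 order by pv desc limit 10;
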
3) Optimizing the code

Computing the product and category metrics separately scans log_data twice. The multi-insert from ... insert ... form below reads the source only once and fans the rows out into both result tables; the subquery also computes a per-event stay_time so the session counts can now be filled in.

from(
select
url,
uuid,
regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '') as goods_id,
parse_url(url, 'QUERY', 'cat') as category_id,
-- (current + next) - 2 * current = seconds until this user's next event
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10
) tempTable

insert overwrite table fact_goods partition(dt=20200604)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num    -- a gap of 30+ minutes marks a session boundary
where url like 'https://item.jd.com/%'
group by goods_id

insert overwrite table fact_category partition(dt=20200604)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id;
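
The windowed sum is just a roundabout way of getting the gap between an event and the same user's next event. An equivalent and arguably clearer formulation (a sketch, not from the original text) uses lead():

select
url,
uuid,
lead(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime)) - unix_timestamp(localtime) as stay_time
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10;

For a user's last event lead() returns NULL, so if(stay_time >= 1800, 1, 0) evaluates to 0, matching the original expression, which goes negative on that row.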

3. Unique visitors (UV) per external referral channel

Here the external source is determined from the referer_url column of the log table: first make sure referer_url is non-empty and points to an off-site domain, then extract that domain as the source name.

insert overwrite table fact_source partition(dt=20200604)
select
parse_url(referer_url, 'HOST') as source_name,
count(distinct uuid) as uv
from log_data
where dt = 20200604
and referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%'    -- off-site referers only
and status_code = '200'
and length(uuid) >= 10
group by parse_url(referer_url, 'HOST');
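
For example, for a search-engine referer the HOST component is exactly the source we want (the referer value here is hypothetical):

select parse_url('https://www.baidu.com/s?wd=phone', 'HOST');
-- returns www.baidu.com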

Finally, combining this with the from ... insert ... statement above produces all three result tables in a single pass over log_data:

from(
select
url,
referer_url,
uuid,
regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '') as goods_id,
parse_url(url, 'QUERY', 'cat') as category_id,
parse_url(referer_url, 'HOST') as source,
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10
) tempTable

insert overwrite table fact_goods partition(dt=20200604)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num
where url like 'https://item.jd.com/%'
group by goods_id

insert overwrite table fact_category partition(dt=20200604)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id

insert overwrite table fact_source partition(dt=20200604)
select
source,
count(distinct uuid) as uv
where referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%'
group by source;

V. Writing the Automation Script

1. Under /usr/local/qst, create the project directory data_helper with the following subdirectories (a one-line command for this follows the list):

bin: run scripts and code
conf: configuration files
data: data files
lib: libraries
log: log files
tmp: temporary files
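
Using bash brace expansion, the whole layout can be created in one command:

mkdir -p /usr/local/qst/data_helper/{bin,conf,data,lib,log,tmp}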

2. Write the control script main.sh (in the bin folder); a sketch follows the crontab entry below.
3. Write the Hive script user_action.sql (also under bin, which holds scripts and code).
4. Schedule the job with crontab to run daily at 03:00:

0 3 * * * /usr/local/qst/data_helper/bin/main.sh
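
The article stops short of showing main.sh itself; a minimal hypothetical sketch, assuming user_action.sql takes the partition date via a Hive variable named dt, might look like:

#!/bin/bash
# Hypothetical main.sh sketch; paths and the hivevar name are assumptions.
BASE=/usr/local/qst/data_helper
DT=$(date -d "yesterday" +%Y%m%d)   # the job fires at 03:00, so process the previous day's log
hive --hivevar dt="${DT}" -f "${BASE}/bin/user_action.sql" \
  >> "${BASE}/log/user_action_${DT}.log" 2>&1

user_action.sql would then contain the multi-insert statement from section IV, with each literal 20200604 replaced by ${hivevar:dt}.
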
References:

Hive built-in functions
Hive from ... insert ... statements: http://blog.chinaunix.net/uid-30041424-id-5766715.html
Oracle window functions (Hive's window functions implement the same behavior)
UTM parameters
Linux crontab usage
crontab expression tester: https://tool.lu/crontab
What #!/bin/bash and #!/bin/sh mean, and how they differ
Fixing "/bin/bash^M: bad interpreter: No such file or directory" (caused by CRLF line endings):
sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh
sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh