I. Project Requirements

Compute the following metrics from the given sample data:
1. Page views (PV) per product
2. Unique visitors (UV) per product
3. Page views (PV) per product category
4. Unique visitors (UV) per product category
5. Session counts for product and category pages
6. Unique visitors (UV) per external referral channel

II. Data Description (the access.log file)

access.log holds one day of traffic from an e-commerce site; fields are separated by tab characters ("\t").
The fields are:
Column 1, time: when the user accessed the site, i.e. the timestamp of this log entry
Column 2, IP address: the client's IP address
Column 3, method: the HTTP method, GET or POST only
Column 4, URL: the URL the user is currently visiting
Column 5, HTTP protocol
Column 6, user ID: uniquely identifies a user
Column 7, referer URL: the URL the user visited previously
Column 8, status code: the response status, e.g. 200 for success
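
For illustration, a record in this layout might look like the following (every value here is hypothetical, and the whitespace between fields stands for tab characters):

2020-06-04 08:15:32	123.45.67.89	GET	https://item.jd.com/10024.html	HTTP/1.1	a1b2c3d4e5f6g7	https://www.baidu.com/s?wd=phone	200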

III. Defining the Data Structures

Start the Hadoop cluster by running start-dfs.sh and then start-yarn.sh.
Once the cluster is up, open the Hive CLI and execute the statements below.
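
Before launching Hive, it can help to confirm the daemons are actually running (a quick sanity check; the exact process list varies with your deployment):

jps
# expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager among the output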

1. Create the log table for the raw log data

create table log_data(
localtime string,     -- column 1: access time
ip string,            -- column 2: client IP
method string,        -- column 3: GET/POST
url string,           -- column 4: requested URL
http_protocol string, -- column 5: HTTP protocol
uuid string,          -- column 6: user ID
referer_url string,   -- column 7: referring URL
status_code string    -- column 8: response status
)
PARTITIONED BY (dt int)
row format delimited fields terminated by '\t';

The column order must match the tab-separated field layout described in section II.

2. Create the result tables for the required metrics

Create the fact_goods table:

create table fact_goods(
goods_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);

Create the fact_category table:

create table fact_category(
category_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);

Create the fact_source table:

create table fact_source(
source_name string,
uv int
)
PARTITIONED BY (dt int);

3. Load the user-behavior log into the log table

1) Upload access.log to /usr/local/qst/logs on the virtual machine.
2) Run the following statement to load the log into the log table:

load data local inpath '/usr/local/qst/logs/access.log' overwrite into table log_data partition(dt=20200604);
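
To confirm the load worked, a couple of quick checks (adjust the partition value if you loaded a different date):

show partitions log_data;
select count(*) from log_data where dt = 20200604;
select * from log_data where dt = 20200604 limit 3;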

IV. Implementation

1. Data analysis

Requirements 1 through 4 ask for product and category metrics. Inspecting the log shows that only two kinds of URL appear:

https://item.jd.com/<digits>.html
https://list.jd.com/list.html?cat=<digits>,<digits>,<digits>

Opening these URLs in a browser reveals the pattern:
URLs of the form https://item.jd.com/<digits>.html open product pages.
URLs of the form https://list.jd.com/list.html?cat=<digits>,<digits>,<digits> open product category pages.
We can therefore infer that the digits in the first form are the product ID, and the comma-separated digits in the second form are the category ID.
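
The queries below extract these IDs with Hive's parse_url and regexp_replace built-ins. Here is how the expressions behave on hypothetical URLs (the IDs are made up for illustration):

select parse_url('https://item.jd.com/10024.html', 'PATH');
-- returns /10024.html
select regexp_replace(regexp_replace(parse_url('https://item.jd.com/10024.html', 'PATH'), '.html', ''), '/', '');
-- returns 10024, the product ID (note '.html' is a regex; '\\.html' would be stricter)
select parse_url('https://list.jd.com/list.html?cat=9987,653,655', 'QUERY', 'cat');
-- returns 9987,653,655, the category ID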

2. Computing the product and category metrics

1) Product metrics

insert overwrite table fact_goods partition(dt=20200604)
select
regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '') as goods_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num    -- filled in by the optimized version below
from log_data
where dt = 20200604
and status_code = '200'                  -- successful requests only
and length(uuid) >= 10                   -- drop malformed user IDs
and url like 'https://item.jd.com/%'     -- product pages only
group by regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '');
2) Category metrics

insert overwrite table fact_category partition(dt=20200604)
select
parse_url(url, 'QUERY', 'cat') as category_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10
and url like 'https://list.jd.com/list.html?cat=%'    -- category pages only
group by parse_url(url, 'QUERY', 'cat');
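
A quick way to sanity-check the results is to look at the most-viewed entries:

select * from fact_goods where dt = 20200604 order by pv desc limit 10;
select * from fact_category where dt = 20200604 order by pv desc limit 10;
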
3) Optimizing the code

Computing the product and category metrics separately scans log_data twice. The multi-insert from ... insert ... form below reads the source only once and fans the rows out into both result tables; the subquery also computes a per-event stay_time so the session counts can now be filled in.

from(
select
url,
uuid,
regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '') as goods_id,
parse_url(url, 'QUERY', 'cat') as category_id,
-- (current + next) - 2 * current = seconds until this user's next event
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10
) tempTable

insert overwrite table fact_goods partition(dt=20200604)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num    -- a gap of 30+ minutes marks a session boundary
where url like 'https://item.jd.com/%'
group by goods_id

insert overwrite table fact_category partition(dt=20200604)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id;
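
The windowed sum is just a roundabout way of getting the gap between an event and the same user's next event. An equivalent and arguably clearer formulation (a sketch, not from the original text) uses lead():

select
url,
uuid,
lead(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime)) - unix_timestamp(localtime) as stay_time
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10;

For a user's last event lead() returns NULL, so if(stay_time >= 1800, 1, 0) evaluates to 0, matching the original expression, which goes negative on that row.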

3. Unique visitors (UV) per external referral channel

Here the external source is determined from the referer_url column of the log table: first make sure referer_url is non-empty and points to an off-site domain, then extract that domain as the source name.

insert overwrite table fact_source partition(dt=20200604)
select
parse_url(referer_url, 'HOST') as source_name,
count(distinct uuid) as uv
from log_data
where dt = 20200604
and referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%'    -- off-site referers only
and status_code = '200'
and length(uuid) >= 10
group by parse_url(referer_url, 'HOST');
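
For example, for a search-engine referer the HOST component is exactly the source we want (the referer value here is hypothetical):

select parse_url('https://www.baidu.com/s?wd=phone', 'HOST');
-- returns www.baidu.com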

Finally, combining this with the from ... insert ... statement above produces all three result tables in a single pass over log_data:

from(
select
url,
referer_url,
uuid,
regexp_replace(regexp_replace(parse_url(url, 'PATH'), '.html', ''), '/', '') as goods_id,
parse_url(url, 'QUERY', 'cat') as category_id,
parse_url(referer_url, 'HOST') as source,
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code = '200'
and length(uuid) >= 10
) tempTable

insert overwrite table fact_goods partition(dt=20200604)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num
where url like 'https://item.jd.com/%'
group by goods_id

insert overwrite table fact_category partition(dt=20200604)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800, 1, 0)) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id

insert overwrite table fact_source partition(dt=20200604)
select
source,
count(distinct uuid) as uv
where referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%'
group by source;

V. Writing the Automation Script

1. Under /usr/local/qst, create the project directory data_helper with the following subdirectories (a one-line command for this follows the list):

bin: run scripts and code
conf: configuration files
data: data files
lib: libraries
log: log files
tmp: temporary files
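
Using bash brace expansion, the whole layout can be created in one command:

mkdir -p /usr/local/qst/data_helper/{bin,conf,data,lib,log,tmp}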

2. Write the control script main.sh (in the bin folder); a sketch follows the crontab entry below.
3. Write the Hive script user_action.sql (also under bin, which holds scripts and code).
4. Schedule the job with crontab to run daily at 03:00:

0 3 * * * /usr/local/qst/data_helper/bin/main.sh
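
The article stops short of showing main.sh itself; a minimal hypothetical sketch, assuming user_action.sql takes the partition date via a Hive variable named dt, might look like:

#!/bin/bash
# Hypothetical main.sh sketch; paths and the hivevar name are assumptions.
BASE=/usr/local/qst/data_helper
DT=$(date -d "yesterday" +%Y%m%d)   # the job fires at 03:00, so process the previous day's log
hive --hivevar dt="${DT}" -f "${BASE}/bin/user_action.sql" \
  >> "${BASE}/log/user_action_${DT}.log" 2>&1

user_action.sql would then contain the multi-insert statement from section IV, with each literal 20200604 replaced by ${hivevar:dt}.
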
References:

Hive built-in functions
Hive from ... insert ... statements: http://blog.chinaunix.net/uid-30041424-id-5766715.html
Oracle window functions (Hive's window functions implement the same behavior)
UTM parameters
Linux crontab usage
crontab expression tester: https://tool.lu/crontab
What #!/bin/bash and #!/bin/sh mean, and how they differ
Fixing "/bin/bash^M: bad interpreter: No such file or directory" (caused by CRLF line endings):
sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh
sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh