Hive/Spark小练习-电子商务数据分析
背景
- 基于Hive或Spark SQL进行使用练习–我们可以用hive或spark做什么
- 某零售企业根据最近一年门店收集的数据进行数据分析展示
数据类型介绍及准备
- 环境准备
- 使用Zeppelin、Spark SQL进行编写,因为我的zeppelin上不支持hive
- 验证是否可以正常使用,如图,没问题
- 数据文件介绍
- customer table–数据可能存在瑕疵
filed | details |
customers_id | Int,1-500 |
first_name | string |
last_name | string |
email | string,such as willddy@mail.com |
gender | string,Male or female |
address | string |
country | string |
language | string |
job | string, job title/position |
credit_type | string,credit card type, such as visa |
credit_no | string,credit card number |
- Transaction Table–有些数据是重复的,但是不能随便删除,所以需要数据修复
filed | details |
transaction_id | Int, 1-1000 |
customer_id | Int, 1-500 |
store_id | Int, 1-5 |
price | decimal, such as 5.08 |
product | string, things bought |
date | string, when to purchase |
time | string, what time to purchase |
- Store Table
filed | details |
store_id | Int, 1-5 |
store_name | string |
employee_number | Int, how many employee in this store |
- Review Table–有些数据是null,有些是错误的映射
filed | details |
stransaction_id | Int, 1-8000 |
store_id | Int,1-5 |
review_store | Int,1-5 |
- 数据准备
- 下载数据–我直接放我云盘里了;检查数据
- 链接:https://pan.baidu.com/s/11fa1wf0HC_I5DEW5WBor3A
提取码:c2jy - 查了一下没问题
- 上传数据到hdfs里–如果在zeppelin里直接放就行了,如果在终端里放记得切换成zeppelin管理员
建表及数据验证
创建外部表来保存数据
%spark2.sql
create external table if not exists ext_customer_details (
customer_id string,
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/customer'
tblproperties ("skip.header.line.count"="1")
%spark2.sql
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_date string,
purchase_time string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction'
tblproperties ("skip.header.line.count"="1")
%spark2.sql
create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store' --this must tblproperties
tblproperties ("skip.header.line.count"="1")
%spark2.sql
create external table if not exists ext_store_review(
transaction_id string,
store_id string,
review_score int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review' --this must tblproperties
tblproperties ("skip.header.line.count"="1")
都进去了[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传
数据清洗
要求:
- transaction_details中的重复数据, 对重复数据生成新ID
- 过滤掉store_review中没有评分的数据
- 可以把清洗好的数据放到另一个表或者用View表示
- 找出PII(persional information identification)或PCI(personal confidential information)数据进行加密或hash
- 重新组织transaction数据按照日期YYYY-MM做分区
过程
- 对部分姓名,卡证信息进行加密,效果如下所示
- 需要进行一些数据修复,例如这些数据
- 即将完成
- 结果测试,ok
- 清理另一个表
数据分析
- 面向客户的数据分析
- 找出客户最常用的卡
select credit_type, count(distinct credit_no) as credit_cnt from vw_customer_details group by country, credit_type order by credit_cnt desc
- 找出客户数据中的前5大标题
select job, count(*) as pn from vw_customer_details group by job order by pn desc limit 5
- 对美国女性来说,她们中最流行持有什么卡
select credit_type, count(*) as ct from vw_customer_details where country = 'United States' and gender = 'Female'
- 按性别和国家进行客户统计
select count(*), country, gender from vw_customer_details group by country, gender
- 面向流水的分析
- 计算总每月、每季度、每年、每周收益(算是四个小题)
select sum(price) as revenue_mom, purchase_month from transaction_details group by purchase_month order by purchase_month
--
with base as (
select price,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
from
shopping.transaction_details
)
select
sum(price) as revenue_qoq, --quarter does not support until hive 1.3
year_quarter
from base group by year_quarter
--
select sum(price) as revenue_mom, substr(purchase_date, 1, 4) as year from transaction_details group by substr(purchase_date, 1, 4)
--
select sum(price) as revenue_wow, date_format(purchase_date, 'u') as weekday from transaction_details group by date_format(purchase_date, 'u') --1 is Mon and 7 is Sun
- 根据时间分桶计算收益、花销
with base as (
select price, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format from transaction_details
),
timeformat as (
select
purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) + cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
from base
),
timebucket as (
select
price, purchase_time, purchase_time_in_hrs,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
--select * from timebucket --this is to verify
select time_bucket, avg(price) as avg_spend, sum(price)/1000 as revenue_k from timebucket group by time_bucket
- 根据每周计算花销
select
avg(price) as avg_price,
date_format(purchase_date, 'u') as weekday
from transaction_details
where
date_format(purchase_date, 'u') is not null
group by date_format(purchase_date, 'u')
- 计算每年、月、周的交易数量
with base as (
select
transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null
)
select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year
from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month
- 找出交易最多的10个顾客
with base as (
select
customer_id,
count(distinct transaction_id) as trans_cnt,
sum(price) as spend_total
from transaction_details
where purchase_month is not null
group by customer_id
),
cust_detail as (
select
*,
concat_ws(' ', first_name, last_name) as cust_name
from
base td join vw_customer_details cd on td.customer_id = cd.customer_id
)
select
trans_cnt,
cust_name as top10_trans_cust
from cust_detail order by trans_cnt desc limit 10
- 找出花销做多的10个客户
with base as (
select
customer_id,
count(distinct transaction_id) as trans_cnt,
sum(price) as spend_total
from transaction_details
where purchase_month is not null
group by customer_id
),
cust_detail as (
select
*,
concat_ws(' ', first_name, last_name) as cust_name
from
base td join vw_customer_details cd on td.customer_id = cd.customer_id
)
select
spend_total,
cust_name as top10_trans_cust
from cust_detail order by spend_total desc limit 10
- 该交易周期里谁的交易数量最少
with base as (
select
customer_id,
count(distinct transaction_id) as trans_cnt
from
transaction_details
where purchase_month is not null
group by customer_id
)
select
*
from
base order by trans_cnt limit 10
- 计算每个季度、每周的不重复(唯一、unique customer)顾客的数量
with base as (
select
transaction_id,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null
)
select
count(distinct transaction_id) as total, year_quarter, year
from base
group by year_quarter, year
order by year_quarter
- 计算所有活动的顾客的最大平均值
with base as (
select
customer_id,
avg(price) as price_avg,
max(price) as price_max
from
transaction_details
where
purchase_month is not null
group by customer_id
)
select
max(price_avg)
from base
- 每月花销最大的是谁
with base as (
select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
from transaction_details
where purchase_month is not null group by purchase_month, customer_id
),
rank_sum as (
select
rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
purchase_month,
price_sum,
trans_cnt,
customer_id
from base
)
select
purchase_month,
'spend' as measure_name,
price_sum as measure_value,
customer_id
from rank_sum where rn_sum = 1
union all
select
purchase_month,
'visit' as measure_name,
trans_cnt as measure_value,
customer_id
from rank_sum where rn_cnt = 1
order by measure_name, purchase_month
- 根据总价格、购买频率、顾客数量找出5个最受欢迎的产品
select
product,
sum(price) as price_sum
from
transaction_details
where
purchase_month is not null
group by product
order by price_sum desc limit 5
--
select
product,
count(transaction_id) as freq_buy
from
transaction_details
where
purchase_month is not null
group by product
order by freq_buy desc limit 5
--
select
product,
count(customer_id) as freq_cust
from
transaction_details
where
purchase_month is not null
group by product
order by freq_cust desc limit 5
- 验证前5的细节
select * from transaction_details where product in ('Goat - Whole Cut')
- 面向门面的分析
- 根据-不重复顾客id和进店次数、-顾客的购买次数、-顾得的交易 找出最受欢迎的门店(算3个题)
select
sd.store_name,
count(distinct customer_id) as unique_visit
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name order by unique_visit desc
limit 5
--
select
sd.store_name,
sum(td.price) as total_revnue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name
order by total_revnue desc
limit 5
--
select
sd.store_name,
count(transaction_id) as unique_purchase
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name order by unique_purchase desc
limit 5
- 根据门店和顾客id找出最受欢迎的产品
with base as (select
store_id,
product,
count(distinct customer_id) as freq_cust
from
transaction_details
where
purchase_month is not null
group by store_id, product
),
prod_rank as (
select
store_id,
product,
freq_cust,
rank() over(partition by store_id order by freq_cust desc) as rn
from base
)
select store_name, product, freq_cust
from prod_rank td join ext_store_details sd on td.store_id = sd.store_id
where td.rn = 1
- 计算每个店的员工与顾客访问比率(应该员工数/顾客访问次数的意思)
with base as (
select
store_id,
count(distinct customer_id, purchase_date) as cust_visit
from
transaction_details
where
purchase_month is not null
group by store_id
)
select
store_name,
cust_visit,
employee_number,
round(cust_visit/employee_number,2) as cust_per_employee_within_period
from base td join ext_store_details sd on td.store_id = sd.store_id
- 计算每个店每月每年的收益
select
store_name,
purchase_month,
sum(price) as revenue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where
purchase_month is not null
group by store_name, purchase_month
- 制作每个店的总收益饼图
select
store_name,
sum(price) as revenue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where
purchase_month is not null
group by store_name
- 找出每个门最忙的时候
with base as (
select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details
where
purchase_month is not null
),
timeformat as (
select
purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) + cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id
from base
),
timebucket as (
select
transaction_id, purchase_time, purchase_time_in_hrs, store_id,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
--select * from timebucket --this is to verify
select sd.store_name, count(transaction_id) as tran_cnt, time_bucket
from timebucket td join ext_store_details sd on td.store_id = sd.store_id
group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc
-- from the result we can further divide the bucket afternnon and morning into more details
- 找出每个店的忠诚顾客
with base as (
select
store_name,
customer_id,
sum(td.price) as total_cust_purphase
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where
purchase_month is not null
group by store_name, customer_id
),
rk_cust as (
select
store_name,
customer_id,
total_cust_purphase,
rank() over(partition by store_name order by total_cust_purphase desc) as rn
from base
)
select * from rk_cust where rn <= 5
- 找出明星店的每个员工的最大收入
with base as (
select
store_id,
sum(price) as revenue
from
transaction_details
where
purchase_month is not null
group by store_id
)
select
store_name,
revenue,
employee_number,
round(revenue/employee_number,2) as revenue_per_employee_within_period
from base td
join ext_store_details sd
on td.store_id = sd.store_id
- 针对评分的分析
- 在ext_store_review中找出冲突交易
--冲突交易意味着事务表和审查表中的transaction_id、store_id对不同
select
count(*) --725/8101
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where
purchase_month is not null and td.store_id != sd.store_id
- 找出每个顾客的平均评论数
select
count(td.transaction_id) as total_trans,
sum(if(sd.transaction_id is null, 1, 0)) as total_review_missed,
sum(if(sd.transaction_id is not null, 1, 0)) as total_review_exist
from
transaction_details td left join ext_store_review sd on td.transaction_id = sd.transaction_id
where
purchase_month is not null
- 找出评论分数的顾客分布、交易分布
select
review_score,
count(distinct customer_id) as num_customer,
count(*) as num_reviews
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where
purchase_month is not null and review_score <> ''
group by review_score
- 顾客总是在光顾同一家店的时候给好评吗?
select -- not always
count(*) as visit_cnt,
customer_id,
td.store_id
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where
purchase_month is not null and review_score = '5'
group by customer_id, td.store_id order by visit_cnt desc
总结
- 认识数据问题的影响
- 确定活动的影响
- 练习sql