Hive/Spark小练习-电子商务数据分析

背景

  1. 基于Hive或Spark SQL进行使用练习–我们可以用hive或spark做什么
  2. 某零售企业根据最近一年门店收集的数据进行数据分析展示

数据类型介绍及准备

  1. 环境准备
  1. 使用Zeppelin、Spark SQL进行编写,因为我的zeppelin上不支持hive
  2. hive 中使用while_数据

  3. 验证是否可以正常使用,如图,没问题
  4. hive 中使用while_hive_02

  1. 数据文件介绍
  1. customer table–数据可能存在瑕疵

filed

details

customers_id

Int,1-500

first_name

string

last_name

string

email

string,such as willddy@mail.com

gender

string,Male or female

address

string

country

string

language

string

job

string, job title/position

credit_type

string,credit card type, such as visa

credit_no

string,credit card number

  1. Transaction Table–有些数据是重复的,但是不能随便删除,所以需要数据修复

filed

details

transaction_id

Int, 1-1000

customer_id

Int, 1-500

store_id

Int, 1-5

price

decimal, such as 5.08

product

string, things bought

date

string, when to purchase

time

string, what time to purchase

  1. Store Table

filed

details

store_id

Int, 1-5

store_name

string

employee_number

Int, how many employee in this store

  1. Review Table–有些数据是null,有些是错误的映射

filed

details

stransaction_id

Int, 1-8000

store_id

Int,1-5

review_store

Int,1-5

  1. 数据准备
  1. 下载数据–我直接放我云盘里了;检查数据
  1. 链接:https://pan.baidu.com/s/11fa1wf0HC_I5DEW5WBor3A
    提取码:c2jy
  2. 查了一下没问题
  1. 上传数据到hdfs里–如果在zeppelin里直接放就行了,如果在终端里放记得切换成zeppelin管理员

建表及数据验证

创建外部表来保存数据

%spark2.sql
create external table if not exists ext_customer_details (
customer_id string, 
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/customer' 
tblproperties ("skip.header.line.count"="1")
%spark2.sql
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_date string,
purchase_time string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction' 
tblproperties ("skip.header.line.count"="1")
%spark2.sql
create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store' --this must tblproperties 
tblproperties ("skip.header.line.count"="1")
%spark2.sql
create external table if not exists ext_store_review(
transaction_id string,
store_id string,
review_score int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review' --this must tblproperties 
tblproperties ("skip.header.line.count"="1")

都进去了[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

hive 中使用while_hive 中使用while_03

数据清洗

要求:

  1. transaction_details中的重复数据, 对重复数据生成新ID
  2. 过滤掉store_review中没有评分的数据
  3. 可以把清洗好的数据放到另一个表或者用View表示
  4. 找出PII(persional information identification)或PCI(personal confidential information)数据进行加密或hash
  5. 重新组织transaction数据按照日期YYYY-MM做分区

过程

  1. 对部分姓名,卡证信息进行加密,效果如下所示
  2. 需要进行一些数据修复,例如这些数据
  1. 即将完成
  2. 结果测试,ok
  3. 清理另一个表

数据分析

  1. 面向客户的数据分析
  1. 找出客户最常用的卡
select credit_type, count(distinct credit_no) as credit_cnt from vw_customer_details group by country, credit_type order by credit_cnt desc

hive 中使用while_数据_04

  1. 找出客户数据中的前5大标题
select job, count(*) as pn from vw_customer_details group by job order by pn desc limit 5

hive 中使用while_hive 中使用while_05

  1. 对美国女性来说,她们中最流行持有什么卡
select credit_type, count(*) as ct from vw_customer_details where country = 'United States' and gender = 'Female'
  1. 按性别和国家进行客户统计
select count(*), country, gender from vw_customer_details group by country, gender
  1. 面向流水的分析
  1. 计算总每月、每季度、每年、每周收益(算是四个小题)
select sum(price) as revenue_mom, purchase_month from transaction_details group by purchase_month order by purchase_month
--
with base as (
select price, 
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
from 
shopping.transaction_details
)
select 
sum(price) as revenue_qoq, --quarter does not support until hive 1.3
year_quarter
from base group by year_quarter
--
select sum(price) as revenue_mom, substr(purchase_date, 1, 4) as year from transaction_details group by substr(purchase_date, 1, 4)
--
select sum(price) as revenue_wow, date_format(purchase_date, 'u') as weekday from transaction_details group by  date_format(purchase_date, 'u') --1 is Mon and 7 is Sun

hive 中使用while_spark_06

  1. 根据时间分桶计算收益、花销
with base as (
select price, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format from transaction_details
),
timeformat as (
select
purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
from base
),
timebucket as (
select
price, purchase_time, purchase_time_in_hrs,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
--select * from timebucket --this is to verify
select time_bucket, avg(price) as avg_spend, sum(price)/1000 as revenue_k from timebucket group by time_bucket
  1. 根据每周计算花销
select 
avg(price) as avg_price, 
date_format(purchase_date, 'u') as weekday
from transaction_details
where
date_format(purchase_date, 'u') is not null
group by date_format(purchase_date, 'u')
  1. 计算每年、月、周的交易数量
with base as (
select 
transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null
)
select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year 
from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month
  1. 找出交易最多的10个顾客
with base as (
select 
customer_id,
count(distinct transaction_id) as trans_cnt,
sum(price) as spend_total
from transaction_details
where purchase_month is not null
group by customer_id
),
cust_detail as (
select
*,
concat_ws(' ', first_name, last_name) as cust_name
from
base td join vw_customer_details cd on td.customer_id = cd.customer_id
)
select 
trans_cnt,
cust_name as top10_trans_cust
from cust_detail order by trans_cnt desc limit 10
  1. 找出花销做多的10个客户
with base as (
select 
customer_id,
count(distinct transaction_id) as trans_cnt,
sum(price) as spend_total
from transaction_details
where purchase_month is not null
group by customer_id
),
cust_detail as (
select
*,
concat_ws(' ', first_name, last_name) as cust_name
from
base td join vw_customer_details cd on td.customer_id = cd.customer_id
)
select 
spend_total,
cust_name as top10_trans_cust
from cust_detail order by spend_total desc limit 10
  1. 该交易周期里谁的交易数量最少
with base as (
select 
customer_id,
count(distinct transaction_id) as trans_cnt
from
transaction_details
where purchase_month is not null
group by customer_id
)
select
*
from
base order by trans_cnt limit 10
  1. 计算每个季度、每周的不重复(唯一、unique customer)顾客的数量
with base as (
select 
transaction_id,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null
)
select 
count(distinct transaction_id) as total, year_quarter, year 
from base 
group by year_quarter, year 
order by year_quarter
  1. 计算所有活动的顾客的最大平均值
with base as (
select 
customer_id,
avg(price) as price_avg,
max(price) as price_max
from
transaction_details
where 
purchase_month is not null
group by customer_id
)
select 
max(price_avg) 
from base
  1. 每月花销最大的是谁
with base as (
select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
from transaction_details
where purchase_month is not null group by purchase_month, customer_id
),
rank_sum as (
select 
rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
purchase_month,
price_sum,
trans_cnt,
customer_id
from base
)
select 
purchase_month,
'spend' as measure_name,
price_sum as measure_value,
customer_id
from rank_sum where rn_sum = 1
union all
select
purchase_month,
'visit' as measure_name,
trans_cnt as measure_value,
customer_id
from rank_sum where rn_cnt = 1 
order by measure_name, purchase_month
  1. 根据总价格、购买频率、顾客数量找出5个最受欢迎的产品
select 
product,
sum(price) as price_sum
from
transaction_details
where 
purchase_month is not null
group by product
order by price_sum desc limit 5
--
select 
product,
count(transaction_id) as freq_buy
from
transaction_details
where 
purchase_month is not null
group by product
order by freq_buy desc limit 5
--
select 
product,
count(customer_id) as freq_cust
from
transaction_details
where 
purchase_month is not null
group by product
order by freq_cust desc limit 5
  1. 验证前5的细节
select * from transaction_details where product in ('Goat - Whole Cut')
  1. 面向门面的分析
  1. 根据-不重复顾客id和进店次数、-顾客的购买次数、-顾得的交易 找出最受欢迎的门店(算3个题)
select
sd.store_name,
count(distinct customer_id) as unique_visit 
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name order by unique_visit desc
limit 5
--
select
sd.store_name,
sum(td.price) as total_revnue 
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name 
order by total_revnue desc
limit 5
--
select
sd.store_name,
count(transaction_id) as unique_purchase
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name order by unique_purchase desc
limit 5
  1. 根据门店和顾客id找出最受欢迎的产品
with base as (select 
store_id,
product,
count(distinct customer_id) as freq_cust
from
transaction_details
where 
purchase_month is not null
group by store_id, product
),
prod_rank as (
select
store_id,
product,
freq_cust,
rank() over(partition by store_id order by freq_cust desc) as rn
from base
)
select store_name, product, freq_cust
from prod_rank td join ext_store_details sd on td.store_id = sd.store_id
where td.rn = 1
  1. 计算每个店的员工与顾客访问比率(应该员工数/顾客访问次数的意思)
with base as (
select 
store_id,
count(distinct customer_id, purchase_date) as cust_visit
from
transaction_details
where 
purchase_month is not null
group by store_id
)
select
store_name,
cust_visit,
employee_number,
round(cust_visit/employee_number,2) as cust_per_employee_within_period
from base td join ext_store_details sd on td.store_id = sd.store_id
  1. 计算每个店每月每年的收益
select 
store_name,
purchase_month,
sum(price) as revenue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where 
purchase_month is not null
group by store_name, purchase_month
  1. 制作每个店的总收益饼图
select 
store_name,
sum(price) as revenue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where 
purchase_month is not null
group by store_name
  1. 找出每个门最忙的时候
with base as (
select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details
where 
purchase_month is not null
),
timeformat as (
select
purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id
from base
),
timebucket as (
select
transaction_id, purchase_time, purchase_time_in_hrs, store_id,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
--select * from timebucket --this is to verify
select sd.store_name, count(transaction_id) as tran_cnt, time_bucket 
from timebucket td join ext_store_details sd on td.store_id = sd.store_id
group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc
-- from the result we can further divide the bucket afternnon and morning into more details
  1. 找出每个店的忠诚顾客
with base as (
select
store_name,
customer_id,
sum(td.price) as total_cust_purphase
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where 
purchase_month is not null
group by store_name, customer_id 
),
rk_cust as (
select 
store_name,
customer_id,
total_cust_purphase,
rank() over(partition by store_name order by total_cust_purphase desc) as rn
from base
)
select * from rk_cust where rn <= 5
  1. 找出明星店的每个员工的最大收入
with base as (
select 
store_id,
sum(price) as revenue
from
transaction_details
where 
purchase_month is not null
group by store_id
)
select
store_name,
revenue,
employee_number,
round(revenue/employee_number,2) as revenue_per_employee_within_period
from base td
join ext_store_details sd
on td.store_id = sd.store_id
  1. 针对评分的分析
  1. 在ext_store_review中找出冲突交易
--冲突交易意味着事务表和审查表中的transaction_id、store_id对不同
select
count(*) --725/8101
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null and td.store_id != sd.store_id
  1. 找出每个顾客的平均评论数
select
count(td.transaction_id) as total_trans,
sum(if(sd.transaction_id is null, 1, 0)) as total_review_missed,
sum(if(sd.transaction_id is not null, 1, 0)) as total_review_exist
from
transaction_details td left join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null
  1. 找出评论分数的顾客分布、交易分布
select
review_score,
count(distinct customer_id) as num_customer,
count(*) as num_reviews
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null and review_score <> ''
group by review_score
  1. 顾客总是在光顾同一家店的时候给好评吗?
select -- not always
count(*) as visit_cnt,
customer_id,
td.store_id
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null and review_score = '5'
group by customer_id, td.store_id order by visit_cnt desc

总结

  1. 认识数据问题的影响
  2. 确定活动的影响
  3. 练习sql