hive 中使用while

转载

墨染心语 2024-11-04 19:17:11

文章标签 hive 中使用while sql spark hive 数据 文章分类 Hive 大数据

Hive/Spark小练习-电子商务数据分析

背景

基于Hive或Spark SQL进行使用练习–我们可以用hive或spark做什么
某零售企业根据最近一年门店收集的数据进行数据分析展示

数据类型介绍及准备

环境准备

使用Zeppelin、Spark SQL进行编写，因为我的zeppelin上不支持hive

hive 中使用while_数据

验证是否可以正常使用，如图，没问题

hive 中使用while_hive_02

数据文件介绍

customer table–数据可能存在瑕疵

filed	details
customers_id	Int,1-500
first_name	string
last_name	string
email	string,such as willddy@mail.com
gender	string,Male or female
address	string
country	string
language	string
job	string, job title/position
credit_type	string,credit card type, such as visa
credit_no	string,credit card number

Transaction Table–有些数据是重复的，但是不能随便删除，所以需要数据修复

filed	details
transaction_id	Int, 1-1000
customer_id	Int, 1-500
store_id	Int, 1-5
price	decimal, such as 5.08
product	string, things bought
date	string, when to purchase
time	string, what time to purchase

Store Table

filed	details
store_id	Int, 1-5
store_name	string
employee_number	Int, how many employee in this store

Review Table–有些数据是null，有些是错误的映射

filed	details
stransaction_id	Int, 1-8000
store_id	Int,1-5
review_store	Int,1-5

数据准备

下载数据–我直接放我云盘里了；检查数据

链接：https://pan.baidu.com/s/11fa1wf0HC_I5DEW5WBor3A
提取码：c2jy
查了一下没问题

上传数据到hdfs里–如果在zeppelin里直接放就行了，如果在终端里放记得切换成zeppelin管理员

建表及数据验证

创建外部表来保存数据

%spark2.sql
create external table if not exists ext_customer_details (
customer_id string, 
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/customer' 
tblproperties ("skip.header.line.count"="1")

%spark2.sql
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_date string,
purchase_time string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/transaction' 
tblproperties ("skip.header.line.count"="1")

%spark2.sql
create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/store' --this must tblproperties 
tblproperties ("skip.header.line.count"="1")

%spark2.sql
create external table if not exists ext_store_review(
transaction_id string,
store_id string,
review_score int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
location '/tmp/shopping/data/review' --this must tblproperties 
tblproperties ("skip.header.line.count"="1")

都进去了[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

hive 中使用while_hive 中使用while_03

数据清洗

要求：

transaction_details中的重复数据，对重复数据生成新ID
过滤掉store_review中没有评分的数据
可以把清洗好的数据放到另一个表或者用View表示
找出PII（persional information identification）或PCI（personal confidential information）数据进行加密或hash
重新组织transaction数据按照日期YYYY-MM做分区

过程

对部分姓名，卡证信息进行加密，效果如下所示
需要进行一些数据修复，例如这些数据

即将完成
结果测试，ok
清理另一个表

数据分析

面向客户的数据分析

找出客户最常用的卡

select credit_type, count(distinct credit_no) as credit_cnt from vw_customer_details group by country, credit_type order by credit_cnt desc

hive 中使用while_数据_04

找出客户数据中的前5大标题

select job, count(*) as pn from vw_customer_details group by job order by pn desc limit 5

hive 中使用while_hive 中使用while_05

对美国女性来说，她们中最流行持有什么卡

select credit_type, count(*) as ct from vw_customer_details where country = 'United States' and gender = 'Female'

按性别和国家进行客户统计

select count(*), country, gender from vw_customer_details group by country, gender

面向流水的分析

计算总每月、每季度、每年、每周收益（算是四个小题）

select sum(price) as revenue_mom, purchase_month from transaction_details group by purchase_month order by purchase_month
--
with base as (
select price, 
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter
from 
shopping.transaction_details
)
select 
sum(price) as revenue_qoq, --quarter does not support until hive 1.3
year_quarter
from base group by year_quarter
--
select sum(price) as revenue_mom, substr(purchase_date, 1, 4) as year from transaction_details group by substr(purchase_date, 1, 4)
--
select sum(price) as revenue_wow, date_format(purchase_date, 'u') as weekday from transaction_details group by  date_format(purchase_date, 'u') --1 is Mon and 7 is Sun

hive 中使用while_spark_06

根据时间分桶计算收益、花销

with base as (
select price, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format from transaction_details
),
timeformat as (
select
purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
from base
),
timebucket as (
select
price, purchase_time, purchase_time_in_hrs,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
--select * from timebucket --this is to verify
select time_bucket, avg(price) as avg_spend, sum(price)/1000 as revenue_k from timebucket group by time_bucket

根据每周计算花销

select 
avg(price) as avg_price, 
date_format(purchase_date, 'u') as weekday
from transaction_details
where
date_format(purchase_date, 'u') is not null
group by date_format(purchase_date, 'u')

计算每年、月、周的交易数量

with base as (
select 
transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null
)
select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year 
from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month

找出交易最多的10个顾客

with base as (
select 
customer_id,
count(distinct transaction_id) as trans_cnt,
sum(price) as spend_total
from transaction_details
where purchase_month is not null
group by customer_id
),
cust_detail as (
select
*,
concat_ws(' ', first_name, last_name) as cust_name
from
base td join vw_customer_details cd on td.customer_id = cd.customer_id
)
select 
trans_cnt,
cust_name as top10_trans_cust
from cust_detail order by trans_cnt desc limit 10

找出花销做多的10个客户

with base as (
select 
customer_id,
count(distinct transaction_id) as trans_cnt,
sum(price) as spend_total
from transaction_details
where purchase_month is not null
group by customer_id
),
cust_detail as (
select
*,
concat_ws(' ', first_name, last_name) as cust_name
from
base td join vw_customer_details cd on td.customer_id = cd.customer_id
)
select 
spend_total,
cust_name as top10_trans_cust
from cust_detail order by spend_total desc limit 10

该交易周期里谁的交易数量最少

with base as (
select 
customer_id,
count(distinct transaction_id) as trans_cnt
from
transaction_details
where purchase_month is not null
group by customer_id
)
select
*
from
base order by trans_cnt limit 10

计算每个季度、每周的不重复（唯一、unique customer）顾客的数量

with base as (
select 
transaction_id,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null
)
select 
count(distinct transaction_id) as total, year_quarter, year 
from base 
group by year_quarter, year 
order by year_quarter

计算所有活动的顾客的最大平均值

with base as (
select 
customer_id,
avg(price) as price_avg,
max(price) as price_max
from
transaction_details
where 
purchase_month is not null
group by customer_id
)
select 
max(price_avg) 
from base

每月花销最大的是谁

with base as (
select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
from transaction_details
where purchase_month is not null group by purchase_month, customer_id
),
rank_sum as (
select 
rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
purchase_month,
price_sum,
trans_cnt,
customer_id
from base
)
select 
purchase_month,
'spend' as measure_name,
price_sum as measure_value,
customer_id
from rank_sum where rn_sum = 1
union all
select
purchase_month,
'visit' as measure_name,
trans_cnt as measure_value,
customer_id
from rank_sum where rn_cnt = 1 
order by measure_name, purchase_month

根据总价格、购买频率、顾客数量找出5个最受欢迎的产品

select 
product,
sum(price) as price_sum
from
transaction_details
where 
purchase_month is not null
group by product
order by price_sum desc limit 5
--
select 
product,
count(transaction_id) as freq_buy
from
transaction_details
where 
purchase_month is not null
group by product
order by freq_buy desc limit 5
--
select 
product,
count(customer_id) as freq_cust
from
transaction_details
where 
purchase_month is not null
group by product
order by freq_cust desc limit 5

验证前5的细节

select * from transaction_details where product in ('Goat - Whole Cut')

面向门面的分析

根据-不重复顾客id和进店次数、-顾客的购买次数、-顾得的交易找出最受欢迎的门店（算3个题）

select
sd.store_name,
count(distinct customer_id) as unique_visit 
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name order by unique_visit desc
limit 5
--
select
sd.store_name,
sum(td.price) as total_revnue 
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name 
order by total_revnue desc
limit 5
--
select
sd.store_name,
count(transaction_id) as unique_purchase
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
group by store_name order by unique_purchase desc
limit 5

根据门店和顾客id找出最受欢迎的产品

with base as (select 
store_id,
product,
count(distinct customer_id) as freq_cust
from
transaction_details
where 
purchase_month is not null
group by store_id, product
),
prod_rank as (
select
store_id,
product,
freq_cust,
rank() over(partition by store_id order by freq_cust desc) as rn
from base
)
select store_name, product, freq_cust
from prod_rank td join ext_store_details sd on td.store_id = sd.store_id
where td.rn = 1

计算每个店的员工与顾客访问比率（应该员工数/顾客访问次数的意思）

with base as (
select 
store_id,
count(distinct customer_id, purchase_date) as cust_visit
from
transaction_details
where 
purchase_month is not null
group by store_id
)
select
store_name,
cust_visit,
employee_number,
round(cust_visit/employee_number,2) as cust_per_employee_within_period
from base td join ext_store_details sd on td.store_id = sd.store_id

计算每个店每月每年的收益

select 
store_name,
purchase_month,
sum(price) as revenue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where 
purchase_month is not null
group by store_name, purchase_month

制作每个店的总收益饼图

select 
store_name,
sum(price) as revenue
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where 
purchase_month is not null
group by store_name

找出每个门最忙的时候

with base as (
select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details
where 
purchase_month is not null
),
timeformat as (
select
purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) +  cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id
from base
),
timebucket as (
select
transaction_id, purchase_time, purchase_time_in_hrs, store_id,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
--select * from timebucket --this is to verify
select sd.store_name, count(transaction_id) as tran_cnt, time_bucket 
from timebucket td join ext_store_details sd on td.store_id = sd.store_id
group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc
-- from the result we can further divide the bucket afternnon and morning into more details

找出每个店的忠诚顾客

with base as (
select
store_name,
customer_id,
sum(td.price) as total_cust_purphase
from
transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where 
purchase_month is not null
group by store_name, customer_id 
),
rk_cust as (
select 
store_name,
customer_id,
total_cust_purphase,
rank() over(partition by store_name order by total_cust_purphase desc) as rn
from base
)
select * from rk_cust where rn <= 5

找出明星店的每个员工的最大收入

with base as (
select 
store_id,
sum(price) as revenue
from
transaction_details
where 
purchase_month is not null
group by store_id
)
select
store_name,
revenue,
employee_number,
round(revenue/employee_number,2) as revenue_per_employee_within_period
from base td
join ext_store_details sd
on td.store_id = sd.store_id

针对评分的分析

在ext_store_review中找出冲突交易

--冲突交易意味着事务表和审查表中的transaction_id、store_id对不同
select
count(*) --725/8101
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null and td.store_id != sd.store_id

找出每个顾客的平均评论数

select
count(td.transaction_id) as total_trans,
sum(if(sd.transaction_id is null, 1, 0)) as total_review_missed,
sum(if(sd.transaction_id is not null, 1, 0)) as total_review_exist
from
transaction_details td left join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null

找出评论分数的顾客分布、交易分布

select
review_score,
count(distinct customer_id) as num_customer,
count(*) as num_reviews
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null and review_score <> ''
group by review_score

顾客总是在光顾同一家店的时候给好评吗？

select -- not always
count(*) as visit_cnt,
customer_id,
td.store_id
from
transaction_details td join ext_store_review sd on td.transaction_id = sd.transaction_id
where 
purchase_month is not null and review_score = '5'
group by customer_id, td.store_id order by visit_cnt desc