1.order by,sort by,distribute by,cluster by的区别?
答:
(1)order by:
order by会对所给的全部数据进行全局排序,并且,不管有多少数据,都只启动一个reducer来处理。
注意:如果指定了hive.mapred.mode=strict(默认值是nonstrict),这时就必须指定limit来限制输出条数,原因是:所有的数据都会在同一个reducer端进行,数据量大的情况下可能不能出结果,那么在这样的严格模式下,必须指定输出的条数。
(2)sort by:
sort by是局部排序。根据数据量的大小启动一到多个reducer来干活,并且,它会在进入reduce之前为每个reducer都产生一个排序文件(此时,并不能保证所有的数据是有序的 )。这样的好处是提高了全局排序的效率。
(3)distribute by:
distribute by的功能是:distribute by 控制map结果的分发,它会将具有相同字段的map输出分发到一个reduce节点上做处理。即:控制某个特定行到某个reducer中,为后续可能发生的聚集操作做准备。
注意:distribute by必须要写在sort by之前。
(4)cluster by:
当distribute by和sort by字段相同时,可以使用cluster by方式,即cluster by的功能就是distribute by和sort by相结合 。
注意:被cluster by指定的列只能是降序,不能指定asc和desc。
2.聚合函数是否可以写在order by后面,为什么?
答:HIve只能用聚合函数的别名排序,不可以用聚合函数的表达式排序 ,order by的执行顺序在select之后,所以需使用重新定义的列名进行排序。;
Mysql可以用聚合函数的别名排序,也可以用聚合函数的表达式排序。
主要原因,执行顺序的问题:
(1)from
(2)join
(3)on
(4)where
(5)select
(6)group by
(7)having
(8)order by
(9)limit
3.有以下数据
10001 100 2019-03-01
10002 200 2019-03-02
10003 300 2019-03-03
10004 400 2019-04-01
10005 500 2019-04-02
10006 600 2019-04-03
10007 700 2019-05-01
10008 800 2019-05-02
10009 900 2019-05-03
10010 910 2019-06-01
10011 920 2019-06-02
10012 930 2019-06-03
3.1 把以上数据利用动态分区导入到分区表中(按照年、月进行多级分区)
答:
(1)创建普通表与分区表:
create table t_price(
id int,
price int,
date_time string
)row format delimited fields terminated by '\t';
create table t_price_partition(
id int,
price int,
)partitioned BY(year string, month string)
row format delimited fields terminated by '\t';
(2)导入数据到普通表
load data local inpath '/opt/bigdata/data/t_price.txt' overwrite into table t_price;
(3)动态加载数据到分区表中
hive> set hive.exec.dynamic.partition=true; //使用动态分区
hive> set hive.exec.dynamic.partition.mode=nonstrict; //非严格模式
insert into table t_price_partition partition(year, month) select order_number,order_price,split(order_time,'-')[0],split(order_time,'-')[1] from t_price;
4.根据user_trade(用户购买明细)数据创建出合理的表结构,导入数据
数据(其中一个分区数据dt='2017-01-01'):
Allison 4 688.8 2755.2 shoes 1483822729
Francis 83 1.1 91.3 food 1483788170
创建user_trade表
create table if not exists user_trade (
user_name string,
piece int,
price double,
pay_amount double,
goods_category string,
pay_time bigint)
partitioned by (dt string)
row format delimited fields terminated by '\t';
先设置动态分区
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;
上传数据到hdfs
hdfs dfs -put /home/hadoop/user_trade/* /user/hive_remote/warehouse/db_hive.db/user_trade
进行表修复
msck repair table + 表名
查看分区
show partitions + 表名;
4.1 查出2018年一月到四月,每个品类有多少人购买,累计金额是多少?
select goods_category,
count(distinct user_name) as user_num,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-01-01' and '2018-04-30'
group by goods_category;
4.2 查出2018年4月,支付金额超过五万元的用户
select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name having sum(pay_amount)>50000;
//方案2
select user_name,total_amount
from (select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name) t
where t.total_amount>50000;
4.3 查出2018年4月,支付金额最多的top5用户
select user_name,
sum(pay_amount) as total_amount
from user_trade
where dt between '2018-04-01' and '2018-04-30'
group by user_name
order by total_amount desc limit 5;
5.根据user_info(用户信息)数据创建出合理的表结构,导入数据
数据:
10001 Abby female 38 hangzhou 2018-04-13 01:06:07 2 {"systemtype": "android", "education": "doctor", "marriage_status": "1", "phonebrand": "VIVO"} systemtype:android,education:doctor,marriage_status:1,phonebrand:VIVO
10002 Ailsa female 42 shenzhen 2018-06-03 20:30:03 2 {"systemtype": "android", "education": "bachelor", "marriage_status": "0", "phonebrand": "YIJIA"} systemtype:android,education:bachelor,marriage_status:0,phonebrand:YIJIA
10003 Alice female 16 changchun 2018-12-04 07:34:27 3 {"systemtype": "ios", "education": "bachelor", "marriage_status": "1", "phonebrand": "iphone7"} systemtype:ios,education:bachelor,marriage_status:1,phonebrand:iphone7
创建user_info表:
create table if not exists user_info (
user_id string,
user_name string,
sex string,
age int,
city string,
firstactivetime string,
level int,
extra1 string,
extra2 map<string,string>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
加载数据:
load data local inpath '/home/hadoop/user_info/user_info.txt' into table user_info;
5.1 用户的首次激活时间,与2019年5月1日的日期间隔
select user_name,
datediff('2019-05-01', to_date(firstactivetime))
from user_info
limit 10;
注释:datediff(string enddate,string startdate):结束日期减去开始日期的天数
5.2 统计一下四个年龄段20岁以下,20-30岁,30-40岁,40岁以上的用户数
select
case when age<20 then '20岁以下'
when age>=20 and age <30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else '40岁以上' end,
count(distinct user_id) as user_num
from user_info
group by
case when age<20 then '20岁以下'
when age>=20 and age <30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else '40岁以上' end;
5.3 统计每个性别用户等级高低的分布情况(level大于5为高级)
select sex,
if(level>5,'高','低'),
count(distinct user_id) as user_num
from user_info
group by sex,
if(level>5,'高','低');
5.4 统计每个月新激活的用户数
select substr(firstactivetime,1,7) as month,
count(distinct user_id) as user_num
from user_info
group by substr(firstactivetime,1,7);
5.5 统计不同手机品牌的用户数
select extra2['phonebrand'] as phone_brand,
count(distinct user_id) as user_num
from user_info
group by extra2['phonebrand'];
#使用函数get_json_object将string转化为json
select get_json_object(extra1,'$.phonebrand') as phone_brand,
count(distinct user_id) as user_num
from user_info
group by get_json_object(extra1,'$.phonebrand');
6.现在我们已经有了两张表,我们再根据user_refund(退款信息)创建出合理的表结构,导入数据
数据(其中一个分区数据dt='2017-01-13'):
Carroll 43 8600.0 2017-01-13 02:27:59
DEMI 26 57772.0 2017-01-13 11:35:04
创建user_refund表:
create table if not exists user_refund(
user_name string,
refund_piece int,
refund_amount double,
refund_time string)
partitioned by (dt string)
row format delimited fields terminated by '\t';
先设置动态分区
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;
上传数据到hdfs
load data local inpath '/home/hadoop/user_info/user_refund.txt' into table user_refund;
进行表修复
msck repair table + 表名
查看分区
show partitions + 表名;
6.1 在2018年购买后又退款的用户
select a.user_name
from
(select distinct user_name
from user_trade
where year(dt)=2018
) a
join
(select distinct user_name
from user_refund
where year(dt)=2018) b
on a.user_name = b.user_name;
6.2 在2018年购买,但是没有退款的用户
select a.user_name
from (
select distinct user_name
from user_trade
where year(dt) = 2018
) a
left join
(select distinct user_name
from user_refund
where year(dt)=2018)b
on a.user_name = b.user_name
where b.user_name is null;
6.3 在2017年和2018年都购买的用户
select a.user_name
from (select distinct user_name
from user_trade
where year(dt)=2017) a
join
(select distinct user_name
from user_trade
where year(dt)=2018) b
on a.user_name=b.user_name;
6.4 在2018年购买用户的学历分布在
select b.education,
count(a.user_name)
from (select distinct user_name
from user_trade
where year(dt)=2018) a
join
(select user_name,
get_json_object(extra1, '$.education') as education
from user_info) b
on a.user_name = b.user_name
group by education;
6.5 2018年每个用户的支付和退款金额汇总
select a.user_name,
sum(a.pay_amount) as pay_amount,
sum(a.refund_amount) as refund_amount
from
(select user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
from user_trade
where year(dt)=2018
group by user_name
union all
select user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2018
group by user_name) a
group by a.user_name;
####方案二
select coalesce(a.user_name,b.user_name),
a.pay_amount,
b.refund_amount
from
(
select user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
from user_trade
where year(dt)=2018
group by user_name
)a
full join
(
select user_name,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2018
group by user_name
)b on a.user_name=b.user_name;
6.6 首次激活时间在2017年,但是一直没有支付的用户年龄段分布
select a.age_level,
count(a.user_name)
from
(select user_name,
case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end as age_level
from user_info
where year(firstactivetime)=2017) a
left join
(select distinct user_name
from user_trade
where dt >= 2017) b
on a.user_name = b.user_name
where b.user_name is null
group by age_level;
7.sql test:已经创建了三张表(用户信息表,交易明细表,退款表)
7.1 统计2018年每月的支付总额和当年累计支付总额
select a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month) //逐月累加
from
(select month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt)=2018
group by month(dt)) a;
7.2 统计2017-2018每月的支付总额和当年累积支付总额
select a.year,
a.month,
a.pay_amount,
sum(a.pay_amount) over(partition by a.year order by a.month)
from
(select year(dt) year,
month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt) in (2017,2018)
group by year(dt),
month(dt))a;
7.3 统计2018年1月,用户购买商品品类数量的排名
select user_name,
count(distinct goods_category),
row_number() over(order by count(distinct goods_category)),
rank() over(order by count(distinct goods_category)),
dense_rank() over(order by count(distinct goods_category))
from user_trade
where substr(dt,1,7)='2018-01'
group by user_name;
7.4 选出2019年支付金额排名在第10,20,30的用户
select a.user_name,
a.pay_amount,
a.rank
from
(select user_name,
sum(pay_amount) pay_amount,
rank() over(order by sum(pay_amount) desc) rank
from user_trade
where year(dt)=2019
group by user_name)a
where a.rank in (10,20,30);
7.5 选出2018年退款金额排名前10%的用户
select a.user_name,
a.refund_amount,
a.level
from
(select user_name,
sum(refund_amount) refund_amount,
ntile(10) over(order by sum(refund_amount) desc) level
from user_refund
where year(dt)=2018
group by user_name)a
where a.level=1;
7.6 支付时间间隔超过100天的用户数
select count(distinct user_name)
from
(select user_name,
dt,
lead(dt) over(partition by user_name order by dt) lead_dt
from user_trade
where dt>'0')a
where a.lead_dt is not null
and datediff(a.lead_dt,a.dt)>100;
7.7 每个城市,不同性别,2018年支付金额最高的TOP3用户
select c.user_name,
c.city,
c.sex,
c.pay_amount,
c.rank
from
(select a.user_name,
b.city,
b.sex,
a.pay_amount,
row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank
from
(select user_name,
sum(pay_amount) pay_amount
from user_trade
where year(dt)=2018
group by user_name)a
left join user_info b on a.user_name=b.user_name)c
where c.rank<=3;
7.8 每个手机品牌退款金额前25%的用户
select *
from
(select a.user_name,
extra2['phonebrand'] as phonebrand,
a.refund_amount,
ntile(4) over(partition by extra2['phonebrand'] order by a.refund_amount desc) level
from
(select user_name,
sum(refund_amount) refund_amount
from user_refund
where dt>'0'
group by user_name)a
left join user_info b on a.user_name=b.user_name)c
where c.level=1;
7.9 统计每月的支付金额和每年的支付金额汇总
select a.dt,
sum(a.year_amount),
sum(a.month_amount)
from
(select substr(dt,1,4) as dt,
sum(pay_amount) year_amount,
0 as month_amount
from user_trade
where dt>'0'
group by substr(dt,1,4)
union all
select substr(dt,1,7) as dt,
0 as year_amount,
sum(pay_amount) as month_amount
from user_trade
where dt>'0'
group by substr(dt,1,7)
)a
group by a.dt;
7.10 统计每个品类的购买用户数(这里要用到一张新表user_goods_category,第一列是用户名,第二列是用户购买的物品种类)
先建个user_goods_category表
数据:
Abby clothes,food,electronics
Ailsa book,clothes,food
Albert clothes,electronics,computer
见表,查询:
create table user_goods_category(
user_name string,
category_detail string)
row format delimited fields terminated by '\t';
select b.category,
count(distinct a.user_name)
from user_goods_category a
lateral view explode(split(category_detail,',')) b as category
group by b.category;
7.11 计算出每12个月的用户累计支付金额
SELECT a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month rows between 11 preceding and current row)
FROM
(SELECT substr(dt,1,7) as month,
sum(pay_amount) as pay_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,7))a;
7.12 计算出每4个月的最大退款金额
SELECT a.month,
max(a.refund_amount) over(order by a.month rows between 3 preceding and current row)
FROM
(SELECT substr(dt,1,7) as month,
sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY substr(dt,1,7))a;
7.13 退款时间间隔最长的用户
SELECT b.user_name,
b.refund_days
FROM
(SELECT a.user_name,
datediff(a.dt,a.lag_dt) refund_days,
rank() over(order by datediff(a.dt,a.lag_dt) desc) as rank
FROM
(SELECT user_name,
dt,
lag(dt) over(partition by user_name order by dt) lag_dt
FROM user_refund
WHERE dt>'0' )a
WHERE a.lag_dt is not null)b
WHERE b.rank=1;
7.14 每个性别,不同性别和手机品牌的退款金额分布
SELECT b.sex,
b.phonebrand,
sum(a.refund_amount)
FROM
(SELECT user_name,
sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY user_name)a
JOIN
(SELECT user_name,
sex,
extra2['phonebrand'] phonebrand
FROM user_info) b on a.user_name=b.user_name
GROUP BY b.sex,
b.phonebrand
GROUPING SETS (b.sex,(b.sex,b.phonebrand));
7.15 把每个用户购买的品类变成一行,品类间用逗号分隔
SELECT a.user_name,
concat_ws(',',collect_set(a.goods_category))
FROM
(SELECT user_name,
goods_category
FROM user_trade
WHERE dt>'0'
GROUP BY user_name,
goods_category)a
GROUP BY a.user_name;
7.16 2017,2018年按月累计去重的退款用户数
SELECT b.year,
b.month,
sum(b.user_num) over(partition by b.year order by b.month)
FROM
(SELECT a.year,
a.month,
count(distinct a.user_name) user_num
FROM
(SELECT year(dt) as year,
user_name,
min(month(dt)) as month
FROM user_refund
WHERE year(dt) in (2017,2018)
GROUP BY year(dt),
user_name)a
GROUP BY a.year,
a.month)b
ORDER BY b.year,
b.month
limit 24;