Hive常用函数总结一
- 1. select ..A.. from ..B.. where ..c..
- 查看列名字段desc
- 选出城市在北京,性别为女性的10名用户
- 查看交易表名desc
- 2. group by
- 2019年2月到5月,每个品类有多少人购买,累计金额是多少
- group by .. having..
- 3. order by...
- 2019年5月,支付金额最多的top 5 用户
- 5. from_unixtime unix_timestamp
- 把时间戳转换为日期
- 6. data_diff data_add data_sub
- 用户首次激活时间,与2019年5月1日的日期间隔
- 7. 条件函数
- case when
- if函数
- 8. 字符串函数
- 每个月新激活的用户数:
- 不同手机品牌的用户数get_json_object:
- 9. 聚合统计函数
- ELLA用户2018年平均支付金额,以及2018年最大支付日期和最小支付日期的时间间隔
- 用户激活时间在2018年,年龄段在20-30和30-40的婚姻状况分布
- 10. 执行顺序
1. select …A… from …B… where …c…
A:列名 B:表名 C:筛选条件
查看列名字段desc
选出城市在北京,性别为女性的10名用户
select user_name
from user_info
where city='beijing' and sex='female'
limit 10;
查看交易表名desc
对分区表,where 条件中必须对分区字段进行限制:
常见错误,未对分区进行限制示例:
select user_name, piece, pay_amount
from user_trade
where dt='2019-05-01' and good_category='food'
2. group by
2019年2月到5月,每个品类有多少人购买,累计金额是多少
select goods_category,
count(distinct user_name) as user_num,
sum(pay_amount) as total_amount
from user_trade
where dt between '2019-02-1' and '2019-05-01'
group by goods_category;
group by … having…
2019年4月,支付金额超过5万元的用户
select user_name, sum(pay_amount) as total_amount
from user_trade
where dt between 2019-04-01' and '2019-0430'
group by user_name having sum(pay_amount) >50000;
3. order by…
2019年5月,支付金额最多的top 5 用户
select user_name,sum(pay_amount) as total_amount
from user_trade
where dt between '2019-05-01' and '2019-05-30'
group by user_name
order by total_amount desc
limit 5;
ASC :升序(默认) DESC:降序
对多个字段进行排列时: order by A ASC, B DESC
5. from_unixtime unix_timestamp
把时间戳转换为日期
select pay_time,from_unixtime(pay_time,'yyyy-MM-dd hh:m:ss')
from user_trade
where dt= '2019-04-09';
from_unixtime(bigint unixtime, string format)
format:
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd hh
yyyy-MM-dd hh:mm
yyyyMMdd
把日期转换为时间戳:
unix_timestamp(string date)
6. data_diff data_add data_sub
用户首次激活时间,与2019年5月1日的日期间隔
select user_name,
data_diff('2019-05-01',todate(firstactivetime))
from user_info
limt 10;
datadif (string enddate, string startdate)
日期增加减少函数:
data_add(string startdate, int days)
data_sub(string startdate, int days)
7. 条件函数
case when
统计20岁以下,20-30,30-40,40以上的用户数:
select case when age < 20 then '20s岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age <40 then '30-40岁'
else ‘40岁以上’ end,
count (distinct user_id) user_num
from user_info
group by case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end;
if函数
统计每个性别用户等级高低的分布情况:
select sex,
if(level>5,‘高’,‘低’),
count(distinct user_id) user_num
from user_info
group by sex,
if(level>5,'高','低');
8. 字符串函数
每个月新激活的用户数:
select substr(firstactivetime,1,7) as month
count(distinct user_id) as user_num
from user_info
group by substr(firstactivetime,1,7);
substr(string A, int start, int len)
不同手机品牌的用户数get_json_object:
select get_json_object(extral,'$.phonebrand') as phone_brand
count(distinct user_id) user_num
from user_info
group by get_json_object(extral, '$.phonebrand');
get_json_object(string json_string, string path)
json_string: 需要解析的json 字段
path: 用.key 取出想要获取的value;
9. 聚合统计函数
ELLA用户2018年平均支付金额,以及2018年最大支付日期和最小支付日期的时间间隔
select avg(pay_amount) as avg_amount,
datediff(max(from_unixtime(pay_time,"yyyy-MM-DD"),min(from_unixtime(pay_time,"yyyy-MM-dd")))
from user_trade
where year(dt)='2018' and user_name='ELLA';
聚合函数之间不允许互相嵌套:avg(count(*))
2018年购买商品品类两个以上的用户数:
select count(a.user_name)
from(
select user_name,
count(distinct goods_category) as category_num
from user_trade
where year(dt)='2018'
group by user_name having count(distinct goods_category )>2 )as a;
用户激活时间在2018年,年龄段在20-30和30-40的婚姻状况分布
select a.age_type,
if(a.marriage_status = 1,"已婚","未婚“”),
count(distinct a.user_id)
from(
select case when age<20 then "20岁以下"
when age >=20 and age < 30 then "20-30岁"
when age>=30 and age < 40 then "30-40岁"
else "40岁以上" end as age_type,
get_json_object(extra1, "$.marriage_status") as marriage_status,
user_id
from user_info
where to_date(firstactivetime) between '2018-01-01' and '2018-012-31' ) as a;
where a.age_type in ("20-30岁","30-40岁")
group by a.age_type,
if(a.marriage_status = 1,"已婚","未婚“”);
10. 执行顺序
注意执行顺序:order by 执行顺序在select 之后
from – where – group by – having – select – order by – limit