Hive SQL: days remaining in the current month

Reposted

小题大作 2024-11-02 11:24:00


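Contents
1. Dimension-combination analysis
2. Converting between rows and columns
3. Column type conversion
4. The four BY clauses
   1. order by
   2. sort by
   3. Distribute By (data distribution)
   4. Cluster By
   Case study
5. File storage formats
   1. Row-oriented storage
   2. Column-oriented storage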

1. Dimension-combination analysis

SQL keyword: grouping sets
Case study

Data

u1,a,app,andriod
u2,b,h5,andriod
u1,b,h5,ios
u1,a,h5,andriod
u3,c,app,ios
u4,b,app,ios
u1,a,app,ios
u2,c,app,ios
u5,b,app,ios
u4,b,h5,andriod
u6,c,h5,andriod
u2,c,h5,andriod
u1,b,xiao,ios
u2,a,xiao,andriod
u2,a,xiao,ios
u3,a,xiao,ios
u5,a,xiao,andriod
u5,a,xiao,ios
u5,a,xiao,ios

Create the table

create table user_shop_log(
user_id string,
shop string,
channle string ,
os string 
)
row format delimited fields terminated by ',';

Load the data: load data local inpath "/home/hadoop/tmp/data/user_log.txt" into table user_shop_log;
Questions

1. Number of visits per shop

select 
shop,
count(user_id) as cnt 
from user_shop_log
group by 
shop ;


2. Number of visits per shop per user

select
shop,
user_id,
count(user_id) as cnt
from user_shop_log
group by shop,user_id;


3. Number of visits per shop per user per channel

select
shop,
user_id,
channle,
count(user_id) as cnt
from user_shop_log
group by shop,user_id,channle;


4. Number of visits per shop per user per channel per operating system

select
shop,
user_id,
channle,
os,
count(user_id) as cnt
from user_shop_log
group by shop,user_id,channle,os;

5. Number of logins per user per operating system

select
user_id,
os,
count(user_id) as cnt
from user_shop_log
group by user_id,os;


6. Number of page views per channel per operating system

select
channle,
os,
count(user_id) as cnt
from user_shop_log
group by channle,os;


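Dimension-combination analysis with GROUPING SETS: a single query computes the aggregate for every listed combination of dimensions, instead of running one group by query per combination. Columns that are not part of a given grouping set come back as NULL.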

select 
user_id,
shop ,
channle  ,
os  ,
count(1)
from user_shop_log
group by 
user_id,
shop ,
channle  ,
os 
grouping sets(
(user_id),
(user_id,shop),
(user_id,channle),
(user_id,os),
(user_id,shop,channle),
(user_id,shop,os),
(user_id,shop,channle,os),
(shop),
(shop,channle),
(shop,os),
(shop,channle,os),
(channle),
(channle,os),
(os)
);
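To see what GROUPING SETS computes, the following Python sketch mimics its semantics on the first four sample rows: one pass per grouping set, with the columns outside the set replaced by None (standing in for Hive's NULL). It only illustrates the shape of the result, not how Hive executes the query, and the helper name grouping_sets is made up for this example.

```python
from collections import Counter

COLUMNS = ("user_id", "shop", "channle", "os")

# First four rows of the sample data
rows = [
    ("u1", "a", "app", "andriod"),
    ("u2", "b", "h5", "andriod"),
    ("u1", "b", "h5", "ios"),
    ("u1", "a", "h5", "andriod"),
]


def grouping_sets(rows, sets):
    """Hypothetical helper: count(1) per grouping set, None standing in for NULL."""
    counts = Counter()
    for group_cols in sets:
        for row in rows:
            # Keep only the columns in this grouping set; mask the rest.
            key = tuple(v if c in group_cols else None for c, v in zip(COLUMNS, row))
            counts[key] += 1
    return counts


# Equivalent in spirit to: grouping sets((shop), (user_id, os))
counts = grouping_sets(rows, [("shop",), ("user_id", "os")])
for key, cnt in sorted(counts.items(), key=str):
    print(key, cnt)
```

Each printed tuple corresponds to one output row of the Hive query: the grouped columns carry values and the others are None.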

2. Converting between rows and columns

Collapsing multiple rows into one array column

xxxx => array
Case study

Data

zuan,王者荣耀
zuan,吃饭
zuan,rap
zuan,唱歌
甜甜,王者荣耀
甜甜,哥
甜甜,吃鸡

Create the table

create table t1(
name string ,
interesting string 
)
row format delimited fields terminated by ',';

Load the data: load data local inpath "/home/hadoop/tmp/data/t1.txt" into table t1;
Query

select 
name,
collect_list(interesting) as interestings,
concat_ws("|",collect_list(interesting)) as interestings_blk
from t1
group by name ;
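As a rough illustration of what collect_list and concat_ws do per group, here is a small Python sketch over the same sample rows. It keeps the input order within each group, which is an assumption of the sketch; the variable names simply mirror the column aliases in the query.

```python
from collections import defaultdict

# Rows of t1: (name, interesting)
rows = [
    ("zuan", "王者荣耀"), ("zuan", "吃饭"), ("zuan", "rap"), ("zuan", "唱歌"),
    ("甜甜", "王者荣耀"), ("甜甜", "哥"), ("甜甜", "吃鸡"),
]

# collect_list: gather every value of the group into a list (duplicates kept)
interestings = defaultdict(list)
for name, interesting in rows:
    interestings[name].append(interesting)

# concat_ws("|", ...): join the list elements with the separator
interestings_blk = {name: "|".join(values) for name, values in interestings.items()}

for name in interestings:
    print(name, interestings[name], interestings_blk[name])
```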


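3. Column type conversion

Premise:
Any data type can be converted to string.
When numeric values are stored as string:

1. Arithmetic (+, -, *, /) still works; Hive handles the conversion.
2. Sorting is affected.

Case study

Data

Strings are sorted in dictionary order (character by character, a-z).
Sorting numeric strings in descending order therefore gives a wrong result such as:
9000
900
1500
1000
100
Ways to fix it:

1. Alter the table so the column has a numeric type
2. Convert the type in the query

Create the table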

create table t2(
sql string
);

Load the data:
load data local inpath "/home/hadoop/tmp/data/t2.txt" into table t2;
Query

select  
cast(sql  as bigint ) as sql_alias
from t2 
order by sql_alias;
-- cast the string to bigint first, then sort
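The difference between sorting strings and sorting numbers is easy to reproduce outside Hive. The Python sketch below sorts the same five values both ways; the int conversion plays the role of the cast to bigint.

```python
values = ["1000", "9000", "100", "900", "1500"]

# Sorting the strings compares character by character (dictionary order)
as_strings = sorted(values, reverse=True)

# Converting to integers first, like cast(sql as bigint), compares magnitudes
as_numbers = sorted((int(v) for v in values), reverse=True)

print(as_strings)
print(as_numbers)
```

The string sort reproduces the problematic order 9000, 900, 1500, 1000, 100, while the numeric sort puts 1500 and 1000 ahead of 900.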

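4. The four BY clauses

1. order by

1. Sorts globally
2. Uses only one reducer
Usage

Enabling strict mode (it is usually off)
Strict mode rejects certain risky queries.

Enable: set hive.mapred.mode=strict;
Disable: set hive.mapred.mode=nonstrict;
With strict mode on, select * from emp_p; no longer works
(emp_p is a partitioned table; non-partitioned tables behave as usual).
Only a query that filters on the partition, such as select * from emp_p where deptno=20;, is allowed.

select * from emp order by empno limit 10;
Sorts by empno in ascending order. In strict mode, order by must be accompanied by limit.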

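2. sort by

1. Sorts within each partition (each reducer's output)
2. Uses the default number of reduce tasks
3. Does not guarantee a global order
If there is only 1 reduce task, order by and sort by give the same result.

Set the number of reduce tasks: set mapred.reduce.tasks=2;
Run the query again; the output is now split into two parts.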


Write the result to a local directory to inspect it
insert OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/data/exemple/sortby'
select * from emp sort by empno;
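The following Python sketch contrasts the two behaviours with two simulated reducers. The empno values are made up, and dealing rows to reducers round-robin is an assumption of the sketch, not a description of how Hive assigns rows; the point is only that each part is sorted while the concatenation is not.

```python
NUM_REDUCERS = 2

# Hypothetical empno values, in arrival order
empnos = [7839, 7369, 7902, 7499, 7788, 7521]

# Assumption: rows are dealt to the reducers round-robin
partitions = [empnos[i::NUM_REDUCERS] for i in range(NUM_REDUCERS)]

# sort by: every reducer sorts only its own rows
sort_by_parts = [sorted(part) for part in partitions]
sort_by_concat = [e for part in sort_by_parts for e in part]

# order by: a single reducer sorts everything
order_by_output = sorted(empnos)

print(sort_by_parts)
print(sort_by_concat == order_by_output)
```

With one reducer there would be a single part, and the two results would coincide.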


3. Distribute By (data distribution)

Data
Earnings per quarter

2020,1w
2020,2w
2020,1w
2020,0.5w
2021,10w
2021,20w
2021,19w
2021,1.5w
2022,1.3w
2022,2w
2022,1w
2022,0.5w

Create the table

create table hive_distribute(
year string,
earning string
)
row format delimited fields terminated by ',';

Load the data: load data local inpath "/home/hadoop/tmp/data/exemple/distribute.txt" into table hive_distribute;
Query

With 2 reduce tasks:

insert OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/data/exemple/distribute'
select * from hive_distribute distribute by year sort by earning;
To try 3 reduce tasks instead, change the setting and rerun the query.
Reset the number of reduce tasks to the default: set mapred.reduce.tasks=-1;
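A small Python sketch of distribute by year sort by earning with two simulated reducers may help. The function reducer_for is a stand-in for Hive's hash of the distribution key and is purely an assumption; what matters is that equal keys always reach the same reducer, and that each reducer then sorts its own rows.

```python
NUM_REDUCERS = 2

rows = [
    ("2020", "1w"), ("2020", "2w"), ("2020", "1w"), ("2020", "0.5w"),
    ("2021", "10w"), ("2021", "20w"), ("2021", "19w"), ("2021", "1.5w"),
    ("2022", "1.3w"), ("2022", "2w"), ("2022", "1w"), ("2022", "0.5w"),
]


def reducer_for(year):
    # Assumption: stand-in for Hive's hash of the distribute-by key
    return int(year) % NUM_REDUCERS


# distribute by year: rows with the same year go to the same reducer
partitions = [[] for _ in range(NUM_REDUCERS)]
for row in rows:
    partitions[reducer_for(row[0])].append(row)

# sort by earning: each reducer sorts its rows; earning is a string,
# so the comparison is in dictionary order
outputs = [sorted(part, key=lambda row: row[1]) for part in partitions]

for i, out in enumerate(outputs):
    print("reducer", i, out)
```

Because earning is declared as string, values such as 1.5w sort before 10w; this is the same dictionary-order effect described in the type conversion section.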

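4. Cluster By

For reference

Cluster By and Distribute By are mainly used to distribute data.
Cluster By is shorthand for Distribute By plus sort by on the same column.
Usage
distribute by year sort by year <=> cluster by year

Bucketed tables

A bucketed table is stored as files on HDFS.

Bucketing subdivides the data further, below the dt partition level.

Case study

Data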

1,name1
2,name2
3,name3
4,name4
5,name5
6,name6
7,name7
8,name8

Create the table

create table hive_bucket(
id int,
name string 
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ",";

-- split the rows into 4 buckets by id

Load the data: load data local inpath "/home/hadoop/tmp/data/exemple/bucket.txt" into table hive_bucket;
Remember to reset the number of reduce tasks to the default value.
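To picture how rows spread over 4 buckets, the Python sketch below assigns each sample row to a bucket. It assumes the bucket is simply id modulo the bucket count, as a stand-in for a hash of the clustering column, so treat the exact placement as illustrative.

```python
NUM_BUCKETS = 4

# The eight sample rows: (id, name)
rows = [(i, f"name{i}") for i in range(1, 9)]

# Assumption: bucket = id mod 4 (stand-in for hash(id) mod bucket count)
buckets = {b: [] for b in range(NUM_BUCKETS)}
for id_, name in rows:
    buckets[id_ % NUM_BUCKETS].append((id_, name))

for b, content in buckets.items():
    print("bucket", b, content)
```

Under this assumption every bucket ends up with two rows, for example ids 4 and 8 in bucket 0.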

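5. File storage formats

1. Row-oriented storage

1. Meaning

1. All columns of a row are stored together in the same block.
2. A block therefore mixes values of many data types.

2. How queries read it

Row-oriented storage reads every column and then filters out the columns the user asked for.
If the user queries only a few columns, this causes considerable disk I/O overhead.

3. Row-oriented formats

1. TextFile: plain text file
2. SequenceFile: binary key-value file

2. Column-oriented storage

1. Meaning: data is stored column by column
2. Column-oriented file formats

1. RCFile: an early product of the shift from row to column storage (rarely used today)
2. ORC Files
3. Parquet

3. Best suited to: queries that read only a few columns
4. Drawback: queries that load every column of the table
5. Advantage: columnar files are smaller than row-oriented files (provided both are compressed)
6. Case study

Create table command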

create table hive_distribute_col(
year string,
earning string
)
row format delimited fields terminated by ','
stored as orc;

Insert data command

insert into table hive_distribute_col
select 
* 
from hive_distribute;
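Why a columnar format helps when only a few columns are queried can be shown with a toy model. The Python sketch below lays the same three rows out row by row and column by column, then counts how many stored values must be touched to read only year. It is a simplified picture under that toy model and says nothing about the actual ORC or Parquet file layout.

```python
COLUMNS = ("year", "earning")
rows = [("2020", "1w"), ("2020", "2w"), ("2021", "10w")]

# Row-oriented: the values of one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: the values of one column sit next to each other.
column_layout = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def years_from_rows(layout):
    """Scan every stored value, keeping only the year column."""
    touched = len(layout)
    return layout[0::len(COLUMNS)], touched


def years_from_columns(layout):
    """Read just the year column; the earning column is never touched."""
    column = layout["year"]
    return list(column), len(column)


row_years, row_touched = years_from_rows(row_layout)
col_years, col_touched = years_from_columns(column_layout)
print(row_years, row_touched)
print(col_years, col_touched)
```

Both layouts return the same years, but the row layout touches all six stored values and the column layout only three. When every column is requested, that advantage disappears.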
