Hive: creating tables and partitions from existing tables

Reposted

数据探索者 2024-09-14 13:04:41


1. Start/stop script for Hadoop, Hive and Zeppelin

my_start(){
	# "start" starts everything; any other argument stops everything
	if [ "$1" = "start" ]; then
		# start Hadoop
		sh /opt/soft/hadoop260/sbin/start-dfs.sh
		sh /opt/soft/hadoop260/sbin/start-yarn.sh
		# start HiveServer2 in the background
		nohup /opt/soft/hive110/bin/hive --service hiveserver2 &
		# start Zeppelin
		sh /opt/soft/zeppelin081/bin/zeppelin-daemon.sh start
		echo "start complete"
	else
		# stop Zeppelin
		sh /opt/soft/zeppelin081/bin/zeppelin-daemon.sh stop
		# stop Hive: there may be several RunJar processes, so kill them in a loop
		# note: this kills every RunJar JVM, including ones not started by Hive
		hiveprocess=$(jps | grep RunJar | awk '{print $1}')
		for num in $hiveprocess; do
			kill -9 "$num"
		done
		# stop Hadoop
		sh /opt/soft/hadoop260/sbin/stop-dfs.sh
		sh /opt/soft/hadoop260/sbin/stop-yarn.sh
		echo "stop complete"
	fi
}

my_start "$1"

2. Hive command-line mode

When Hive is running in the background and you want to work with it from a terminal, use Hive's command-line mode (hive -e):

hive -e 'show databases'
hive -e 'use mydemo;show tables'
hive -e 'select * from mydemo.userinfos'

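3. Hive managed (internal) tables and external tables

Dropping a managed table removes both the table definition and the table's data.
Managed-table data is normally stored in a directory named after the table, and can be stored in a compressed columnar format such as ORC or Parquet when needed.
Managed tables stored this way are relatively fast to query.
Dropping an external table removes only the table definition; the data files are left in place.
An external table's data can live at any location you choose.
External tables usually map raw files as they are, so their data is often uncompressed text (stored as orc or stored as parquet is still allowed on an external table).
Queries over such raw text files are usually slower than queries over ORC or Parquet data; the difference comes from the file format, not from the table being external.
External tables are typically used to map the user's raw source data.
A table created without the external keyword is a managed table; a table created with external is an external table.
Hive manages the data of a managed table itself; the data of an external table simply sits on HDFS and is managed outside Hive.
Managed-table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse; the value is set in /opt/soft/hive110/conf/hive-site.xml and can be changed to a suitable location). The location of an external table's data is chosen with location when the table is created.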
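
The difference in drop behaviour can be illustrated with a small toy model. This is only a sketch in Python, not Hive's implementation: the ToyMetastore class, its methods and the temporary directories are all hypothetical names made up for the illustration.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Toy stand-in for the Hive metastore (hypothetical, for illustration only)."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = warehouse_dir
        self.tables = {}  # table name -> (data directory, is_external)

    def create_table(self, name, external=False, location=None):
        # Without an explicit location the data goes under the warehouse directory.
        path = location or os.path.join(self.warehouse_dir, name)
        os.makedirs(path, exist_ok=True)
        self.tables[name] = (path, external)
        return path

    def drop_table(self, name):
        path, external = self.tables.pop(name)  # the definition is always removed
        if not external:
            shutil.rmtree(path)  # managed table: the data is removed as well


root = tempfile.mkdtemp()
store = ToyMetastore(os.path.join(root, "warehouse"))
inside_dir = store.create_table("inside")
outside_dir = store.create_table(
    "outside", external=True, location=os.path.join(root, "tab", "outside")
)

store.drop_table("inside")
store.drop_table("outside")
managed_data_left = os.path.exists(inside_dir)
external_data_left = os.path.exists(outside_dir)
print(managed_data_left, external_data_left)  # False True
```

After both drops neither table is known to the toy metastore any more, but only the external table's directory still exists.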

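Create a managed table stored as ORC

%hive
create table mydemo.inside(
	id string,
	name string
)
row format delimited fields terminated by ','  -- ORC is a binary columnar format, so this delimiter is not used for the stored data
stored as orc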

Create an external table

%hive
create external table mydemo.outside(
	id string,
	name string
)
row format delimited fields terminated by ','
location '/tab/outside'  -- the HDFS directory that holds the external table's data

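Viewing managed and external tables

The TBLS table of the metastore database in MySQL lists every managed and external table; its TBL_TYPE column shows MANAGED_TABLE or EXTERNAL_TABLE.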

select * from hive.`TBLS`

Upload a data file to the external table's directory

%sh
hdfs dfs -put /opt/data/data.txt /tab/outside

The data is now visible in the external table outside.

Create a table with complex data types

%hive
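-- mydemo.outside already exists from the earlier step; drop it first or use another table name
create external table mydemo.outside(
	id string,
	name string,
	job ARRAY<string>,
	sex_age STRUCT<sex:int,age:int>,
	skill MAP<string,string>
)
row format delimited fields terminated by ' '   -- column (field) delimiter: a space
collection items terminated by ','    -- delimiter between elements of an array, struct or map
map keys terminated by ':'             -- delimiter between a map key and its value
location '/tab'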

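Prepare the data

1 ZhangSan coding,debugging,ops 1,30 skill1:eating,skill2:sleeping
1 LiSi coding,ops 0,25 skill1:eating,skill2:gaming
1 WangWu ops,troubleshooting 1,20 skill1:badminton,skill2:singing
1 ZhaoLiu design,allocation 0,32 skill1:beauty,skill2:shopping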

Load the data

%sh
hdfs dfs -put /opt/data/data.txt /tab

Query the table with complex types

%hive
select id,job[0],sex_age.age,skill['skill1'] from mydemo.outside
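
To make the three delimiters concrete, here is a rough Python sketch that splits one text line the way the table definition describes it and then picks the same four values as the query. It only mimics the delimiter rules and is not how Hive's SerDe is implemented; parse_row is a hypothetical helper, and the sample line assumes the name and the job list are separated by a space.

```python
def parse_row(line):
    """Split one text line according to the table's three delimiters."""
    id_, name, job, sex_age, skill = line.split(" ")  # fields terminated by ' '
    sex, age = sex_age.split(",")                     # struct members use the collection delimiter
    return {
        "id": id_,
        "name": name,
        "job": job.split(","),                        # collection items terminated by ','
        "sex_age": {"sex": int(sex), "age": int(age)},
        # map entries are split on ',', key and value on ':'
        "skill": dict(item.split(":", 1) for item in skill.split(",")),
    }


row = parse_row("1 ZhangSan coding,debugging,ops 1,30 skill1:eating,skill2:sleeping")

# Same projection as: select id, job[0], sex_age.age, skill['skill1']
result = (row["id"], row["job"][0], row["sex_age"]["age"], row["skill"]["skill1"])
print(result)  # ('1', 'coding', 30, 'eating')
```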

Using a CSV file as the data source

Create the directory

%sh
hdfs dfs -mkdir -p /tab1

Upload the source file to the directory

%sh
hdfs dfs -put /opt/data/data.csv /tab1

Create the table

%hive
create external table mydemo.usi(
    id string,
    name string,
    bir string,
    likes string
)
row format delimited fields terminated by ','
location '/tab1'

Skipping the first (header) row of a Hive table

Skip the header row when creating the table:

hive > create table hive_movies
(
     rank int,
     src string,
     name string,
     box_office string,
     avg_price int,
     avg_people int,
     begin_date string
) row format delimited fields terminated by ','
TBLPROPERTIES ('skip.header.line.count'='1');

If the table already exists, use the following command instead:

hive > alter table hive_movies set TBLPROPERTIES ('skip.header.line.count'='1');

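When a column contains a value with an embedded comma, such as "eat,sleep"

A plain ',' field delimiter would split such a value into two columns; OpenCSVSerde keeps a quoted value together as one column.

%hive
-- mydemo.usi was already created above; drop it first or use another table name
create external table mydemo.usi(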
    id string,
    name string,
    bir string,
    likes string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties
(
    'separatorChar'=',',
    'quoteChar'='\"',
    'escapeChar'='\\'
)
location '/tab1'
tblproperties('skip.header.line.count'='1')  -- skip the header row
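
The effect of the quote character and of skipping the header can be tried out locally with Python's csv module. This is only an analogy under stated assumptions: Python's parser is not OpenCSVSerde, and the two-line sample file below is made up for the illustration.

```python
import csv
import io

# Hypothetical file content: a header row plus one data row with a quoted comma
raw = 'id,name,bir,likes\n1,Tom,1990/01/02,"eat,sleep"\n'

reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"', escapechar="\\")
next(reader)         # like skip.header.line.count = 1: drop the header row
rows = list(reader)  # quote-aware parsing keeps "eat,sleep" in one column

naive = raw.splitlines()[1].split(",")  # what a plain ',' split would produce

print(rows)   # [['1', 'Tom', '1990/01/02', 'eat,sleep']]
print(naive)  # ['1', 'Tom', '1990/01/02', '"eat', 'sleep"']
```

The naive split yields five pieces for a four-column table, which is the problem the SerDe-based table definition avoids.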

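Creating a new table from an existing table (advanced create-table statements)

create table mydemo.usi1 as select * from mydemo.usi  -- copies both the structure and the data
create table mydemo.usi2 like mydemo.usi              -- copies only the structure, not the data; MySQL also supports create table ... like, Oracle does not

Creating a table directly from a WITH query in this way is not recommended:
create table mydemo.us3 as
The WITH query style itself, however, is highly recommended:
with
r1 as (select userid id,username name from mydemo.userinfos),
r2 as (select id,name from mydemo.usi)
select * from r1 union all select * from r2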

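Loading data

Loading data with load (the L of ETL): load only copies or moves files into the table and cannot convert the data format.
With local, the file is copied from the local Linux filesystem into Hive.
With overwrite, the existing data is replaced (full load); without overwrite, the new data is appended (incremental load).
For example, a city table: use overwrite for a full import.
For example, an order table: omit overwrite for an incremental import.
%hive
load data local inpath '/opt/data/dd.txt' overwrite into table mydemo.test1

Without local, the path is read from HDFS, and the file is moved into the table's directory rather than copied.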

Temporary tables

%hive
create temporary table mydemo.mytem(
    id string,
    name string
)
row format delimited fields terminated by ','
stored as textfile

Loading data with load (load cannot convert the data format)

%hive
load data local inpath '/opt/data/dd.txt' overwrite into table mydemo.test1

Partitioned tables

%hive
create table mydemo.my_part(
    id string,
    name string
)
partitioned by (birmonth string)
row format delimited fields terminated by ','

Manually create a static partition, then insert data into it (two options)

%hive
alter table mydemo.my_part add partition(birmonth='01')

Option 1

%hive
load data local inpath '/opt/data/dd.txt' overwrite into table mydemo.my_part partition(birmonth='01')

Option 2

%hive
insert into mydemo.my_part partition(birmonth='01') select id,name from mydemo.usi

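Dynamic partitions

%hive
-- Dynamic partitioning creates one partition per distinct value of the partition column in the query result (its cardinality)
-- By default at most 100 dynamic partitions may be created per node and 1000 in total; both limits can be raised
-- set hive.exec.max.dynamic.partitions=1000
-- set hive.exec.max.dynamic.partitions.pernode=1000

-- Dynamic partitioning must be enabled, and nonstrict mode is required when every partition column is dynamic
-- set hive.exec.dynamic.partition=true
-- set hive.exec.dynamic.partition.mode=nonstrict
-- mydemo.my_part1 is assumed to have the same definition as mydemo.my_part above
insert into mydemo.my_part1 partition(birmonth) select id ,name,month(regexp_replace(bir,"/",'-')) birmonth from mydemo.usi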
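
How the rows fan out into partitions can be sketched in plain Python. The rows below are hypothetical, and the sketch assumes bir has the form yyyy/MM/dd; it only imitates the regexp_replace and month calls of the insert statement. Because month yields an integer, the resulting partition would be expected to be birmonth=1 rather than the birmonth='01' used for the static partition.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows of mydemo.usi: (id, name, bir)
rows = [
    ("1", "Tom", "1990/01/02"),
    ("2", "Ann", "1988/01/15"),
    ("3", "Bob", "1992/11/30"),
]

partitions = defaultdict(list)
for id_, name, bir in rows:
    # regexp_replace(bir, "/", "-") followed by month(...)
    birmonth = datetime.strptime(bir.replace("/", "-"), "%Y-%m-%d").month
    # one partition directory per distinct value of the partition column
    partitions[f"birmonth={birmonth}"].append((id_, name))

partition_names = sorted(partitions)
print(partition_names)  # ['birmonth=1', 'birmonth=11']
```

Three rows with two distinct months produce two partitions, which is the cardinality rule described in the comments above.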

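Bucketing versus partitioning

Partitioning splits a table into directories.
Bucketing splits the data into files, one per bucket.
A row's bucket is the hash of the bucketing column modulo the number of buckets, much like hash partitioning in the Hadoop shuffle phase.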
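
The hash-and-remainder rule can be sketched as follows. The hash used here is a 31-based polynomial hash in the style of Java's String.hashCode, chosen as a stand-in; it is an assumption for illustration and may differ from the hash function a given Hive version actually applies. Both function names are hypothetical.

```python
def java_style_hash(text):
    """31-based polynomial hash in the style of Java's String.hashCode (stand-in)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def bucket_for(value, num_buckets):
    # Clear the sign bit, then take the remainder: each remainder is one bucket file.
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


ids = ["a", "b", "c", "d", "e"]
buckets = {v: bucket_for(v, 4) for v in ids}
print(buckets)  # {'a': 1, 'b': 2, 'c': 3, 'd': 0, 'e': 1}
```

Equal values always land in the same bucket, which is what makes bucketed joins and sampling possible.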

The explode and posexplode functions

Both functions expand a collection into one row per element.

%hive
select explode(split(likes,',')) from mydemo.usi

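%hive
-- posexplode returns the position first and the value second, so loc is the 0-based position and ind is the element
select id,name,loc,ind from mydemo.usi lateral view posexplode(split(likes,',')) a as loc,ind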
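
The shape of the result can be imitated in a few lines of Python. This is a sketch of the semantics only; the two helper functions and the sample row are hypothetical.

```python
def explode(values):
    """One output row per element."""
    return [(v,) for v in values]


def posexplode(values):
    """One output row per element, with its 0-based position first."""
    return list(enumerate(values))


row = ("1", "Tom", "eat,sleep,code")  # hypothetical (id, name, likes)
likes = row[2].split(",")             # split(likes, ',')

exploded = explode(likes)
# lateral view: each source row is joined with the rows produced from it
lateral = [(row[0], row[1], pos, val) for pos, val in posexplode(likes)]

print(exploded)  # [('eat',), ('sleep',), ('code',)]
print(lateral)   # [('1', 'Tom', 0, 'eat'), ('1', 'Tom', 1, 'sleep'), ('1', 'Tom', 2, 'code')]
```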

Advanced Hive queries
