Hive相关操作

转载

mob604756ff4913 2021-08-27 16:48:00

文章标签 hive 数据字段数据库 unix 文章分类 代码人生

一. 基本使用
二. 数据类型
- 1. 基本数据类型
- 2. 集合数据类型
三. DDL语言
四. 数据导入导出
五. HQL查询语法
六. 函数

一. 基本使用

1. 常用命令

1.1 -e 不进入hive交互窗口执行sql语句

命令行执行sql意义: 放入shell脚本

bin/hive -e "select * from demo"

1.2 -f 执行文件中的sql语句

bin/hive -f /tmp/demo.sql

1.3 传参

--hiveconf 设置配置参数

--hivevar 设置参数，key=value，使用${key}

2. 其他命令

2.1 退出hive交互窗口

exit;

quit;

2.2 在hive cli交互命令窗口中操作hdfs文件系统

dfs -ls /

dfs -rm -r /demo

2.3 在hive cli交互命令窗口中查看本地文件系统（beeline不支持）

!ls /tmp/demo

2.4 查看在hive交互窗口输入的历史命令

进入当前用户的根目录/root或/home/xuser
查看.hivehistory文件

cat .hivehistory

3. 常见属性配置

3.1 Hive数据仓库位置配置

default数据仓库最原始位置是在HDFS的/user/hive/warehouse路径下
在仓库目录下，没有对默认的数据库default创建文件夹，如果某张表属于default数据库，直接在数据仓库目录下创建一个文件夹
配置hive-site.xml的hive.metastore.warehouse.dir来修改位置

3.2 信息显示设置

显示当前数据库，配置hive-site.xml的hive.cli.print.current.db为true
查询结果显示表头信息，配置hive-site.xml的hive.cli.print.header为true

3.3 运行日志存储目录配置

默认存放在/tmp/root/hive.log目录下
复制/opt/module/hive/conf/hive-log4j.properties.template为hive-log4j.properties
在hive-log4j.properties中修改hive.log.dir=/opt/app/hive/logs

3.4 参数配置

3.4.1 查看当前所有配置信息

set;

3.4.2 参数设置方式

优先级依次增加，系统级参数如log4j设定必须使用前两种，因为参数读取在会话建立前完成

1> 配置文件

默认配置文件：hive-default.xml
用户自定义配置文件：hive-site.xml
Hive会读入hadoop的配置，因为Hive是作为Hadoop的客户端启动，Hive的配置会覆盖Hadoop的配置

2> 命令行

启动Hive时，设定参数

bin/hive -hiveconf mapred.reduce.tasks=5;

3> HQL

在HQL中使用set设置参数

set mapred.reduce.tasks=5;

set hive.exec.mode.local.auto=true;

二. 数据类型

1. 基本数据类型

1.1 数字类型

TINYINT
SMALLINT
INT
BIGINT
FLOAT
DOUBLE

1.2 时间类型

TIMESTAMP
DATE

1.3 字符串类型

STRING
VARCHAR 长度不定字符串，字符数1-65535
CHAR 长度固定字符串，最大字符数255

1.4 其他类型

BOOLEAN
BINARY

2. 集合数据类型

2.1 ARRAY

array<data_type>

--avatar,aang:katara:sokka,2005-02-21
create table t_movie(movie_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collectin items terminated by ':'
--
select movie_name,actors[0] form t_movie;
select movie_name,actors from t_movie where array_contains(actors,'katara');
select movie_name,size(actors) from t_movie;

2.2 MAP

map<primitive_type,data_type>

--katara,husband:aang#brother:sokka,25
create table t_person(name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
--查询指定key的value
select name,family_members['brother'] as brother from t_person;
--查询所有key
select name,map_keys(family_members) as relation from t_person;
--查询所有value
select name,map_values(family_members) from t_person;
select name,map_values(family_members)[0] from t_person;

2.3 STRUCT

struct<col_name:data_type,...>

--katara,25:female:fuzhou
create table t_person(name string,info struct<age:int,sex:string,addr:string)
row format delimited fields terminated by ','
collection items terminated by ':';
--
select * from t_person;
select name,info.age from t_person;

三. DDL语言

1. 创建数据库

创建数据库，在HDFS的默认存储路径为/user/hive/warehouse/*.db

create database demo;

创建时判断是否存在

create database if not exist demo;

指定数据库在HDFS上的存储位置

create database demo location '/hive_db/demo.db'

2. 查询数据库

2.1 显示数据库

显示数据库

show databases;

通配符显示查询的数据库

show databases like 'demo*';

2.2 查看数据库详情

显示数据库信息

desc database demo;

显示数据库详细信息extended

desc database extended demo;

2.3 切换当前数据库

use demo;

3. 设置属性值

alter database demo set dbproperties('createtime'='20150211');

4. 删除数据库

删除空数据库

drop database demo;

删除时判断是否存在

drop database if exists demo;

删除非空数据库

drop database demo cascade;

5. 创建表

5.1 建表语法

create [external] table [if not exists] table_name
   [(col_name data_type [comment col_comment], ...)]
   [comment table_comment]
   [partitioned by (col_name data_type [comment col_comment], ...)]
   [clustered by (col_name, col_name, ...)
   [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
   [row format row_format]
   [stored as file_format]
   [location hdfs_path]
   [like]

5.2 字段说明

external

创建外部表

comment

为表或列添加注释

partitioned by

创建分区表

clustered by

创建分桶表

sorted by

进行桶内排序

row format

指定数据分割信息

delimited [fields terminated by char] [collection items terminated by char]
[map keys terminated by char] [lines terminated by char] 
| 
serde 'serde_class_name' [with serdeproperties (property_name=property_value, property_name=property_value, ...)]

stored as

指定存储文件类型
- sequencefile hadoop_kv序列文件
- textfile 文本
- rcfile 列式存储格式文件
- parquetfile 列式存储文件

location

指定表在HDFS上的存储位置

like

复制现有表结构，不复制数据

5.3 内部表（管理表）

默认创建的表是管理表，Hive会控制数据的生命周期，当删除一个管理表时，Hive也会删除这个表中的数据，管理表不适合和其他工具共享数据

5.4 外部表

删除外部表只删除元数据信息，不会删除数据

5.5 内部表和外部表互相转换

查询表类型详细属性

desc formatted demo;

内部表转为外部表

alter table demo set tblproperties('EXTERNAL'='TRUE');

外部表转为内部表

alter table demo set tblproperties('EXTERNAL'='FALSE');

'EXTERNAL'='TRUE'及'EXTERNAL'='FALSE'区分大小写

6. 分区表

分区表的数据可以按照某个字段的不同值, 存储在不同的子文件夹中
查询时通过where选择指定分区, 提高查询效率

6.1 静态分区

6.1.1 分区表操作

创建

create table demo(id string,info string)
partitined by (month string)
row format delimited fields terminated by '\t';

加载数据

load data local inpath '/demo1.txt' into table default.demo partition(month='201501');

load data local inpath '/demo1.txt' into table default.demo partition(month='201502');

load data local inpath '/demo1.txt' into table default.demo partition(month='201503');

查询分区表数据

--单分区查询
select * from demo where month='201501';
--多分区联合查询
select * from demo where month='201501'
union
select * from demo where month='201502'
union
select * from demo where month='201503';

增加分区

增加单个分区

alter table demo add partition(month='201504');

增加多个分区

alter table demo add partition(month='201504') partition(month='201505');

删除分区

删除单个分区

alter table demo drop partition(month='201504');

删除多个分区

alter table demo drop partition(month='201504') partition(month='201505');

查看分区

show partition demo;

查看分区表结构

desc formatted demo;

6.1.2 多级分区

partitioned by (region string, month string)

6.2 动态分区

6.2.1 机制与意义

底层机制：mapreduce中的多路输出mutipleOutputs（根据条件判断，将结果写入不同目录不同文件）
意义：自动完成数据划分，一次性处理多个分区数据

6.2.2 创建动态分区表

数据源表

create table demo2(
order_id string,
month string,
name string)
row format delimited fields terminated by '\t';

动态分区表

create table demo(
order_id string,
name string)
partitioned by (x string)
row format delimited fields terminated by '\t';

设置参数

set hive.exec.dynamic.partition=true; //使用动态分区

set hive.exec.dynamic.partition.mode=nonstrick; //无限制模式, 如果模式是strict，则必须有一个静态分区且放在最前面

set hive.exec.max.dynamic.partitions.pernode=10000; //每个节点生成动态分区的最大个数

set hive.exec.max.dynamic.partitions=100000; //生成动态分区的最大个数

set hive.exec.max.created.files=150000; //一个任务最多可以创建的文件数目

set hive.merge.mapfiles=true; //map端的结果进行合并

set mapreduce.reduce.tasks =20000; //设置reduce task个数，增加reduce阶段的并行度

加载数据

insert into table demo partition(x) --分区变量x应与demo分区变量名一致
select order_id,name,month from demo2; --select字段最后一个month会作为分区变量x的动态值

7. 修改表定义（元数据）

7.1 重命名表

alter table demo1 rename to demo2;

7.2 更新列

alter table demo change old_name new_name string;

7.3 增加列

增加字段，在partition字段前，其他字段后

alter table demo add columns(name string);

7.4 替换列

替换表中所有字段

alter table demo replace columns(name1 string, name2 int);

8. 删除表

drop table demo;

9. SerDe组件

SerDe是Serialize/Deserilize的简称，目的是用于序列化和反序列化，能为表切分、解析列，对列指定相应的数据
可以使用Hive自带SerDe或自定义SerDe
示例:

--文件内容
id=1,name=aang
id=2,name=katara
--期望输出
1	aang
2	katara
--
create table demo(id int, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties("input.regex"="id=(.*),name=(.*)")
stored as textfile;

四. 数据导入导出

1. 数据导入

1.1 load

load data [local] inpath '/tmp/datas/demo.txt' [overwrite] into table demo [partition(patcol1=val1,...)];

local：从本地加载数据，否则从HDFS加载数据
inpath：表示加载数据的路径
overwrite：覆盖表中已有数据，否则表示追加
patititon：上传到指定分区

1.2 insert...values

insert into|overwrite table demo partition(month='201601') values(1,"katara");

1.3 insert...select

单表插入

insert into|overwrite table demo partition(month='201601')
select id,name from demo2 where month='201601';

多表插入

from demo2
insert into|overwrite table demo partiton(month='201602')
select id,name where month='201602'
insert into|overwrite table demo partiton(month='201603')
select id,name where month='201603';

1.4 create..as

create table if not exists demo
as
select id,name from demo2

1.5 location

create table demo(id int, name string)
row format delimited fields terminated by ','
location '/user/hive/warehouse/demo'

1.6 import

先export导出后，再将数据导入

import table demo partition(month='201601') from '/user/hive/warehouse/export/demo';

2. 数据导出

2.1 insert

将查询结果导出到本地

insert overwrite local directory '/tmp/avatar/demo' select * from demo;

将查询结果导出到HDFS

insert overwrite directory '/user/avatar/demo' select * from demo;

将查询结果格式化导出

insert overwrite local directory 'tmp/avatar/demo'
row format delimited fields terminaterd by '\t'
select * from demo;

2.2 export

导出到HDFS

export table default.demo to 'user/hive/warehouse/export/demo';

2.3 sqoop

导出到其他关系型数据库

3. 清除表中数据

只能清除内部表中数据

truncate table demo;

五. HQL查询语法

select [all | distinct] select_expr, select_expr, ...
  from table_reference
  [where where_condition]
  [group by col_list]
  [order by col_list]
  [cluster by col_list
  |
  [distribute by col_list] [sort by col_list]
  ]

1. with..as

临时表语法

with tmp as(
select ...from demo ...)
select ...from tmp ...;

2. select...from

2.1 列别名

紧跟列名，或加入as

select ename as name, deptno dn from dmeo;

2.2 limit

限制返回行数

2.3 distinct

select distinct id,name from demo;

3. where

逐行过滤

4. 运算符

4.1 算术运算符

+ - * / % & | ^ ~

4.2 比较运算符

 > < =
 a [not] between b and c
 a is null
 a is not null
 in(a,b) --显示列表中的值
 a [not] like b --string类型，b为简单正则表达式（_：一个字符，%：任意个字符）
 a [not] rlike b --string类型，b为标准正则表达式

4.3 逻辑运算符

and or not

5. 分组

5.1 group by

分组

5.2 having

分组过滤

6. join

--笛卡尔积
t1 join t2
--内连接，满足拼接条件才拼接
t1 join t2 on t1.id=t2.id
--左（外）连接left outer join，左表所有行都保留，连接不上的右表字段为null
t1 left join t2 on t1.id=t2.id
--右（外）连接right outer join，右表所有行都保留，连接不上的左表字段为null
t1 right join t2 on t1.id=t2.id
--全（外）连接full outer join，左右表的行都保留，连接不上的字段为null
t1 full join t2 on t1.id=t2.id
--左半连接，是sql中in子句的一个变种实现
--限制是右边的表只能在on子句中设置过滤条件,，在where子句，select子句都不行
t1 left semi join t2 on t1.id=t2.id
--等效于（新版本hive支持in子句）
t1 where id in (select id from t2)
--hive中不支持不等值join

连接中不建议用or，join on t1.id=t2.id or t1.addr=t2.addr，只能将两个表进行笛卡尔积连接，然后对连接后的数据进行过滤，若使用需设置：

set hive.strict.checks.cartesian.product=false

7. 排序

7.1 全局排序order by

可选子句：asc|desc

select * from demo order by salary;

7.2 Task内部排序sort by

对全局结果集来说不是排序
需设置reduce个数

set mapreduce.job.reduces=3

示例

select * from demo sort by salary;

7.3 分桶排序查询distribute by+sort by

distribute by类似mr中的partition，进行分区，可以结合sort by使用
需设置reduce个数

set mapreduce.job.reduces=3

示例

select * from demo distribute by deptno sort by id;

7.4 分桶排序cluster by

当distribute by和sort by字段相同时，可以使用cluster by代替
cluster by的排序只能是升序排序

8. 分桶及抽样查询

8.1 分桶表定义

对Hive表分桶可以将表中的数据按分桶键的哈希值散列到多个文件中，这些文件称为桶
表分区是用不同的子文件夹管理不同的数据

表分桶是用不同的文件管理不同的数据
创建分桶表

create table demo(id int,name string,province string,age int)
clustered by (province) sorted by (age) into 5 buckets
row format delimited fields terminated by ',';

向分桶表导入数据

方式1：直接导入文件，不建议，直接导入的文件可能不符合分桶表的分桶规范，若导入需设置：

set hive.strict.checks.bucketing=false;

方式2：通过一个中转表，使用insert...select导入

set mapreduce.job.reduces=-1（或目标分桶表的桶数）

insert into demo

select * from demo2;

8.2 分桶表的意义

join更快，join相同列划分的桶的表，可以使用map-side join，而且能提高计算的并行度，更加高效
sampling抽样更高效方便，没分桶的话需要扫描整个数据集

8.3 分桶抽样查询

对于非常大的数据集，有时需要的是一个具有代表性的查询结果而不是全部结果
语法: tablesample(bucket x out of y)
- y决定抽样的比例，必须是table总bucket数的倍数或因子
  - 例如：总bucket数为4，y=2时，抽取4/2=2个bucket的数据；y=8时，抽取4/8=1/2个bucket的数据
- x表示从哪个bucket开始抽取，如果需要取多个分区，以后的分区号为当前分区号加y，x的值必须小于y
  - 例如：总bucket数为4，y为2，x为1，表示抽取第1（x）个及第3（x+y）个bucket的数据
- 示例:

select * from demo tablesample(bucket 1 out of 2 on id)

六. 函数

1. 帮助指令

查看系统自带函数

show functions;

查看自带函数用法

desc function upper;

查看自带函数详细用法

desc function extended upper;

2. 单行运算函数

2.1 类型转换函数

cast('135' as int)
cast('20160211' as date)
cast(current_timestamp as date)

2.2 数学运算函数

round(6.5) --6，四舍五入
round(6.1345,3) --6.135
ceil(6.5) --7，向上取整
floor(6.5) --5，向下取整
abs(-6.5) --6.5，绝对值
greatest(1,3,6) --6，最大值
least(1,3,6) --1，最小值

2.3 时间函数

date_format

date_format('2016-04-25','yyyyMMdd')

current_timestamp

获取当前时间戳

current_date

获取当前日期

from_unixtime

unix时间戳转字符串格式

form_unixtime(bigint unixtime[,string format])

from_unix(unix_timestamp())
from_unix(unix_timestamp(),'yyyy/MM/dd HH:mm:ss')

unix_timestamp

字符串格式转unix时间戳

unix_timestamp() --不带参数，取当前时间的秒数时间戳
unix-timestamp('2016-02-11 12:00:00')
unix-timestamp('2016/02/11 12:00:00','yyyy/MM/dd HH:mm:ss')

to_date

字符串转日期

to_date('2016-02-11 12:00:00') --2016-02-11

datediff

求日期差

datediff('2016-02-11','2019-08-18')

date_add

往后推n天

date_add('2016-02-11',3)

date_sub

往前推n天

date_sub('2016-02-11',3)

next_day

下一个星期几的日期

next_day('2016-02-11','monday')

weekofyear

指定日期在一年中是第几周

weekofyear('2016-02-11')

2.4 字符串函数

substr/substring

截取子串

substr(string str,int start)
substr(string str,intstart,int len)

concat

拼接

concat(string a,string b,...)
concat(string sep,string a,string b,...)

length

长度

length('135389')

split

切分

split('192.168.1.101','\\.') --.为正则特定字符

upper/lower

转大小写

upper(string str)
lower(string str)

lpad/rpad

在a左/右边填充pad，到长度len

lpad(string a,int len,string pad)
rpad(string a,int len,string pad)

get_json_object

解析json

--xjson字段:{"id":"1","name":"katara"}
select get_json_object(xjson,'$.id'),get_json_object(xjson,'$.name') from demo;

base64

将二进制格式转为base64编码

base64(binary('135389'))

2.5 条件分支函数case

case [expression]
	when condition1 then result1
	when condition2 then result2
	...
	when condition3 then result3
	else result
end

2.6 条件分支函数if

if(expression,result1,result2)

2.7 集合函数

array_contaions(Array<T>,value)

是否包含

sort_array(Array<T>)

返回排序后数组

size(Array<T>)

size(Map<K,V>)

map_keys(Map<K,V>)

返回所有key

map_values(Map<K,V>)

返回所有value

2.8 NULL相关函数

nvl(expr1,expr2) --如果第一个参数为空，则取第二个；如果第一个参数不为空，则取第一个
nvl2(expr1,expr2,expr3) --如果第一个参数为空，则取第二个；如果第一个参数不为空，则取第三个
coalesce(expr1,expr2...exprn) --取第一个不为空的值，都为空则为NULL
nullif(expr1,expr2) --如果两参数相等则返回NULL

3. 窗口分析函数

3.1 序号函数

row_number() over()

相同值序号不同

row_number() over (partition by col1 order by col2)

rank() over()

相同值的序号相同，下一个值的序号跳变

dense_rank() over()

相同值的序号相同，下一个值的序号连续

3.2 sum

sum() over()在窗口内，对指定的行进行滚动累加
行的运算范围：
- 往前：n/unbounded preceding
- 往后：n/unbounded following
- 当前：current row

sum(amount) over(partition by id order by mth rows between unbounded preceding and current row)
--不指定运算范围，但指定排序，默认为：最前->当前
sum(amount) over(partition by name order by mth)
--不指定运算范围，且不指定排序，默认为：最前->最后
sum(amount) over(partition by name)
--不指定运算范围，且不指定排序，且不指定窗口，则将整个表视为一个窗口
sum(amount) over()

3.3 count()

计数

3.4 lead()/lag()

--将下一行的某个字段输出
lead(amount,1) over(partition by name order by mth)
--将上一行的某个字段输出
lag(amount,1) over(partition by name order by mth)

3.5 first_value()/last_value()

窗口某个字段的第一个/最后一个

4. 分组聚合函数

sum() --总和
avg() --平均
max() --最大
min() --最小
count() --总行数
collection_list() --列转行，不去重
collection_set() --列转行，去重

5. 表生成函数

5.1 行转列函数explode()

select explode(subjects) as subject from demo;

5.2 表生成函数lateral view

相当于两表join，左表是原表，右表是explode（某个集合字段）之后产生的表，这个join只在同一行的数据间进行

select o.name,o.subject from
(select name,tmp.subject as subject
 from demo lateral view explode(subjects) tmp as subject) o

5.3 json解析函数

{"movie":"135","rate":"5","timeStamp":"958500750","uid":"1"}
create table demo
as
select json_tuple(xjson,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid)
from demo2;

6. 高阶聚合函数

with cube

数据立方体，可以实现所有维度的组合情况，是grouping sets的简化版

grouping sets

自由指定要计算的维度组合

with rollup

实现维度key存在级联关系的统计，维度key从右到左递减多级的统计，如省市区、省市、省

7. 自定义函数

依赖为org.apache.hive,hive-exec
继承org.apache.hadoop.hive.ql.UDF
实现evaluate函数，支持重载
在hive的命令行窗口创建函数
- 添加jar
  
  add jar /opt/module/datas/udfdemo.jar'
- 创建function
  
  create [temporary] function mf as 'com.avtar.func.MyFunction';
在hive的命令行窗口删除函数

drop [temporary] function [if exists] mf;
UDF必须有返回类型，可以返回null，但返回类型不能为void