Hive in Hadoop

Reposted

mob6454cc7042a2 2023-07-12 11:15:36


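Hive in Detail

I. Introduction to Hive
II. Hive overall architecture

Hive architecture diagram
Basic components of Hive

III. Hive characteristics

Advantages
Disadvantages

IV. Basic Hive syntax

1. Hive DDL syntax
2. Hive DML syntax

V. Summary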

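I. Introduction to Hive

What is Hive?
"Lazy people change the world" is a belief I have always held. This is not a call to be idle; it means thinking like a lazy person: doing everything you can to cut your own workload, eliminate repetitive work, and raise productivity. Before Hive, a programmer had to understand Hadoop in detail, write complex MapReduce programs (which are hard to develop), and master how MapReduce runs. For a newcomer, both the learning cost and the cost of using Hadoop were very high. For these reasons, clever, lazy programmers created Hive.
Put simply, Hive is a data warehouse tool. On top of Hadoop it provides two core functions:
1. It maps structured data files in HDFS to a table similar to a relational database table; each table corresponds to a directory of files in HDFS.
2. It provides a SQL-like syntax for accessing the data in those tables, and automatically parses and translates the SQL into MapReduce jobs that are submitted to the Hadoop cluster for execution.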
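
As a rough illustration of these two functions, the sketch below maps tab-delimited files in HDFS to a table and then queries them with SQL. The table name page_views, its columns, and the path /data/page_views are assumptions made up for this example.

```sql
-- Hypothetical table: tab-delimited text files stored under /data/page_views
create external table if not exists page_views (
    user_id   string,
    url       string,
    view_time string
)
row format delimited
fields terminated by '\t'
stored as textfile
location '/data/page_views';

-- Hive compiles this query into one or more jobs and runs them on the cluster
select url, count(*) as views
from page_views
group by url;
```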

II. Hive overall architecture

The Hive architecture is shown below:

[Figure: Hive architecture diagram]

Basic components of Hive

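1. Client. The Hive client provides three main interfaces: CLI, JDBC/ODBC, and WebGUI. The CLI is the hive shell command line; JDBC/ODBC lets programs connect to Hive through standard drivers, much like JDBC for a traditional database (the JDBC driver is the Java implementation); WebGUI provides access to Hive through a browser.

2. Metastore. This is the metadata store: Hive keeps its metadata in a relational database. The metadata includes table names, each table's columns and partitions and their properties, table properties (such as whether the table is external), and the directory where the table's data lives.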
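
The metadata described above can be inspected from the Hive shell. A minimal sketch, using the student and system_logs tables that are created later in this article:

```sql
-- Print columns, table type (managed or external), data location, and other properties
describe formatted student;

-- List the partitions recorded in the metastore for a partitioned table
show partitions system_logs;
```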

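3. Driver. The driver comprises the interpreter, compiler, and optimizer. It takes a SQL statement through lexical analysis, syntax analysis, compilation, and optimization to produce an execution plan, and finally generates MapReduce jobs that are submitted to Hadoop for execution.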
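
To look at the plan the Driver produces, a query can be prefixed with explain. The sketch below uses the student table that is created later in this article; the plan text itself depends on the Hive version and execution engine, so no output is shown.

```sql
-- Show the stages Hive would run for this query, without executing it
explain
select grade, count(*) as cnt
from student
group by grade;
```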

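For what happens once the job reaches Hadoop, see my earlier article 《Hadoop基础简介》 (An Introduction to Hadoop Basics), which covers it in detail.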

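III. Hive characteristics

Advantages

1. High throughput: suited to batch processing of massive data sets.
2. Simple syntax similar to SQL, so the learning cost is low; it avoids writing complex MapReduce programs and shortens the development cycle.
3. Strong scalability: the cluster can be scaled freely, generally without restarting the service.
4. Good extensibility: Hive supports user-defined functions, so users can define functions to fit their own needs.
5. Good fault tolerance: if a node fails, the SQL statement can still complete successfully.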
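
Point 4 can be sketched as follows. The jar path and the class name com.example.hive.udf.MaskName are hypothetical; a real user-defined function would be a Java class that you write, package into a jar, and register like this.

```sql
-- Hypothetical jar and class, for illustration only
add jar /tmp/my-hive-udfs.jar;
create temporary function mask_name as 'com.example.hive.udf.MaskName';

-- Once registered, the function is called like a built-in one
select stu_id, mask_name(name) from student limit 10;
```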

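Disadvantages

1. Hive's SQL has limited expressive power.
(1) Iterative algorithms cannot be expressed (that is, some requirements cannot be met with a single SQL statement and need one MapReduce job feeding into another).
(2) It is not well suited to data mining.

2. Hive is relatively inefficient.
(1) The MapReduce jobs Hive generates automatically are usually not very well optimized.
(2) Hive is hard to tune, and the tuning granularity is coarse.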

IV. Basic Hive syntax

1. Hive DDL syntax

Database DDL
1) Create Database

create (database|schema) [if not exists] database_name
  [comment database_comment]
  [location hdfs_path]
  [managedlocation hdfs_path]
[with dbproperties (property_name=property_value, ...)];

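Notes
comment: adds a remark to the database.
location: sets the storage path for the database's tables. Before Hive 4 this directory holds both managed and external tables; in Hive 4+ it holds external tables only.
managedlocation: the directory for managed (internal) tables in Hive 4+.
with dbproperties: attaches descriptive key-value properties to the database.

Example: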

create database if not exists test 
comment 'my test db'
location '/myhive/myoutdb'
managedlocation '/myhive/myindb'
with dbproperties ('creator'='ypc','date'='2021-03-09');

2) Drop Database

drop (database|schema) [if exists] database_name [restrict|cascade];
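
A short sketch of the two modes, using the test database from the example above:

```sql
-- restrict is the default: the statement fails if the database still contains tables
drop database if exists test restrict;

-- cascade drops the tables in the database first, then the database itself
drop database if exists test cascade;
```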

3) Alter Database

alter (database|schema) database_name set dbproperties (property_name=property_value, ...);   -- (note: schema added in hive 0.14.0)

alter (database|schema) database_name set owner [user|role] user_or_role;   -- (note: hive 0.13.0 and later; schema added in hive 0.14.0)

alter (database|schema) database_name set location hdfs_path; -- (note: hive 2.2.1, 2.4.0 and later)

alter (database|schema) database_name set managedlocation hdfs_path; -- (note: hive 4.0.0 and later)
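
For instance, against the test database created earlier. The property reviewed and the user name hadoop are placeholders chosen for this sketch.

```sql
-- Add or update key-value properties on the database
alter database test set dbproperties ('creator'='ypc', 'reviewed'='true');

-- Change the owner to a user (use "role" instead of "user" to assign a role)
alter database test set owner user hadoop;
```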

4) use database

use database_name;
use default;

Table DDL
1) Create Table

create [temporary] [external] table [if not exists] [db_name.]table_name    -- (note: temporary available in hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [comment col_comment], ... [constraint_specification])]
  [comment table_comment]
  [partitioned by (col_name data_type [comment col_comment], ...)]
  [clustered by (col_name, col_name, ...) [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
  [skewed by (col_name, col_name, ...)                  -- (note: available in hive 0.10.0 and later)
     on ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [stored as directories]]
  [
   [row format row_format] 
   [stored as file_format]
     | stored by 'storage.handler.class.name' [with serdeproperties (...)]  -- (note: available in hive 0.6.0 and later)
  ]
  [location hdfs_path]
  [tblproperties (property_name=property_value, ...)]   -- (note: available in hive 0.6.0 and later)
  [as select_statement];   -- (note: available in hive 0.5.0 and later; not supported for external tables)
 
create [temporary] [external] table [if not exists] [db_name.]table_name
  like existing_table_or_view_name
  [location hdfs_path];
 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (note: available in hive 0.7.0 and later)
 
primitive_type
  : tinyint
  | smallint
  | int
  | bigint
  | boolean
  | float
  | double
  | double precision -- (note: available in hive 2.2.0 and later)
  | string
  | binary      -- (note: available in hive 0.8.0 and later)
  | timestamp   -- (note: available in hive 0.8.0 and later)
  | decimal     -- (note: available in hive 0.11.0 and later)
  | decimal(precision, scale)  -- (note: available in hive 0.13.0 and later)
  | date        -- (note: available in hive 0.12.0 and later)
  | varchar     -- (note: available in hive 0.12.0 and later)
  | char        -- (note: available in hive 0.13.0 and later)
 
array_type
  : array < data_type >
 
map_type
  : map < primitive_type, data_type >
 
struct_type
  : struct < col_name : data_type [comment col_comment], ...>
 
union_type
   : uniontype < data_type, data_type, ... >  -- (note: available in hive 0.7.0 and later)
 
row_format
  : delimited [fields terminated by char [escaped by char]] [collection items terminated by char]
        [map keys terminated by char] [lines terminated by char]
        [null defined as char]   -- (note: available in hive 0.13 and later)
  | serde serde_name [with serdeproperties (property_name=property_value, property_name=property_value, ...)]
 
file_format
  : sequencefile
  | textfile    -- (default, depending on hive.default.fileformat configuration)
  | rcfile      -- (note: available in hive 0.6.0 and later)
  | orc         -- (note: available in hive 0.11.0 and later)
  | parquet     -- (note: available in hive 0.13.0 and later)
  | avro        -- (note: available in hive 0.14.0 and later)
  | jsonfile    -- (note: available in hive 4.0.0 and later)
  | inputformat input_format_classname outputformat output_format_classname
 
column_constraint_specification:
  : [ primary key|unique|not null|default [default_value]|check  [check_expression] enable|disable novalidate rely/norely ]
 
default_value:
  : [ literal|current_user()|current_date()|current_timestamp()|null ] 
 
constraint_specification:
  : [, primary key (col_name, ...) disable novalidate rely/norely ]
    [, constraint constraint_name foreign key (col_name, ...) references table_name(col_name, ...) disable novalidate ]
    [, constraint constraint_name unique (col_name, ...) disable novalidate rely/norely ]
    [, constraint constraint_name check [check_expression] enable|disable novalidate rely/norely ]

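Notes on a few key points
a) temporary creates a temporary table that exists only for the current session.
b) external: a table created with this keyword is an external table; without it, the table is a managed (internal) table.
Differences between managed and external tables:
(1) Concept
Hive manages a managed table's data itself: dropping the table deletes both the data and the metadata.
An external table only associates the table with the data in an HDFS directory: dropping it deletes only the metadata, and the underlying data is left in place.
(2) Typical use
External tables are generally used for raw or shared data; managed tables are generally used for a module's intermediate results.
(3) Storage directory
External table: you usually specify the data directory explicitly at creation time, pointing at the shared directory, using the location keyword.
Managed table: no strict requirement; the default directory is usually used.
c) partitioned by specifies the partition columns:
partitioned by (partition_column_name partition_column_type COMMENT column_description)

Note: a partition column must not also appear in the table's column list.

d) [row format row_format] specifies the delimiters:
fields terminated by: the column (field) delimiter
lines terminated by: the row delimiter
map keys terminated by: the delimiter between a key and its value in a map
e) [stored as file_format] specifies the storage format of the data:
textfile: plain text, the default
rcfile: a row-columnar format. Data is split horizontally into row groups so that a whole row stays within one block, and within each row group the data is stored column by column.
sequencefile: a binary storage format

Examples:
Create a managed table: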

create table if not exists student(
    stu_id string comment "id",
    name string comment "name",
    grade string comment "grade",
    class string comment "class"
)COMMENT "student table"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;

Create a partitioned table:

create table if not exists system_logs(
    id           string comment "id",
    data         string comment "data",
    status       string comment "status code",
    log_info     string comment "log message",
    created_by   string comment 'created by',
    created_time string comment 'creation time'
)
COMMENT "system log table"
partitioned by (day string comment "format: yyyy-MM-dd")
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;

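Create an external table:

-- date is a reserved word in recent Hive versions, so that column name is quoted with backticks
create external table if not exists test_external(
	id int,
	sex string,
	age int,
	`date` string,
	role string,
	region string
) 
COMMENT "external test table"
row format delimited 
fields terminated by '\001' 
stored as textfile location '/user/hdfs/source/hive_test';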
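
The practical difference between the two table types shows up when a table is dropped. A minimal sketch using the student and test_external tables defined above:

```sql
-- student is a managed table: both its metadata and its data files are deleted
drop table if exists student;

-- test_external is an external table: only the metadata is deleted;
-- the files under /user/hdfs/source/hive_test stay on HDFS
drop table if exists test_external;
```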

Create an Elasticsearch (ES) external table:

set es.net.http.auth.user=elastic;
set es.net.http.auth.pass=123456789;
-- add the elasticsearch-hadoop connector jar
add jar hdfs:///user/hdfs/source/lib/elasticsearch-hadoop-7.5.1.jar;
-- describe is a reserved word in Hive, so that column name is quoted with backticks
create external table if not exists es_external_test
(
    id string comment "id",
    name string comment "name",
    code string comment "code",
    `describe` string comment "description",
    day string comment 'date, yyyy-MM-dd'
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='idx_es_external_test_log-{day|yyyy-MM}/_doc',
'es.index.auto.create'='true',
'es.batch.size.bytes'='10mb',
'es.batch.size.entries'='0',
'es.batch.write.refresh'='true',
'es.batch.write.retry.count'='10000',
'es.batch.write.retry.wait'='30s',
'es.write.operation'='index',
'es.nodes' = '192.168.0.10:9200, 192.168.0.12:9200',
'es.index.read.missing.as.empty'='true'
);

2) Alter Table

alter table tablename rename to newname;
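
Renaming is only one of the alter table operations. A few other common ones are sketched below against the student and system_logs tables from the earlier examples; the email column and the partition value are made up for illustration.

```sql
-- Append a new column after the existing columns
alter table student add columns (email string comment 'email address');

-- Rename a column (the type must be restated, and may be changed)
alter table student change column name full_name string;

-- Add a partition, then drop it again
alter table system_logs add if not exists partition (day='2021-03-09');
alter table system_logs drop if exists partition (day='2021-03-09');
```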

3) Truncating a table or partition

truncate table tablename; -- remove all rows from the table
truncate table tablename partition(name=value); -- remove the rows in one partition
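
A concrete instance, using the partitioned system_logs table from above (the partition value is illustrative). Note that truncate is intended for managed tables; on an external table Hive normally rejects it.

```sql
-- Remove the rows of a single day while keeping the table and its other partitions
truncate table system_logs partition (day='2021-03-09');
```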

4) Drop Table

drop table if exists tablename;

2. Hive DML syntax

1) Loading data into a table (load)

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde']   -- (note: Hive 3.0 or later)

Example:

load data local inpath '/home/hadoop/data/emp.txt' into table test;
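
Two further variants, sketched with hypothetical file paths: loading into a specific partition with overwrite, and loading from HDFS instead of the local file system.

```sql
-- With local, the file is copied from the client machine; overwrite replaces
-- whatever is already in the target partition
load data local inpath '/home/hadoop/data/logs_2021-03-09.txt'
overwrite into table system_logs partition (day='2021-03-09');

-- Without local, the path refers to HDFS and the file is moved into the table directory
load data inpath '/tmp/staging/emp.txt' into table test;
```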

2) Inserting data

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
 
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
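
The multiple-insert form scans the source once and writes several targets. A minimal sketch, assuming two hypothetical target tables logs_ok and logs_error that each have three string columns:

```sql
-- One pass over system_logs feeds both target tables
from system_logs
insert overwrite table logs_ok
    select id, data, log_info where day = '2021-03-09' and status = '200'
insert overwrite table logs_error
    select id, data, log_info where day = '2021-03-09' and status <> '200';
```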

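Notes
insert into table test select …: appends rows to the existing data.
insert overwrite table test select …: replaces the existing data in the table (or in the target partition).

Inserting into a partitioned table (dynamic partition inserts need these extra settings):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- day is the partition column; it must come last in the select list
insert into table test partition (day) select …, day from …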
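
A fuller version of the same pattern, written against the system_logs table from the DDL section. The source table system_logs_staging is hypothetical and is assumed to hold the same columns plus a day column.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hive creates one partition per distinct day value found in the select result
insert into table system_logs partition (day)
select id, data, status, log_info, created_by, created_time, day
from system_logs_staging;

-- Static alternative: the partition value is fixed in the statement itself
insert overwrite table system_logs partition (day='2021-03-09')
select id, data, status, log_info, created_by, created_time
from system_logs_staging
where day = '2021-03-09';
```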

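3) Select queries
Queries are the most important and most frequently used statements. The Hive SQL select syntax is close to the SQL you already know and supports most standard SQL query syntax, so it is not covered in detail here.

One point worth noting: queries that involve aggregation run as MapReduce jobs, so on a cluster that uses scheduler queues you need to set the queue first: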

set mapred.job.queue.name=root.queue_test;
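
For example, an aggregation over the system_logs table might look like the sketch below. The queue name is the one from the line above and the partition value is illustrative; both depend on your own cluster and data.

```sql
set mapred.job.queue.name=root.queue_test;

-- Count log rows per status code for one day, most frequent first
select status, count(*) as cnt
from system_logs
where day = '2021-03-09'
group by status
order by cnt desc
limit 10;
```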

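V. Summary

Hive greatly reduces the learning cost for us programmers, shortens development cycles, and raises productivity. Even a big-data beginner does not need detailed knowledge of Hadoop's internals: knowing familiar SQL is enough to use Hive well, and through it Hadoop, to meet business requirements.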
