Hive官方网站上有详细的语法说明,参考LanguageManual。
这里我把最常用的几块列出来。
HIVE DDL
Database
建库语句
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
关键字DATABASE和SCHEMA是一个意思。LOCATION指定了外部表的文件路径,MANAGEDLOCATION指定了管理表(普通表)的文件路径。要是搞不懂区别,往下看表部分的语句。
删库语句
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
删库默认是RESTRICT,如果里面有表的话,就不能删。使用CASCADE可以级联删除库里所有内容。
切换当前数据库
USE database_name;
--切换到默认库
USE DEFAULT;
切换当前库后,所有之后执行的命令都对这个库生效。
要想知道现在用的是哪个库,可执行 SELECT
Table操作
建表语句
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| JSONFILE -- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
: [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
: [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ]
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE
[, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
这个建表语句看起来非常复杂,前半部分是建表语句,后半部分是对数据类型和约束的描述。刚开始学可以不用关注太多复杂的东西,就把SQL里经常用的拿过来做试验即可。
管理表和外部表
管理表是Hive默认创建的表,外部表在Create后加上EXTERNAL关键字。它们的区别是,对于管理表,既管元数据,又管数据;对于外部表,只管元数据,不管数据。
| Managed Tables | External Tables |
ARCHIVE/UNARCHIVE | 支持 | 不支持 |
DROP | 删除数据和元数据 | 默认不删除数据,只删除元数据 |
ACID/Transactional | 支持 | 不支持 |
Query Results Caching | 支持 | 不支持 |
RELY constraint | 支持 | 支持 |
分区表
为了减少查询数据的量,可以按照查询条件对表做分区。分区表在HDFS的存储形式是按照分区字段创建存储目录,不同分区的数据放在不同的目录下。分区表在物理上按照分区字段独立开来。分区表有静态分区和动态分区两种形式,静态分区是人为给分区字段赋固定值,动态分区是根据分区字段的值自动分区。
删表语句
DROP TABLE [IF EXISTS] table_name [PURGE];
如果外部表设置了 external.table.purge=true,那么DROP的时候也会删除数据。
除了上面提到的库定义和表定义,DDL里还包含索引定义、视图定义、宏定义、函数定义和角色权限定义,请参考官网说明。
HIVE DML
Loading files into tables
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
HIVE导入数据文件的时候,并没有做任何转换。导入文件到管理表,只是把文件复制到HDFS的存储目录下。导入文件到外部表,连文件复制的操作都没有。
--例子
CREATE TABLE tab1 (col1 int, col2 int) PARTITIONED BY (col3 int) STORED AS ORC;
LOAD DATA LOCAL INPATH 'filepath' INTO TABLE tab1;
Inserting data into Hive Tables from queries
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
这里除了标准语法,还有HIVE的扩展语法,对于初学者来说,掌握标准语法就行。
--例子
INSERT INTO TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt FROM page_view_stg pvs
Inserting values into tables from SQL
Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
--例子
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
Update
Standard Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
这个语法跟普通SQL没什么两样
Delete
Standard Syntax:
DELETE FROM tablename [WHERE expression]
不要太简单
由于HIVE的数据存放在HDFS中,HDFS不支持数据的修改和删除。因此,在使用hive的过程中,一般不会产生删除和更新的操作,如果需要支持这些操作,必须修改hive-site.xml
//在hive的hive-site.xml中添加如下配置:
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>