The official Hive website has detailed syntax documentation; see the LanguageManual.

Here I list the most commonly used parts.

HIVE DDL

Database

Create a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [MANAGEDLOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

The keywords DATABASE and SCHEMA mean the same thing. LOCATION sets the storage path for the database's external tables, while MANAGEDLOCATION sets the storage path for its managed (ordinary) tables. If the distinction is unclear, read on to the table section below.
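For a quick experiment, here is a minimal sketch; the database name test_db, the comment, and the two HDFS paths are made-up values (note that MANAGEDLOCATION requires Hive 4.0.0 or later):

CREATE DATABASE IF NOT EXISTS test_db
  COMMENT 'demo database'
  LOCATION '/warehouse/external/test_db.db'        -- external tables go here
  MANAGEDLOCATION '/warehouse/managed/test_db.db'  -- managed tables go here
  WITH DBPROPERTIES ('owner'='demo');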

Drop a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Dropping a database defaults to RESTRICT: if it still contains tables, the drop fails. Use CASCADE to drop the database together with everything inside it.
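For example, using the hypothetical test_db from above:

-- fails under the default RESTRICT behavior if test_db still contains tables
DROP DATABASE IF EXISTS test_db;
-- drops test_db along with all of its tables
DROP DATABASE IF EXISTS test_db CASCADE;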

Switch the current database
USE database_name;
--switch to the default database
USE DEFAULT;

After switching, all subsequent statements are executed against that database.

To see which database is currently in use, run SELECT current_database();

Table operations

Create a table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
 
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
 
array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
 
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
 
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
 
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
 
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

This CREATE TABLE statement looks quite complex. The first half is the table definition itself; the second half describes the data types and constraints. When you are just getting started, don't worry about every option: take the clauses you already use in everyday SQL and experiment with those.
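As a starting point, here is a minimal sketch that uses only the most common clauses; the table and column names are assumptions (chosen to line up with the page_view_stg table referenced in the DML examples later on):

CREATE TABLE IF NOT EXISTS page_view_stg (
  viewTime     BIGINT  COMMENT 'view timestamp',
  userid       BIGINT  COMMENT 'user id',
  page_url     STRING,
  referrer_url STRING,
  ip           STRING,
  cnt          INT
)
COMMENT 'staging table for page views'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;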

Managed tables and external tables

A managed table is what Hive creates by default; an external table is created by adding the EXTERNAL keyword after CREATE. The difference: for a managed table, Hive manages both the metadata and the data; for an external table, Hive manages only the metadata, not the data. A comparison table and a small example follow.

 

                        Managed Tables               External Tables
ARCHIVE/UNARCHIVE       Supported                    Not supported
DROP                    Deletes data and metadata    By default deletes only metadata, not data
ACID/Transactional      Supported                    Not supported
Query Results Caching   Supported                    Not supported
RELY constraint         Supported                    Supported
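As a small illustration, here is a hedged sketch of an external table over pre-existing HDFS files; the table name, columns, and path are assumptions:

CREATE EXTERNAL TABLE IF NOT EXISTS page_view_ext (
  viewTime BIGINT,
  userid   BIGINT,
  page_url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/page_view_ext';

-- Dropping it removes only the metadata; the files under /data/page_view_ext stay on HDFS.
DROP TABLE page_view_ext;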

Partitioned tables

To cut down the amount of data a query has to scan, a table can be partitioned on the columns used as query conditions. On HDFS, a partitioned table is stored as one directory per partition value, so data from different partitions lives in different directories; physically, the table is split apart by the partition columns. Partitioning comes in two forms: with static partitioning you assign a fixed value to the partition column yourself, while with dynamic partitioning rows are routed to partitions automatically based on the value of the partition column.
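A hedged sketch of both styles; the table and column names here are made up, and dynamic-partition inserts typically require hive.exec.dynamic.partition.mode=nonstrict:

CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING);

-- static partition: the partition value is spelled out explicitly
INSERT INTO TABLE logs PARTITION (dt='2008-06-08')
  SELECT msg FROM logs_stg WHERE log_date = '2008-06-08';

-- dynamic partition: the last SELECT column supplies the partition value
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE logs PARTITION (dt)
  SELECT msg, log_date FROM logs_stg;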

 

Drop a table
DROP TABLE [IF EXISTS] table_name [PURGE];
If an external table has the table property external.table.purge=true set, DROP also deletes the data.
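For example, a sketch reusing the hypothetical page_view_ext table from above:

ALTER TABLE page_view_ext SET TBLPROPERTIES ('external.table.purge'='true');
-- with the property set, this also deletes the files under the table's LOCATION
DROP TABLE page_view_ext;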

Besides the database and table definitions covered above, DDL also includes index, view, macro, function, and role/privilege definitions; see the official documentation for those.

 

HIVE DML

Loading files into tables

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
Hive does not transform the data in any way when loading files. Loading a file into a managed table simply copies (or moves) the file into the table's storage directory on HDFS; for an external table whose LOCATION already points at the data, not even a copy is needed.

--example
CREATE TABLE tab1 (col1 int, col2 int) PARTITIONED BY (col3 int) STORED AS ORC;
LOAD DATA LOCAL INPATH 'filepath' INTO TABLE tab1;

 

Inserting data into Hive Tables from queries

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Besides the standard syntax shown here, Hive also has extended syntax (multi-table inserts and dynamic-partition inserts); for a beginner, the standard syntax is enough. A small sketch of the multi-insert extension follows the example below.

--example
INSERT INTO TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt FROM page_view_stg pvs
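And a hedged sketch of the multi-table-insert extension mentioned above; it assumes the same page_view and page_view_stg tables as the example, and the output directory is made up:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country)
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
  WHERE pvs.userid IS NOT NULL
INSERT OVERWRITE DIRECTORY '/tmp/page_view_null_users'
  SELECT pvs.*
  WHERE pvs.userid IS NULL;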

 

Inserting values into tables from SQL

Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

--example
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
 
INSERT INTO TABLE students
  VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

 

Update

Standard Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
This syntax is no different from ordinary SQL; like DELETE below, it only works on transactional (ACID) tables.
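A hedged sketch using the students table from the previous example, assuming it was created as a transactional table (see the configuration and the full example under Delete below):

UPDATE students SET gpa = 3.00 WHERE name = 'fred flintstone';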

 

Delete

Standard Syntax:
DELETE FROM tablename [WHERE expression]
Couldn't be simpler.

Because Hive stores its data on HDFS, and HDFS files cannot be modified in place, UPDATE and DELETE are rarely used in day-to-day Hive work. If you do need these operations, you must enable transaction support in hive-site.xml (and the target table must be a transactional table).

<!-- Add the following configuration to hive-site.xml: -->
    <property>
        <name>hive.support.concurrency</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.enforce.bucketing</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
    </property>
    <property>
        <name>hive.txn.manager</name>
        <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    </property>
    <property>
        <name>hive.compactor.initiator.on</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.compactor.worker.threads</name>
        <value>1</value>
    </property>
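With these settings in place, here is a hedged end-to-end sketch; the table name and values are assumptions, and in Hive 2.x a transactional table must be bucketed and stored as ORC:

CREATE TABLE students_acid (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
  CLUSTERED BY (name) INTO 2 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE students_acid VALUES ('fred flintstone', 35, 1.28);
UPDATE students_acid SET gpa = 3.00 WHERE name = 'fred flintstone';
DELETE FROM students_acid WHERE name = 'fred flintstone';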