1. Create a database
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path];
Note: Impala does not support the WITH DBPROPERTIES… clause, although Hive does. For example:
[cdh2:21000] > create database db_hive
WITH DBPROPERTIES('name' = 'ttt');
Query: create database db_hive
WITH DBPROPERTIES('name' = 'ttt')
ERROR: AnalysisException: Syntax error in line 2:
WITH DBPROPERTIES('name' = 'ttt')
Encountered: WITH
Expected: COMMENT, LOCATION
2. Query databases
2.1 Show databases
[cdh2:21000] > show databases;
[cdh2:21000] > show databases like 'hive*';
Query: show databases like 'hive*'
+---------+---------+
| name | comment |
+---------+---------+
| hive_db | |
+---------+---------+
[cdh2:21000] > desc database hive_db;
Query: describe database hive_db
+---------+----------+---------+
| name | location | comment |
+---------+----------+---------+
| hive_db | | |
+---------+----------+---------+
2.2 Drop a database
[cdh2:21000] > drop database hive_db;
[cdh2:21000] > drop database hive_db cascade;
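Notes:
Impala does not support the ALTER DATABASE statement in the version used here.
A database cannot be dropped while it is the current database selected by a USE statement; switch to another database first.
3. Create tables
3.1 Managed tables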
[cdh2:21000] >
create table if not exists student2(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';
[cdh2:21000] > desc formatted student2;
3.2 External tables
[cdh2:21000] >
create external table stu_external(
id int, name string)
row format delimited fields terminated by '\t' ;
4. Partitioned tables
4.1 Create a partitioned table
[cdh2:21000] >
create table stu_par(id int, name string)
partitioned by (month string)
row format delimited
fields terminated by '\t';
4.2 Load data into the table
[cdh2:21000] > alter table stu_par add partition (month='201810');
[cdh2:21000] > load data inpath '/student.txt' into table stu_par
partition(month='201810');
[cdh2:21000] > insert into table stu_par partition (month = '201811')
select * from student;
Note:
LOAD DATA cannot create a partition automatically; if the target partition does not exist, add it first.
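The note above can be turned into a small wrapper that always adds the partition before loading. This is only a sketch: the function name load_into_partition and the IMPALA_SHELL_BIN variable are hypothetical helpers (not part of Impala), and it assumes impala-shell is on the PATH and connects to the right impalad by default.

```shell
#!/usr/bin/env bash
# Hypothetical override point: set IMPALA_SHELL_BIN to another command for a dry run.
IMPALA_SHELL_BIN="${IMPALA_SHELL_BIN:-impala-shell}"

# Load an HDFS file into one partition of a table partitioned by `month`.
# The partition is added first (IF NOT EXISTS makes this safe to repeat),
# because LOAD DATA will not create it.
load_into_partition() {
  local table="$1" month="$2" hdfs_file="$3"
  "$IMPALA_SHELL_BIN" -q "alter table ${table} add if not exists partition (month='${month}')" &&
  "$IMPALA_SHELL_BIN" -q "load data inpath '${hdfs_file}' into table ${table} partition (month='${month}')"
}
```

For example, load_into_partition stu_par 201810 /student.txt issues the two statements one after the other and stops if the first one fails.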
4.3 Query data in a partitioned table
[cdh2:21000] > select * from stu_par where month = '201811';
4.4 Add multiple partitions
[cdh2:21000] > alter table stu_par add partition (month='201812')
partition (month='201813');
4.5 Drop a partition
[cdh2:21000] > alter table stu_par drop partition (month='201812');
4.6 Show partitions
[cdh2:21000] > show partitions stu_par;
5. Create views
-- Create a view
create view if not exists stu_view
as select name from student;
-- List views (they appear alongside tables)
show tables;
-- Query the view
select * from stu_view;
-- Alter the view
alter view stu_view as select id from student;
-- Drop the view
drop view stu_view;
6. Common SQL
6.1 INSERT statement
-- Create a table
create table person(id int, name string, age int);
-- Insert data
insert into person values(1,'A',18);
insert into person values(1,'A_1',20);
insert into person values(2,'B',29);
insert into person values(3,'C',16);
insert into person values(4,'D',40);
6.2 ORDER BY statement
select * from person order by age desc;
6.3 GROUP BY statement
insert into person values(1,'A',21);
select name,sum(age) from person group by name;
6.4 HAVING statement
select name,sum(age) from person
group by name having sum(age) >30;
6.5 LIMIT statement
select * from person order by id limit 3;
6.6 OFFSET statement
select * from person order by id limit 3 offset 1;
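6.7 UNION statement
-- Assumes stu_view exists as first created in section 5 (select name from student),
-- so that both sides of the UNION return a single string column.
select * from stu_view union select name from person;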
7. DML data operations
7.1 Importing data (largely the same as in Hive)
Note: Impala does not support load data local inpath…
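Because a local path cannot be loaded directly, one common workaround is to copy the file to HDFS first and then run LOAD DATA INPATH on the HDFS copy. The sketch below illustrates that two-step flow. The function name load_local_file, the variables HADOOP_BIN and IMPALA_SHELL_BIN, and the staging directory /tmp/impala_staging are all assumptions made for illustration; it also assumes the Impala service user can read and move files in the staging directory.

```shell
#!/usr/bin/env bash
# Hypothetical override points: set these to other commands for a dry run.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"
IMPALA_SHELL_BIN="${IMPALA_SHELL_BIN:-impala-shell}"

# Copy a local file to an HDFS staging directory, then load it into a table.
# LOAD DATA INPATH moves the staged file into the table's directory.
load_local_file() {
  local local_file="$1" table="$2" staging_dir="${3:-/tmp/impala_staging}"
  local hdfs_file
  hdfs_file="${staging_dir}/$(basename "$local_file")"
  "$HADOOP_BIN" fs -mkdir -p "$staging_dir" &&
  "$HADOOP_BIN" fs -put -f "$local_file" "$hdfs_file" &&
  "$IMPALA_SHELL_BIN" -q "load data inpath '${hdfs_file}' into table ${table}"
}
```

A call such as load_local_file ./student.txt student would stage the file and load it; the third argument can be used to pick a different staging directory.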
7.2 Exporting data
(1) Impala does not support exporting data with the insert overwrite… syntax.
(2) Data is usually exported from Impala with impala-shell -o:
[root@cdh2 ~]# impala-shell -q 'select * from student' -B \
--output_delimiter="\t" -o output.txt
[root@cdh2 ~]# cat output.txt
1001 tignitgn
1002 yuanyuan
1003 haohao
1004 yunyun
Impala does not support the export and import commands.
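8. Queries
(1) The basic syntax is largely the same as Hive's query syntax.
(2) Impala does not support CLUSTER BY, DISTRIBUTE BY, or SORT BY.
(3) Impala does not support bucketed tables.
(4) Impala does not support the COLLECT_SET(col) and explode(col) functions.
(5) Impala supports window (analytic) functions: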
[cdh2:21000] > select name,orderdate,cost,sum(cost)
over(partition by month(orderdate)) from business;
9. Functions
9.1 User-defined functions
(1) Create a Maven project named Hive.
(2) Add the dependency:
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
</dependencies>
(3) Create a class:
package com.itstar.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

// Hive-style UDF that lower-cases its input; null input yields null.
public class Lower extends UDF {
    public String evaluate(final String s) {
        if (s == null) {
            return null;
        }
        return s.toLowerCase();
    }
}
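(4) Package the class as a jar and upload it to the server as /opt/module/jars/udf.jar.
(5) Upload the jar to a directory on HDFS (the root directory here):
hadoop fs -put /opt/module/jars/udf.jar /
(6) Create the function; location must point at the jar on HDFS, and symbol is the fully qualified class name:
[cdh2:21000] > create function mylower(string) returns string location
'/udf.jar' symbol='com.itstar.hive.Lower';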
(7) Use the user-defined function:
[cdh2:21000] > select id,mylower(name) from student;
(8) List the user-defined functions with show functions:
[cdh2:21000] > show functions;
Query: show functions
+-------------+-----------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+-----------------+-------------+---------------+
| STRING | mylower(STRING) | JAVA | false |
+-------------+-----------------+-------------+---------------+
10. Storage and compression
Note: Impala does not support the ORC format.
(1) Create a Parquet table, insert data, and query it:
[cdh2:21000] >
create table student3(id int, name string)
row format delimited
fields terminated by '\t'
stored as PARQUET;
[cdh2:21000] > insert into table student3 values(1001,'zhangsan');
[cdh2:21000] > select * from student3;
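11. Optimization
(1) Deploy StateStore and Catalog together on one dedicated node so that they can communicate reliably.
(2) Tune the Impala Daemon memory limit (default 256M) and the number of StateStore worker threads to improve Impala's execution efficiency.
(3) SQL optimization: check the execution plan before running a query.
(4) Choose a suitable file format for storage to improve query efficiency.
(5) Avoid producing many small files. If another program produces small files, land that data in an intermediate table, then use insert…select… to write it from the intermediate table into the final table.
(6) Partition appropriately, sizing partitions according to the partition granularity.
(7) Use compute stats to collect table statistics. When the contents of a table or partition change significantly, recompute the statistics for that table or partition, because differences in row counts and in the number of distinct values can lead Impala to choose a different join order when the table is used in a query.
[cdh2:21000] > compute stats student;
Query: compute stats student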
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
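(8) Network I/O optimization:
a. Avoid sending the entire data set to the client.
b. Filter with conditions wherever possible.
c. Use the LIMIT clause.
d. When writing output to a file, avoid pretty-printed output.
e. Minimize full metadata refreshes.
(9) Use profile to output the low-level execution details, then tune the environment accordingly.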
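Several of the items above map onto single statements that can be scripted. The following is a rough sketch, not a tuning recipe. The function names and the IMPALA_SHELL_BIN variable are hypothetical, the table names in the usage note are examples, and impala-shell is assumed to be on the PATH. It uses EXPLAIN for item (3), INSERT … SELECT for item (5), COMPUTE STATS followed by SHOW TABLE STATS for item (7), a per-table REFRESH instead of a full metadata reload for item (8)e, and the impala-shell -p option to print the profile for item (9).

```shell
#!/usr/bin/env bash
# Hypothetical override point: set IMPALA_SHELL_BIN to another command for a dry run.
IMPALA_SHELL_BIN="${IMPALA_SHELL_BIN:-impala-shell}"

# (3) Show the execution plan without running the query.
explain_query() { "$IMPALA_SHELL_BIN" -q "explain $1"; }

# (5) Copy rows from an intermediate (staging) table into the final table.
copy_from_staging() { "$IMPALA_SHELL_BIN" -q "insert into $2 select * from $1"; }

# (7) Recompute statistics, then display them so the row counts can be checked.
refresh_stats() {
  "$IMPALA_SHELL_BIN" -q "compute stats $1" &&
  "$IMPALA_SHELL_BIN" -q "show table stats $1"
}

# (8)e Reload metadata for a single table rather than for everything.
refresh_table() { "$IMPALA_SHELL_BIN" -q "refresh $1"; }

# (9) Run a query and print its profile afterwards (-p is --show_profiles).
profile_query() { "$IMPALA_SHELL_BIN" -p -q "$1"; }
```

A typical sequence might be refresh_stats student, then explain_query 'select name from student' to confirm the plan, and profile_query with the same statement when the run time is still unexpected.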