一、Hive中“一次使用的命令”
1. -S表示静默模式,结果去掉OK和Time taken等行
hive -e "select * from movie_table limit 3"
...
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
Time taken: 11.631 seconds, Fetched: 3 row(s)
[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3"
...
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
2.将查询结果输出到本地文件test.txt中(非HDFS) test.txt是程序执行时自己创建的,无需提前创建
[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3" > /usr/local/src/test3/hive/test.txt
(py27tf) [root@master hive-1.2.2]# cat /usr/local/src/test3/hive/test.txt
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
通过以下命令查询管理表的warehouse属性记录(我也还没搞懂,再看看书P35)
[root@master hive-1.2.2]# hive -e "seT" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
3.文件执行hive查询 shell:source 终端使用hive -f 得到的结果相同
hive> source /usr/local/src/test3/select.sql;
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
Time taken: 2.976 seconds, Fetched: 3 row(s)
[root@master test3]# hive -f /usr/local/src/test3/select.sql
...
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
Time taken: 5.915 seconds, Fetched: 3 row(s)
表里写入数据
1.建表
create table test(line string);
OK
Time taken: 0.82 seconds
2.插数据
[root@master hive-1.2.2]# hive -e "load data local inpath '/usr/local/src/test3/test.txt' into table test"
或者
hive> load data local inpath '/usr/local/src/test3/test.txt' overwrite into table test;
3.查询
hive> select * from test;
OK
one row
Time taken: 0.4 seconds, Fetched: 1 row(s)
执行shell命令
在命令前加上! 并以;结尾
hive> ! pwd;
/usr/local/src/hive-1.2.2
hive> ! echo "what up dog";
"what up dog"
在hive内使用hadoop命令 hadoop dfs -ls / 只需要去掉hadoop即可
在hive内使用hadoop命令的优点:hadoop中每次都会启动一个新的JVM实例,而hive在同一个进程中执行
hive> dfs -ls / ;
Found 8 items
drwxr-xr-x - root supergroup 0 2019-03-04 09:41 /7
-rw-r--r-- 3 root supergroup 632207 2018-12-13 14:01 /The_Man_of_Property.txt
-rw-r--r-- 3 root supergroup 698 2019-05-13 18:32 /a8a
drwxr-xr-x - root supergroup 0 2019-05-27 10:29 /hbase
drwxr-xr-x - root supergroup 0 2019-05-27 14:47 /hbase_test
drwxr-xr-x - root supergroup 0 2019-05-26 16:49 /hive
drwx-wx-wx - root supergroup 0 2019-05-26 10:18 /tmp
drwxr-xr-x - root supergroup 0 2019-05-26 10:43 /user
Hive脚本注释 使用--开头的字符串表示注释
Tips:CLI不能解析注释,会产生报错信息,只能在脚本中通过hive -f **.hql 的方式执行
--copyright (c) 2012 Megacorp,LLC.
--This is the best Hive script evar!!
select * from table;
显示字段名称(默认是关闭的): (可在$HOME/.HIVRRC文件中配置修改为默认开启:set hive.cli.print.header=true;)
字段名称为:movie_table.movieid movie_table.title movie_table.genres
hive> select * from movie_table limit 2;
OK
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.159 seconds, Fetched: 2 row(s)
hive> set hive.cli.print.header=true;
hive> select * from movie_table limit 2;
OK
movie_table.movieid movie_table.title movie_table.genres
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.148 seconds, Fetched: 2 row(s)
Chapter 4 HiveQL:数据定义(创建、修改、删除数据库、表、视图、函数和索引)
1.Hive不支持行级插入操作、更新操作和删除操作。Hive也不支持事物。
2.Hive中数据库的概念本质上仅仅只是表的一个目录或者命名空间
3.创建数据库financials
hive> create database financials;
OK
Time taken: 0.114 seconds
hive> show databases;
OK
default
financials
Time taken: 0.036 seconds, Fetched: 2 row(s)
如果数据库financials已经存在,创建时就会报错,以下命令可以不抛出错误信息;
hive> create database if not exists financials;
用正则表达式匹配筛选数据库名(列举出所有f开头的数据库)
hive> show databases like 'f*';
OK
financials
Time taken: 0.051 seconds, Fetched: 1 row(s)
Hive为每一个数据库创建一个目录(目录名以*.db结尾),数据库中的表将会以这个数据库目录的子目录形式存储(default除外,因为default库本身没有自己的目录)
修改数据库目录:
1.配置文件修改
数据库的HDFS目录存储由hive-site.xml文件配置 : set hive.metastore.warehouse.dir=/user/hive/warehouse(这个目录是配置项默认配置,也可自己修改存储目录)
[root@master Programming_Hive]# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x - root supergroup 0 2019-05-28 10:49 /user/hive/warehouse/financials.db
drwxr-xr-x - root supergroup 0 2019-05-28 10:50 /user/hive/warehouse/human_resources.db
2.CLI修改数据库目录:(只修改当前数据库位置) 还可以在CLI增加数据库的描述
hive> create database change22
> comment 'holds all financial tables'
> location '/hive_test';
OK
Time taken: 0.043 seconds
hive> desc database change22;
OK
change22 holds all financial tables hdfs://master:9000/hive_test root USER
Time taken: 0.038 seconds, Fetched: 1 row(s)
为数据库增加何其相关的键—值对属性信息,
查询时使用语句:desc database extended change;
hive> create database change
> with dbproperties ("creator" = "Jason Chan","data"="2019-05-28");
OK
Time taken: 0.109 seconds
hive> desc database extended change;
OK
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=Jason Chan, data=2019-05-28}
Time taken: 0.02 seconds, Fetched: 1 row(s)
修改或者新增数据库属性(不能删除)
hive> desc database extended change;
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=Jason Chan, data=2019-05-28}
hive> alter database change set dbproperties ("creator"="jason");
hive> desc database extended change;
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=jason, data=2019-05-28}
hive> alter database change set dbproperties ("edited-by"="Joe");
hive> desc database extended change;
change hdfs://master:9000/user/hive/warehouse/change.db root USER {creator=jason, data=2019-05-28, edited-by=Joe}
使用某个数据库:use
hive> show databases;
OK
default
financials
Time taken: 0.033 seconds, Fetched: 2 row(s)
hive> use financials;
OK
Time taken: 0.065 seconds
查询当前使用的数据库(当前的表属于哪个数据库):set hive.cli.print.current.db=true;
hive> set hive.cli.print.current.db=true;
hive (financials)> use default;
OK
Time taken: 0.032 seconds
hive (default)> set hive.cli.print.current.db=false;
删除数据库:drop
hive> drop database if exists human_resources;
OK
Time taken: 0.078 seconds
Hive不允许删除有表的数据库
hive> use traffic;
OK
Time taken: 0.039 seconds
//traffic数据库下面有表
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.05 seconds, Fetched: 2 row(s)
//删除traffic数据库会报错
hive> drop database traffic;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)
解决方案:
1.先清空数据库下的表:
//删除表
hive> drop table monitor_camera_info;
OK
Time taken: 0.132 seconds
hive> drop table monitor_flow_action;
OK
Time taken: 0.236 seconds
//此时再删除数据库就可以了
hive> drop database traffic;
OK
Time taken: 0.108 seconds
2.使用关键字: cascade
//数据库traffic下有表
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.06 seconds, Fetched: 2 row(s)
//正常删库报错
hive> drop database if exists traffic;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)
//使用关键字restrict也不能删除数据库
hive> drop database if exists traffic restrict;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)
//使用关键字cascade可以正常删除
hive> drop database if exists traffic cascade;
OK
Time taken: 0.25 seconds
创建表(change数据库下创建employees表),在每个字段类型后面追加一个注释
TBLPROPERTIES作用:按键值对的格式为表增加额外的文档说明 (还可作为表示关于数据库连接的必要的元数据信息) P53
Hive会自动增加两个属性:
1.last_modified_by,保存最后修改这个表的用户的用户名
2.last_modified_time,保存着最后一次修改的新纪元时间秒
//建表sql语句
create table if not exists change.employees3 (
name string comment "employee name",
salary float comment "employee salary",
subordinates array<string> comment "names of subordinates",
deductions map<string,float>
comment "keys are deductions names, values are percentages",
address struct<street:string,city:string,state:string,zip:int>
comment "home address")
comment "description of the table"
TBLPROPERTIES ('creator'='me','created_at'='2019-05-28')
;
//查看结果
hive> desc employees3;
OK
name string employee name
salary float employee salary
subordinates array<string> names of subordinates
deductions map<string,float> keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> home address
Time taken: 0.116 seconds, Fetched: 5 row(s)
为表本身添加注释,自定义一个或多个表属性
怎么查看表的描述:comment "description of the table" ??? show tblproperties employees3;
hive> show tblproperties employees3;
OK
comment description of the table
created_at 2019-05-28
creator me
transient_lastDdlTime 1559027367
Time taken: 0.128 seconds, Fetched: 4 row(s)
用户还可以拷贝一张已经存在的表的模式(不拷贝数据): 在change数据库下创建表employees4(要求表模式与employees3相同)
创建employee4.3_2.sql文件输入下面语句,然后执行hive -f employee4.3_2.sql
[root@master Programming_Hive]# cat employee4.3_2.sql
create table if not exists change.employees4
like change.employees3;
[root@master Programming_Hive]# hive -f employee4.3_2.sql
查看表employees4的属性
hive> show tables;
OK
employees3
employees4
Time taken: 0.081 seconds, Fetched: 2 row(s)
hive> desc employees4;
OK
name string employee name
salary float employee salary
subordinates array<string> names of subordinates
deductions map<string,float> keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> home address
Time taken: 0.136 seconds, Fetched: 5 row(s)
在default数据库下查看change数据库的表 show tables in change;
hive> set hive.cli.print.current.db=true;
hive (default)> show tables in change;
OK
employees3
employees4
Time taken: 0.051 seconds, Fetched: 2 row(s)
hive (default)> set hive.cli.print.current.db=false;
hive>
使用正则表达式查询表: show tables 'empl*'; 查询empl开头的所有的表
hive> show tables "empl*";
OK
employees3
employees4
Time taken: 0.034 seconds, Fetched: 2 row(s)
在数据库default下查找数据库change的表的结构信息。desc extended change.employees3; 实际我们更倾向于使用FORMATTED,因为输出内容更详细,且可读性较好
hive (default)> desc extended change.employees3;
OK
name string employee name
salary float employee salary
subordinates array<string> names of subordinates
deductions map<string,float> keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> home address
Detailed Table Information Table(tableName:employees3, dbName:change, owner:root
...
location:hdfs://master:9000/user/hive/warehouse/change.db/employees3,
...
parameters:{creator=me, transient_lastDdlTime=1559027367, created_at=2019-05-28, comment=description of the table}, ...)
Time taken: 0.091 seconds, Fetched: 7 row(s)
查看表的某一列信息:salary列 (加不加extended结果是一样的)
hive (change)> describe employees3.salary;
OK
salary float from deserializer
Time taken: 0.117 seconds, Fetched: 1 row(s)
4.3.2外部表
创建外部表,读取HDFS目录:/hive/programming_hive/data/stocks下所有的文件
删除外表,并不会删掉这个表中的数据,只是删除描述表的元数据
[root@master Programming_Hive]# cat 4.3.2stock.sql create external table if not exists stocks (
`exchange` string,
`symbol` string,
`ymd` string,
`price_open` float,
`price_high` float,
`price_low` float,
`price_close` float,
`volume` int,
`price_adj_close` float)
row format delimited fields terminated by ','
location '/hive/programming_hive/data/stocks';
exchange 字段为hive保留字段,修改之后即可。如果非要用这个字段名,就用反引号(Tab上面的键)引起来:
严格来说:Hive是管理着管理表和外部表的的目录和文件,但是并没有对表具有完全的控制权限
查看表是管理表还是外部表:desc formatted movie_table
hive (default)> desc FORMATTED movie_table;
...
Table Type: EXTERNAL_TABLE
...
hive (default)> desc FORMATTED jason;
...
Table Type: MANAGED_TABLE
...
创建外部表,但是可以复制内部表的结构(不复制数据) employees5外部表,employees3内部表
如果employees为外部表,external可省略,复制的表依然为外部表
[root@master Programming_Hive]# cat 4.3.2external_table.sql
create table if not exists change.employees5
like change.employees3
location '/hive/programming_hive/test';
[root@master Programming_Hive]# hive -f 4.3.2external_table.sql
...
OK
Time taken: 1.939 seconds
查看结构employees3和employees5一样,但employees5表的类型为external
hive (change)> desc formatted employees5;
OK
# col_name data_type comment
name string employee name
salary float employee salary
subordinates array<string> names of subordinates
deductions map<string,float> keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> home address
...
Table Type: EXTERNAL_TABLE
4.4 分区表、管理表
如果表中的数据以及分区个数都非常大的话,执行一个包含所有分区的查询可能会触发一个巨大的MapReduce任务。
建议:将Hive设置为strict模式,(对分区表进行查询而where子句没有加分区过滤器,将会禁止提交这个任务)
hive> set hive.mapred.mode=strict;
hive> select e.name,e.salary from employees e limit 10;
FAILED:Error in semantic analysis: No partition predicate found for Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> select e.name,e.salary from employees e limit 10;
John Doe 10000.0
...
查看分区:show partitions
hive (default)> show partitions rating_table_p;
OK
dt=2008-03
dt=2008-08
Time taken: 0.136 seconds, Fetched: 2 row(s)
查看分区键的方法:desc extended rating_table_p; partitionKeys分区键
hive (default)> desc extended rating_table_p;
OK
userid string
movieid string
rating string
dt string
# Partition Information
# col_name data_type comment
dt string
partitionKeys:[FieldSchema(name:dt, type:string, comment:null)]
分区:
1.建表的时候创建分区
2.加载数据的时候创建分区
4.4.1