hive编程指南读后感1000 hive 编程指南

转载

jowvid 2023-08-11 16:45:26

文章标签 hive编程指南读后感1000 hive Time 数据库 文章分类 Hive 大数据

一、Hive中“一次使用的命令”

1. -S表示静默模式，结果去掉OK和Time taken等行

hive -e "select * from movie_table limit 3"
...
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
Time taken: 11.631 seconds, Fetched: 3 row(s)


[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3" 
...
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy

2.将查询结果输出到本地文件test.txt中（非HDFS） test.txt是程序执行时自己创建的，无需提前创建

[root@master hive-1.2.2]# hive -S -e "select * from movie_table limit 3" > /usr/local/src/test3/hive/test.txt

(py27tf) [root@master hive-1.2.2]# cat /usr/local/src/test3/hive/test.txt
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy

通过以下命令查询管理表的warehouse属性记录（我也还没搞懂，再看看书P35）

[root@master hive-1.2.2]# hive  -e "seT" | grep warehouse

hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true

3.文件执行hive查询 shell：source 终端使用hive -f 得到的结果相同

hive>  source /usr/local/src/test3/select.sql;
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
Time taken: 2.976 seconds, Fetched: 3 row(s)


[root@master test3]# hive -f /usr/local/src/test3/select.sql 
...
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
2       Jumanji (1995)  Adventure|Children|Fantasy
Time taken: 5.915 seconds, Fetched: 3 row(s)

表里写入数据

1.建表
create table test(line string);
OK
Time taken: 0.82 seconds
2.插数据
[root@master hive-1.2.2]# hive -e "load data local inpath '/usr/local/src/test3/test.txt' into table test"
或者
hive> load data local inpath '/usr/local/src/test3/test.txt' overwrite into table test;

3.查询
hive> select * from test;
OK
one row
Time taken: 0.4 seconds, Fetched: 1 row(s)

执行shell命令

在命令前加上！并以；结尾

hive> ! pwd;
/usr/local/src/hive-1.2.2
hive> ! echo "what up dog";
"what up dog"

在hive内使用hadoop命令 hadoop dfs -ls / 只需要去掉hadoop即可

在hive内使用hadoop命令的优点：hadoop中每次都会启动一个新的JVM实例，而hive在同一个进程中执行

hive> dfs -ls / ;
Found 8 items
drwxr-xr-x   - root supergroup          0 2019-03-04 09:41 /7
-rw-r--r--   3 root supergroup     632207 2018-12-13 14:01 /The_Man_of_Property.txt
-rw-r--r--   3 root supergroup        698 2019-05-13 18:32 /a8a
drwxr-xr-x   - root supergroup          0 2019-05-27 10:29 /hbase
drwxr-xr-x   - root supergroup          0 2019-05-27 14:47 /hbase_test
drwxr-xr-x   - root supergroup          0 2019-05-26 16:49 /hive
drwx-wx-wx   - root supergroup          0 2019-05-26 10:18 /tmp
drwxr-xr-x   - root supergroup          0 2019-05-26 10:43 /user

Hive脚本注释使用--开头的字符串表示注释

Tips：CLI不能解析注释，会产生报错信息，只能在脚本中通过hive -f **.hql 的方式执行

--copyright (c) 2012 Megacorp,LLC.
--This is the best Hive script evar!!

select * from table;

显示字段名称（默认是关闭的）：（可在$HOME/.HIVRRC文件中配置修改为默认开启：set hive.cli.print.header=true;）

字段名称为：movie_table.movieid movie_table.title movie_table.genres

hive> select * from movie_table limit 2;
OK
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.159 seconds, Fetched: 2 row(s)

hive> set hive.cli.print.header=true;

hive> select * from movie_table limit 2;
OK
movie_table.movieid     movie_table.title       movie_table.genres
movieId title   genres
1       Toy Story (1995)        Adventure|Animation|Children|Comedy|Fantasy
Time taken: 0.148 seconds, Fetched: 2 row(s)

Chapter 4 HiveQL：数据定义（创建、修改、删除数据库、表、视图、函数和索引）

1.Hive不支持行级插入操作、更新操作和删除操作。Hive也不支持事物。

2.Hive中数据库的概念本质上仅仅只是表的一个目录或者命名空间

3.创建数据库financials

hive> create database financials;
OK
Time taken: 0.114 seconds

hive> show databases;
OK
default
financials
Time taken: 0.036 seconds, Fetched: 2 row(s)

如果数据库financials已经存在，创建时就会报错，以下命令可以不抛出错误信息；

hive> create database if not exists financials;

用正则表达式匹配筛选数据库名（列举出所有f开头的数据库）

hive> show databases like 'f*';
OK
financials
Time taken: 0.051 seconds, Fetched: 1 row(s)

Hive为每一个数据库创建一个目录（目录名以*.db结尾），数据库中的表将会以这个数据库目录的子目录形式存储（default除外，因为default库本身没有自己的目录）

修改数据库目录：

1.配置文件修改

数据库的HDFS目录存储由hive-site.xml文件配置： set hive.metastore.warehouse.dir=/user/hive/warehouse（这个目录是配置项默认配置，也可自己修改存储目录）

[root@master Programming_Hive]# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-05-28 10:49 /user/hive/warehouse/financials.db
drwxr-xr-x   - root supergroup          0 2019-05-28 10:50 /user/hive/warehouse/human_resources.db

2.CLI修改数据库目录：（只修改当前数据库位置）还可以在CLI增加数据库的描述

hive> create database change22
    > comment 'holds all financial tables'
    > location '/hive_test';
OK
Time taken: 0.043 seconds

hive> desc database change22;
OK
change22        holds all financial tables      hdfs://master:9000/hive_test    root    USER
Time taken: 0.038 seconds, Fetched: 1 row(s)

为数据库增加何其相关的键—值对属性信息，

查询时使用语句：desc database extended change;

hive> create database change
    > with dbproperties ("creator" = "Jason Chan","data"="2019-05-28");
OK
Time taken: 0.109 seconds

hive> desc database extended change;
OK
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=Jason Chan, data=2019-05-28}
Time taken: 0.02 seconds, Fetched: 1 row(s)

修改或者新增数据库属性（不能删除）

hive> desc database extended change;
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=Jason Chan, data=2019-05-28}

hive> alter database change set dbproperties ("creator"="jason");

hive> desc database extended change;
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=jason, data=2019-05-28}

hive> alter database change set dbproperties ("edited-by"="Joe");

hive> desc database extended change;
change          hdfs://master:9000/user/hive/warehouse/change.db        root    USER    {creator=jason, data=2019-05-28, edited-by=Joe}

使用某个数据库：use

hive> show databases;
OK
default
financials
Time taken: 0.033 seconds, Fetched: 2 row(s)

hive> use financials;
OK
Time taken: 0.065 seconds

查询当前使用的数据库（当前的表属于哪个数据库）：set hive.cli.print.current.db=true;

hive> set hive.cli.print.current.db=true;

hive (financials)> use default;
OK
Time taken: 0.032 seconds

hive (default)> set hive.cli.print.current.db=false;

删除数据库：drop

hive> drop database if exists human_resources;
OK
Time taken: 0.078 seconds

Hive不允许删除有表的数据库

hive> use traffic;
OK
Time taken: 0.039 seconds
    //traffic数据库下面有表
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.05 seconds, Fetched: 2 row(s)
    //删除traffic数据库会报错
hive> drop database traffic;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)

解决方案：

1.先清空数据库下的表：

//删除表
hive> drop table monitor_camera_info;
OK
Time taken: 0.132 seconds
hive> drop table monitor_flow_action;
OK
Time taken: 0.236 seconds
    //此时再删除数据库就可以了
hive> drop database traffic;
OK
Time taken: 0.108 seconds

2.使用关键字： cascade

//数据库traffic下有表
hive> show tables;
OK
monitor_camera_info
monitor_flow_action
Time taken: 0.06 seconds, Fetched: 2 row(s)

    //正常删库报错
hive> drop database if exists traffic; 
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)

    //使用关键字restrict也不能删除数据库    
hive> drop database if exists traffic restrict; 
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database traffic is not empty. One or more tables exist.)

    //使用关键字cascade可以正常删除
hive> drop database if exists traffic cascade; 
OK
Time taken: 0.25 seconds

创建表（change数据库下创建employees表），在每个字段类型后面追加一个注释

TBLPROPERTIES作用：按键值对的格式为表增加额外的文档说明（还可作为表示关于数据库连接的必要的元数据信息） P53

Hive会自动增加两个属性：

1.last_modified_by，保存最后修改这个表的用户的用户名

2.last_modified_time，保存着最后一次修改的新纪元时间秒

//建表sql语句
create table if not exists change.employees3 (
    name            string  comment "employee name",

    salary          float   comment "employee salary",

    subordinates    array<string>   comment "names of subordinates",

    deductions      map<string,float>
                    comment "keys are deductions names, values are percentages",

    address         struct<street:string,city:string,state:string,zip:int>
                    comment "home address")
            
comment "description of the table"
TBLPROPERTIES ('creator'='me','created_at'='2019-05-28')
;


    //查看结果

hive> desc  employees3;
OK
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
Time taken: 0.116 seconds, Fetched: 5 row(s)

为表本身添加注释，自定义一个或多个表属性

怎么查看表的描述：comment "description of the table" ？？？ show tblproperties employees3;

hive> show tblproperties employees3;
OK
comment description of the table
created_at      2019-05-28
creator me
transient_lastDdlTime   1559027367
Time taken: 0.128 seconds, Fetched: 4 row(s)

用户还可以拷贝一张已经存在的表的模式（不拷贝数据）：在change数据库下创建表employees4（要求表模式与employees3相同）

创建employee4.3_2.sql文件输入下面语句，然后执行hive -f employee4.3_2.sql

[root@master Programming_Hive]# cat employee4.3_2.sql 
create table if not exists change.employees4
like change.employees3;

[root@master Programming_Hive]# hive  -f  employee4.3_2.sql

查看表employees4的属性

hive> show tables;
OK
employees3
employees4
Time taken: 0.081 seconds, Fetched: 2 row(s)
hive> desc employees4;
OK
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
Time taken: 0.136 seconds, Fetched: 5 row(s)

在default数据库下查看change数据库的表 show tables in change;

hive> set hive.cli.print.current.db=true;
hive (default)> show tables in change;
OK
employees3
employees4
Time taken: 0.051 seconds, Fetched: 2 row(s)
hive (default)> set hive.cli.print.current.db=false;
hive>

使用正则表达式查询表： show tables 'empl*'; 查询empl开头的所有的表

hive> show tables "empl*";
OK
employees3
employees4
Time taken: 0.034 seconds, Fetched: 2 row(s)

在数据库default下查找数据库change的表的结构信息。desc extended change.employees3; 实际我们更倾向于使用FORMATTED，因为输出内容更详细，且可读性较好

hive (default)> desc extended change.employees3;
OK
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
                 
Detailed Table Information      Table(tableName:employees3, dbName:change, owner:root
...
location:hdfs://master:9000/user/hive/warehouse/change.db/employees3,  
...
parameters:{creator=me, transient_lastDdlTime=1559027367, created_at=2019-05-28, comment=description of the table}, ...)
Time taken: 0.091 seconds, Fetched: 7 row(s)

查看表的某一列信息：salary列（加不加extended结果是一样的）

hive (change)> describe employees3.salary;
OK
salary                  float                   from deserializer   
Time taken: 0.117 seconds, Fetched: 1 row(s)

4.3.2外部表

创建外部表，读取HDFS目录：/hive/programming_hive/data/stocks下所有的文件

删除外表，并不会删掉这个表中的数据，只是删除描述表的元数据

[root@master Programming_Hive]# cat 4.3.2stock.sql                                              create external table if not exists stocks (
`exchange` string,
`symbol` string,
`ymd` string,
`price_open` float,
`price_high` float,
`price_low` float,
`price_close` float,
`volume` int,
`price_adj_close` float)
row format delimited fields terminated by ','
location '/hive/programming_hive/data/stocks';

exchange 字段为hive保留字段，修改之后即可。如果非要用这个字段名，就用反引号（Tab上面的键）引起来：

严格来说：Hive是管理着管理表和外部表的的目录和文件，但是并没有对表具有完全的控制权限

查看表是管理表还是外部表：desc formatted movie_table

hive (default)> desc FORMATTED movie_table;
...  
Table Type:             EXTERNAL_TABLE                    
...    
hive (default)> desc FORMATTED jason;
...    
Table Type:             MANAGED_TABLE            
...

创建外部表，但是可以复制内部表的结构(不复制数据) employees5外部表，employees3内部表

如果employees为外部表，external可省略，复制的表依然为外部表

[root@master Programming_Hive]# cat 4.3.2external_table.sql 
create table if not exists change.employees5 
like change.employees3
location '/hive/programming_hive/test';

[root@master Programming_Hive]# hive -f 4.3.2external_table.sql 
...
OK
Time taken: 1.939 seconds

查看结构employees3和employees5一样，但employees5表的类型为external

hive (change)> desc formatted employees5;
OK
# col_name              data_type               comment             
                 
name                    string                  employee name       
salary                  float                   employee salary     
subordinates            array<string>           names of subordinates
deductions              map<string,float>       keys are deductions names, values are percentages
address                 struct<street:string,city:string,state:string,zip:int>  home address        
...                   
Table Type:             EXTERNAL_TABLE

4.4 分区表、管理表

如果表中的数据以及分区个数都非常大的话，执行一个包含所有分区的查询可能会触发一个巨大的MapReduce任务。

建议：将Hive设置为strict模式，（对分区表进行查询而where子句没有加分区过滤器，将会禁止提交这个任务）

hive> set hive.mapred.mode=strict;

hive> select e.name,e.salary from employees e limit 10;

FAILED:Error in semantic analysis: No partition predicate found for Alias  "e" Table  "employees"

hive> set hive.mapred.mode=nonstrict;

hive> select e.name,e.salary from employees e limit 10;

John  Doe    10000.0  

...

查看分区：show partitions

hive (default)> show partitions rating_table_p;
OK
dt=2008-03
dt=2008-08
Time taken: 0.136 seconds, Fetched: 2 row(s)

查看分区键的方法：desc extended rating_table_p; partitionKeys分区键

hive (default)> desc extended rating_table_p;
OK
userid                  string                                      
movieid                 string                                      
rating                  string                                      
dt                      string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
                 
partitionKeys:[FieldSchema(name:dt, type:string, comment:null)]

分区：

1.建表的时候创建分区

2.加载数据的时候创建分区

4.4.1