This article implements the following goals:

   1. A table created in Hive is also created and stored in HBase.

   2. Data inserted into the Hive table is synchronized to the corresponding HBase table.

   3. Changes to column-family values in HBase show up in the corresponding Hive table.

   4. Multi-column and multi-column-family mapping (example: 3 Hive columns mapped to 2 HBase column families).


Integrating Hive with HBase

 1. Create a table that HBase recognizes:

CREATE TABLE hbase_table_1(key int, value string)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
hbase.table.name sets the name of the table in HBase.
hbase.columns.mapping maps the Hive columns, in declaration order, to HBase column families and qualifiers.
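The pairing rule is worth spelling out: the mapping string has one comma-separated entry per Hive column, ":key" marks the column that becomes the HBase row key, and "family:qualifier" entries become regular cells. A minimal Python sketch of the idea (this is illustrative, not Hive's actual parser):

```python
# Illustrative sketch of how "hbase.columns.mapping" pairs Hive columns
# with HBase targets. ":key" marks the row-key column; everything else
# is a "family:qualifier" cell. Not Hive's real parsing code.

def map_columns(hive_columns, mapping):
    entries = mapping.split(",")
    # Hive expects exactly one mapping entry per Hive column.
    assert len(entries) == len(hive_columns), "one entry per Hive column"
    result = {}
    for col, entry in zip(hive_columns, entries):
        if entry == ":key":
            result[col] = ("<row key>", None)
        else:
            family, qualifier = entry.split(":", 1)
            result[col] = (family, qualifier)
    return result

print(map_columns(["key", "value"], ":key,cf1:val"))
# {'key': ('<row key>', None), 'value': ('cf1', 'val')}
```

So for hbase_table_1, the Hive column key becomes the HBase row key and value is stored under cf1:val.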


If the CREATE TABLE fails with:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:153)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:61)

Fix: Hive could not reach HBase's ZooKeeper quorum (see https://my.oschina.net/houxm/blog/639748).

Add the following to hive-site.xml:
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop01,hadoop02,hadoop03</value>
</property>
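Before digging into Hive logs, it can help to confirm the quorum hosts are even reachable. A small sketch (hadoop01..hadoop03 are this cluster's hosts from the config above; 2181 is ZooKeeper's default client port, so adjust both for your environment):

```python
# Quick TCP reachability check for the ZooKeeper quorum. This only
# verifies the port accepts connections, not that ZooKeeper is healthy.
import socket

def check_quorum(hosts, port=2181, timeout=2.0):
    """Return {host: True/False} for TCP reachability of each ZK node."""
    status = {}
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[host] = True
        except OSError:  # covers DNS failure, refusal, and timeout
            status[host] = False
    return status

if __name__ == "__main__":
    print(check_quorum(["hadoop01", "hadoop02", "hadoop03"]))
```

If any host reports False, fix connectivity (or the quorum value) before retrying the CREATE TABLE.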

Tables now visible in HBase:


hbase(main):001:0> list
TABLE
xyz
t1
2 row(s) in 0.7750 seconds

=> ["xyz", "t1"]

HBase has automatically created the table xyz.



2. Load data with SQL

i. Prepare the data

a) Create the Hive table:

hive> create table ccc(foo int,bar string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;


vi kv1.txt

1	xiaoming
2	cainiao
3	zhangsan

The fields are separated by a tab; the file lives under /home/hadoop/.


load data local inpath '/home/hadoop/kv1.txt' overwrite into table ccc;
hive> select * from ccc;
OK
1 xiaoming
2 cainiao
3 zhangsan
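The delimiter in kv1.txt must match the table's "fields terminated by '\t'" clause, or every row loads as NULLs. A quick Python sketch of the round trip (file contents as above; the splitting here just mimics how Hive reads delimited text):

```python
# Build the kv1.txt contents tab-delimited, then split each line the
# way Hive reads a '\t'-delimited text table.
rows = [(1, "xiaoming"), (2, "cainiao"), (3, "zhangsan")]
data = "".join(f"{k}\t{v}\n" for k, v in rows)

parsed = [line.split("\t") for line in data.splitlines()]
print(parsed)  # [['1', 'xiaoming'], ['2', 'cainiao'], ['3', 'zhangsan']]
```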


Load into hbase_table_1 with SQL:

hive> insert overwrite table hbase_table_1 select * from ccc where foo=1;

Query ID = hadoop_20170208120143_2d3a5cec-46d8-47ef-baeb-aa8ebaf5bc0e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486525566658_0001, Tracking URL = http://hadoop06:8088/proxy/application_1486525566658_0001/
Kill Command = /home/hadoop/cloud/hadoop-2.7.3/bin/hadoop job  -kill job_1486525566658_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-02-08 12:08:26,134 Stage-1 map = 0%,  reduce = 0%
2017-02-08 12:09:26,916 Stage-1 map = 0%,  reduce = 0%
2017-02-08 12:10:27,696 Stage-1 map = 0%,  reduce = 0%
2017-02-08 12:11:28,559 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 2.8 sec
2017-02-08 12:11:40,152 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.22 sec
MapReduce Total cumulative CPU time: 4 seconds 810 msec
Ended Job = job_1486525566658_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 4.81 sec   HDFS Read: 10855 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 810 msec
OK

Check the data in Hive; the row just inserted is displayed:

hive> select * from hbase_table_1;
OK
1 xiaoming


Log in to HBase and check the loaded data:


hbase(main):002:0> scan 'xyz';
ROW COLUMN+CELL
1 column=cf1:val, timestamp=1486527116866, value=xiaoming


Add data from the HBase shell:

hbase(main):004:0> put 'xyz','100','cf1:val','www.gongchang.com';
hbase(main):005:0> put 'xyz','200','cf1:val','hello,word!';
hbase(main):006:0> scan 'xyz';
ROW COLUMN+CELL
1 column=cf1:val, timestamp=1486527116866, value=xiaoming
100 column=cf1:val, timestamp=1486527821853, value=www.gongchang.com
200 column=cf1:val, timestamp=1486527821904, value=hello,word!
3 row(s) in 0.0270 seconds


Check the data in Hive:

hive> select * from hbase_table_1;
OK
1 xiaoming
100 www.gongchang.com
200 hello,word!

The rows just inserted from the HBase side are already visible in Hive.
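This two-way visibility follows from the storage model: the Hive table holds no data of its own, and every Hive query re-scans the same underlying HBase table. A toy model of the xyz example (a dict stands in for the HBase table; none of this is real Hive or HBase code):

```python
# Toy model: one shared table, two writers. Hive's view is just a scan
# of the mapped cell (cf1:val), so puts from the HBase shell appear in
# Hive query results immediately.
xyz = {}  # rowkey -> {"family:qualifier": value}

def put(table, row, col, value):
    """Like the HBase shell's `put`."""
    table.setdefault(row, {})[col] = value

def hive_select(table):
    """Hive reading cf1:val as the `value` column, in rowkey order."""
    return sorted((row, cells.get("cf1:val")) for row, cells in table.items())

put(xyz, "1", "cf1:val", "xiaoming")             # written via Hive's INSERT
put(xyz, "100", "cf1:val", "www.gongchang.com")  # written via hbase shell
put(xyz, "200", "cf1:val", "hello,word!")
print(hive_select(xyz))
# [('1', 'xiaoming'), ('100', 'www.gongchang.com'), ('200', 'hello,word!')]
```

The sorted order also mirrors HBase's lexicographic rowkey ordering, which is why "100" and "200" sort after "1" rather than numerically.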


Hive access to an existing HBase table

Prepare the data in HBase:

hbase(main):009:0> create 'student','info';
0 row(s) in 1.7530 seconds
hbase(main):014:0> put "student",'1','info:name','tom';
hbase(main):015:0* put "student",'2','info:name','lily';
hbase(main):016:0* put "student",'3','info:name','wwn';
hbase(main):018:0> scan 'student';
ROW COLUMN+CELL
1 column=info:name, timestamp=1486528334315, value=tom
2 column=info:name, timestamp=1486528334331, value=lily
3 column=info:name, timestamp=1486528334404, value=wwn
3 row(s) in 0.0380 seconds


Create an external table in Hive that maps onto the existing student table:

CREATE EXTERNAL TABLE hbase_table_3(key int, value string)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")
TBLPROPERTIES("hbase.table.name" = "student");


hive> select * from hbase_table_3;
OK
1 tom
2 lily
3 wwn
Time taken: 0.439 seconds, Fetched: 3 row(s)

As shown above, Hive can now query the data that already existed in HBase.


Note: if the value of the mapped info:name column changes in HBase, Hive query results change accordingly; updates to columns or column families that are not in the mapping do not show up in Hive query results.
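That visibility rule can be seen in miniature with the same toy dict model (the extra:score family here is a made-up example of an unmapped family, not something in the real student table):

```python
# Toy model: Hive only sees the cells its column mapping names.
# "extra:score" is a hypothetical unmapped family for illustration.
student = {
    "1": {"info:name": "tom", "extra:score": "99"},
    "2": {"info:name": "lily"},
}

def hive_view(table, mapped="info:name"):
    """Hive's view: only the mapped qualifier is visible per row."""
    return {row: cells.get(mapped) for row, cells in table.items()}

print(hive_view(student))  # {'1': 'tom', '2': 'lily'} -- extra:score is invisible
```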


3. Multiple Columns and Families

1. Create the table

Create the Hive table:

CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu")
TBLPROPERTIES("hbase.table.name" = "student_info");


The parameter hive.cli.print.header controls whether the CLI prints column headers:


hive> set hive.cli.print.header=true;


hive> select * from hbase_table_3;
OK
hbase_table_3.key hbase_table_3.value
1 tom
2 lily
3 wwn


Run the insert from Hive:

hive> insert overwrite table hbase_table_add1 select key,value,key+1,value from hbase_table_3;

Query ID = hadoop_20170208150313_7aba27e4-8c94-4432-b08a-e14af6e10b17
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486525566658_0002, Tracking URL = http://hadoop06:8088/proxy/application_1486525566658_0002/
Kill Command = /home/hadoop/cloud/hadoop-2.7.3/bin/hadoop job  -kill job_1486525566658_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-02-08 15:09:28,333 Stage-1 map = 0%,  reduce = 0%
2017-02-08 15:10:29,086 Stage-1 map = 0%,  reduce = 0%
2017-02-08 15:11:29,331 Stage-1 map = 0%,  reduce = 0%
2017-02-08 15:12:17,755 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.09 sec
MapReduce Total cumulative CPU time: 14 seconds 90 msec
Ended Job = job_1486525566658_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 14.09 sec   HDFS Read: 5226 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 90 msec
OK
_col0   _col1   _col2   _col3
Time taken: 571.258 seconds

hive> select * from hbase_table_add1;
hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1 tom 2 NULL
2 lily 3 NULL
3 wwn 4 NULL
Time taken: 0.392 seconds, Fetched: 3 row(s)

Note that value3 is NULL: the query selected the string column value into the int column value3, and the implicit cast of a name like "tom" to int yields NULL. The second insert below fixes this by selecting key+100 instead.


hive> insert overwrite table hbase_table_add1 select key,value,key+1,key+100 from hbase_table_3;

Query ID = hadoop_20170208151919_f0513cd5-b742-479c-96ec-56f3c4b47d5e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486525566658_0003, Tracking URL = http://hadoop06:8088/proxy/application_1486525566658_0003/
Kill Command = /home/hadoop/cloud/hadoop-2.7.3/bin/hadoop job  -kill job_1486525566658_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-02-08 15:22:57,975 Stage-1 map = 0%,  reduce = 0%
2017-02-08 15:23:58,599 Stage-1 map = 0%,  reduce = 0%
2017-02-08 15:24:59,518 Stage-1 map = 0%,  reduce = 0%
2017-02-08 15:25:22,711 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.46 sec
MapReduce Total cumulative CPU time: 4 seconds 460 msec
Ended Job = job_1486525566658_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 4.46 sec   HDFS Read: 5345 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 460 msec
OK
key     value   c2      c3
Time taken: 392.229 seconds

hive> select * from hbase_table_add1;
OK
hbase_table_add1.key hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1 tom 2 101
2 lily 3 102
3 wwn 4 103
Time taken: 0.591 seconds, Fetched: 3 row(s)




Log in to HBase and check:

hbase(main):001:0> list
TABLE
student
student_info
xyz
3 row(s) in 0.9700 seconds

=> ["student", "student_info", "xyz"]

hbase(main):003:0> scan 'student_info';
ROW COLUMN+CELL
1 column=city:nu, timestamp=1486538721232, value=101
1 column=info:col1, timestamp=1486538721232, value=tom
1 column=info:col2, timestamp=1486538721232, value=2
2 column=city:nu, timestamp=1486538721232, value=102
2 column=info:col1, timestamp=1486538721232, value=lily
2 column=info:col2, timestamp=1486538721232, value=3
3 column=city:nu, timestamp=1486538721232, value=103
3 column=info:col1, timestamp=1486538721232, value=wwn
3 column=info:col2, timestamp=1486538721232, value=4
3 row(s) in 30.5480 seconds
put "student_info",'4','info:col1','haha';


Check in Hive again; the new row appears, with NULL for the cells that were not written:
hive> select * from hbase_table_add1;
OK
hbase_table_add1.key hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1 tom 2 101
2 lily 3 102
3 wwn 4 103
4 haha NULL NULL
Time taken: 0.443 seconds, Fetched: 4 row(s)




Here there are three Hive value columns (value1, value2, value3) and two HBase column families (info, city).

Two of the Hive columns (value1 and value2) map to one family (info, as qualifiers col1 and col2), while the remaining column (value3) maps to the qualifier nu in the city family.

This shows how many Hive columns can be stored in a small, fixed set of HBase column families.
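The write path for one row can be sketched the same way as before (illustrative only, not Hive's actual code): each Hive row of hbase_table_add1 becomes one HBase row whose cells are spread across the two families according to the mapping.

```python
# Illustrative: how one Hive row of hbase_table_add1 is spread over the
# info and city families, following ":key,info:col1,info:col2,city:nu".
# Not Hive's actual write path; values are stringified as HBase stores bytes.
MAPPING = [":key", "info:col1", "info:col2", "city:nu"]

def to_hbase_row(hive_row):
    rowkey, cells = None, {}
    for value, target in zip(hive_row, MAPPING):
        if target == ":key":
            rowkey = str(value)
        else:
            cells[target] = str(value)
    return rowkey, cells

print(to_hbase_row((1, "tom", 2, 101)))
# ('1', {'info:col1': 'tom', 'info:col2': '2', 'city:nu': '101'})
```

The output matches the scan of student_info above: row "1" carries info:col1=tom, info:col2=2, and city:nu=101.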