HBase official deployment documentation: HBase and Hive

Reposted

clghxq 2023-07-06 21:41:54


Integrating HBase with Hive

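The relationship between HBase and Hive
Integrating HBase with Hive

How the integration works:
How to integrate:

Mapping all of the HBase data:
Mapping part of the HBase data: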

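The relationship between HBase and Hive

HBase is a distributed NoSQL database
        Table structure: a four-dimensional table (rowkey, column family, column qualifier, timestamp)
        Supports near-real-time random reads
        Has no joins or other analytical functions

Hive is a data warehouse
        Its table structure is a mapping onto the data stored in HDFS; the underlying data layout is not changed
        Good at data analysis, with a fairly complete set of functions

There are currently three ways to analyze data stored in HBase:
1) use MapReduce;
2) integrate with Hive;
3) use Spark, which can also read data from HBase.
HBase's own syntax does not support analytical queries, so HBase is integrated with Hive to make statistical analysis of HBase data easier.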

Integrating HBase with Hive

How the integration works:

Hive reads the data in HBase and converts it into two-dimensional table data. In effect, Hive flattens the HBase data.
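
The flattening idea can be sketched in a few lines of Python. This is only an illustration, not how Hive implements it: the function name flatten and the tuple layout of the cells are assumptions made for this example, and the sample cells are copied from the first rows of the mingxing table shown later in this article.

```python
from collections import defaultdict

# Sample cells as (rowkey, column family, qualifier, value).
# The tuple layout is an assumption made for this sketch.
cells = [
    ("rk001", "base_info", "age", "33"),
    ("rk001", "base_info", "name", "huangbo"),
    ("rk001", "extra_info", "math", "44"),
    ("rk001", "extra_info", "province", "beijing"),
    ("rk002", "base_info", "age", "44"),
    ("rk002", "base_info", "name", "xuzheng"),
]


def flatten(cells, families):
    """Group cells by rowkey: one output row per rowkey, one map per column family."""
    rows = defaultdict(lambda: {family: {} for family in families})
    for rowkey, family, qualifier, value in cells:
        if family in families:
            rows[rowkey][family][qualifier] = value
    return [(rowkey, *[rows[rowkey][f] for f in families]) for rowkey in sorted(rows)]


table = flatten(cells, ["base_info", "extra_info"])
for row in table:
    print(row)
```

Each output tuple plays the role of one Hive row: the rowkey first, then one map per column family. A family with no cells for that rowkey becomes an empty map.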

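The core jar for integrating Hive with HBase is hive-hbase-handler-2.3.2.jar, and the core of the integration is the HBaseStorageHandler class in that jar.

Hive and HBase are integrated through the public APIs that each exposes, and they communicate mainly through HBaseStorageHandler. Through HBaseStorageHandler, Hive can obtain the HBase table name, column families, and columns that correspond to a Hive table, obtain the InputFormat and OutputFormat classes, and create and drop HBase tables.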

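When Hive accesses table data in HBase, it is actually reading the HBase table through MapReduce. Inside the MR job, HiveHBaseTableInputFormat splits the HBase table and obtains a RecordReader object to read the data.

An HBase table is split so that each Region becomes one Split; that is, the MR job has as many Map tasks as the table has Regions.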
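
The one-Region-one-Split rule can be illustrated with a small Python model. The split keys below are hypothetical (the article does not say how the mingxing table is divided into Regions), and region_index is a helper invented for this sketch.

```python
import bisect

# Hypothetical region boundaries: split keys "rk003" and "rk005" give three regions:
# [start, rk003), [rk003, rk005), [rk005, end)
split_keys = ["rk003", "rk005"]
rowkeys = ["rk001", "rk002", "rk003", "rk004", "rk005", "rk006"]


def region_index(rowkey, split_keys):
    """Return the index of the region whose key range contains rowkey."""
    # A region covers [start_key, end_key), so a rowkey equal to a split key
    # belongs to the region that starts at that key.
    return bisect.bisect_right(split_keys, rowkey)


# Each region becomes one input split, read by one Map task.
splits = {}
for rowkey in rowkeys:
    splits.setdefault(region_index(rowkey, split_keys), []).append(rowkey)

# The number of Map tasks equals the number of regions, even if a region is empty.
num_map_tasks = len(split_keys) + 1

for index in sorted(splits):
    print(index, splits[index])
print("map tasks:", num_map_tasks)
```

Under these assumed split keys, the six rowkeys fall into three regions of two rows each, so the job would run three Map tasks.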

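HBase table data is always read by building a Scanner that scans the whole table. If the query has filter conditions, they are converted into a Filter. When the filter condition is on the rowkey, it is converted into filtering on the rowkey. The Scanner fetches data by calling next() on the RegionServer over RPC.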
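
The difference between a rowkey condition and any other filter condition can be sketched as follows. This is a simplified model, not Hive's or HBase's actual code: it assumes, for illustration only, that a rowkey condition narrows the scanned key range, while other conditions are checked row by row. The scan function and its parameters are invented for this example.

```python
import bisect

# Hypothetical rows, sorted by rowkey as HBase stores them.
rows = [
    ("rk001", {"base_info:name": "huangbo"}),
    ("rk002", {"base_info:name": "xuzheng"}),
    ("rk003", {"base_info:name": "wangbaoqiang"}),
    ("rk005", {"base_info:name": "liutao"}),
]


def scan(rows, start_row=None, stop_row=None, row_filter=None):
    """Return rows with start_row <= rowkey < stop_row that pass row_filter."""
    keys = [rowkey for rowkey, _ in rows]
    lo = 0 if start_row is None else bisect.bisect_left(keys, start_row)
    hi = len(rows) if stop_row is None else bisect.bisect_left(keys, stop_row)
    return [
        (rowkey, cols)
        for rowkey, cols in rows[lo:hi]
        if row_filter is None or row_filter(rowkey, cols)
    ]


# A rowkey condition only touches the rows inside the key range.
by_key = scan(rows, start_row="rk002", stop_row="rk004")

# A condition on a column value has to examine every row.
by_value = scan(rows, row_filter=lambda rowkey, cols: cols.get("base_info:name") == "liutao")

print([rowkey for rowkey, _ in by_key])
print([rowkey for rowkey, _ in by_value])
```

In this model the rowkey condition is cheaper because rows outside the range are never visited, which is why a condition on the rowkey is worth treating separately.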

How to integrate:

The following steps are run in the Hive shell.
First put Hive into local mode: set hive.exec.mode.local.auto=true;

1) Set the ZooKeeper quorum that HBase uses
set  hbase.zookeeper.quorum=hadoop01:2181,hadoop02:2181,hadoop03:2181;

2) Set the znode path under which HBase stores its data in ZooKeeper (the storage node path), also called the addressing path.
set zookeeper.znode.parent=/hbase;

3) Add the jar that lets Hive parse HBase data to Hive's classpath
add jar /home/jacob/app/apache-hive-2.3.2-bin/lib/hive-hbase-handler-2.3.2.jar;

Check that it has been added:
list jars;

Once the integration is set up, read the HBase table from Hive.
An example follows.
HBase contains the following table:

hbase(main):004:0> scan "mingxing"
ROW                            COLUMN+CELL                                                                             
 rk001                         column=base_info:age, timestamp=1583625287636, value=33                                 
 rk001                         column=base_info:name, timestamp=1583625287196, value=huangbo                           
 rk001                         column=extra_info:math, timestamp=1583625287824, value=44                               
 rk001                         column=extra_info:province, timestamp=1583625287945, value=beijing                      
 rk002                         column=base_info:age, timestamp=1583625288187, value=44                                 
 rk002                         column=base_info:name, timestamp=1583625288086, value=xuzheng                           
 rk003                         column=base_info:age, timestamp=1583625288360, value=55                                 
 rk003                         column=base_info:gender, timestamp=1583625288438, value=male                            
 rk003                         column=base_info:name, timestamp=1583625288268, value=wangbaoqiang                      
 rk004                         column=extra_info:children, timestamp=1583625288698, value=3                            
 rk004                         column=extra_info:math, timestamp=1583625288500, value=33                               
 rk004                         column=extra_info:province, timestamp=1583625288585, value=tianjin                      
 rk005                         column=base_info:name, timestamp=1583625288795, value=liutao                            
 rk006                         column=extra_info:name, timestamp=1583625290356, value=liujialing

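Create the table in Hive, specifying the storage handler class in the create statement. Queries against it are converted into MR jobs.

Mapping all of the HBase data: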

create external table mingxing(rowkey string, base_info map<string, string>, extra_info map<string, string>) 
row format delimited fields terminated by '\t' 
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,base_info:,extra_info:")
tblproperties ("hbase.table.name" = "mingxing");

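Brief explanation:

with serdeproperties: specifies how the HBase table structure corresponds to the Hive table

hbase.columns.mapping: specifies the mapping between the HBase table and the Hive table.
Note: the mapping entries correspond one to one, in order, with the columns in the Hive create statement.
        Each HBase entry is given in key:value form: k (the column family name) : v (the columns and values under that column family)
        In this example:      key: base_info
                        value:  name:zs    age:12
        :key  maps the rowkey value

hbase.table.name: specifies the name of the corresponding HBase table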
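
To make the one-to-one correspondence concrete, here is a rough Python sketch that pairs each Hive column with its entry in hbase.columns.mapping. It is a simplified illustration, not Hive's own parser: the function parse_columns_mapping and the labels "rowkey", "family", and "column" are invented for this example.

```python
def parse_columns_mapping(mapping, hive_columns):
    """Pair each Hive column with its entry in an hbase.columns.mapping string."""
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping entries and Hive columns must match one to one")
    result = []
    for hive_column, entry in zip(hive_columns, entries):
        if entry == ":key":
            # The special entry :key stands for the HBase rowkey.
            result.append((hive_column, "rowkey", None, None))
        else:
            family, _, qualifier = entry.partition(":")
            # "family:" maps a whole column family; "family:qualifier" maps one column.
            kind = "column" if qualifier else "family"
            result.append((hive_column, kind, family, qualifier or None))
    return result


whole_families = parse_columns_mapping(
    ":key,base_info:,extra_info:",
    ["rowkey", "base_info", "extra_info"],
)
for item in whole_families:
    print(item)
```

The position of an entry decides which Hive column it belongs to, so the first Hive column receives the rowkey and the two map columns each receive a whole column family.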

View the result in Hive:

hive> select * from mingxing;
OK
mingxing.rowkey mingxing.base_info      mingxing.extra_info
rk001   {"age":"33","name":"huangbo"}   {"math":"44","province":"beijing"}
rk002   {"age":"44","name":"xuzheng"}   {}
rk003   {"age":"55","gender":"male","name":"wangbaoqiang"}      {}
rk004   {}      {"children":"3","math":"33","province":"tianjin"}
rk005   {"name":"liutao"}       {}
rk006   {}      {"name":"liujialing"}
Time taken: 0.907 seconds, Fetched: 6 row(s)

Mapping part of the HBase data:

Using the same mingxing table as above, suppose we want to query the name, age, and math score from the table

create external table mingxing_02(rowkey string,name string,age int,math int) 
row format delimited fields terminated by '\t' 
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,base_info:name,base_info:age,extra_info:math") 
tblproperties ("hbase.table.name" = "mingxing");

View the result in Hive:

hive> select * from mingxing_02;
OK
mingxing_02.rowkey      mingxing_02.name        mingxing_02.age mingxing_02.math
rk001   huangbo 33      44
rk002   xuzheng 44      NULL
rk003   wangbaoqiang    55      NULL
rk004   NULL    NULL    33
rk005   liutao  NULL    NULL
Time taken: 1.037 seconds, Fetched: 5 row(s)

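rk006 has no math column in extra_info and no data in base_info, so none of the mapped columns has a value for that row. It is therefore left out of the result, which shows only 5 rows.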
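
The behaviour can be reproduced with a small Python model. It is a sketch under the assumption that a row is returned only when at least one mapped column has a cell; the function project is invented for this example, the sample rows are copied from the scan output above, and the values are left as strings rather than cast to int as Hive does.

```python
# Cells of three rows from the mingxing table, keyed by "family:qualifier".
rows = {
    "rk004": {"extra_info:children": "3", "extra_info:math": "33", "extra_info:province": "tianjin"},
    "rk005": {"base_info:name": "liutao"},
    "rk006": {"extra_info:name": "liujialing"},
}

# The HBase columns mapped by mingxing_02, in the order of the Hive columns.
mapped = ["base_info:name", "base_info:age", "extra_info:math"]


def project(rows, mapped):
    """Keep only the mapped columns; skip rows that have none of them."""
    result = []
    for rowkey in sorted(rows):
        values = [rows[rowkey].get(column) for column in mapped]
        if any(value is not None for value in values):
            result.append((rowkey, *values))
    return result


projected = project(rows, mapped)
for row in projected:
    print(row)
```

rk006 has only extra_info:name, which is not among the mapped columns, so it produces no output row, while rk004 and rk005 appear with None where Hive prints NULL.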
