Comparing Impala and HBase

Reposted

mob6454cc63081f 2023-07-29 23:12:24

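I. Performance Verification

Before using this combination in production, the following scenarios need to be verified:

- Forward operation: loading or updating HBase records at scale from Impala via SQL insert

- Reverse operation: exporting an HBase table into Impala as a table that can be analyzed and aggregated

If these scenarios do not meet the performance requirements, the approach is hard to use for ETL in production and is only suitable for small, localized batch updates.

1. Preparing the sample data

To simulate a large data volume, the table is widened to 200 fields and populated with a full data set of 10 million records.

(1) Create the target table in HBase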

create 'cust_full', 'cf_01', 'cf_02'

(2) Create a table with 10 million records in Impala

CREATE TABLE cust_full(
cust_id   string,
col_01_001 string,
col_01_002 string,
col_01_003 string,
...
col_02_001 string,
col_02_002 string,
col_02_003 string,
col_02_004 string,
...
) row format delimited fields terminated by '|' lines terminated by '\n'
stored as textfile;

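Then upload the prepared data file to the table's HDFS directory so that it becomes the table's data.

This gives us the base Impala table: 10 million records, 200 fields.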
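The statements in the next step write to a table named hbase_cust_full, whose definition this article does not show. Impala accesses HBase through a table defined in Hive with the HBaseStorageHandler, so one plausible way to obtain such a definition for a wide table is to generate it. The following is a minimal sketch under assumptions, not the author's script: the helper name build_hbase_mapping_ddl is hypothetical, and so is the rule that col_01_* columns live in cf_01 and col_02_* columns in cf_02.

```python
def build_hbase_mapping_ddl(table, hbase_table, key, families):
    """Build a Hive DDL string that maps an existing HBase table.

    families maps a column-family name to the column names stored in it.
    """
    columns = [f"  {key} string"]
    mapping = [":key"]  # the first column maps to the HBase row key
    for family, names in families.items():
        for name in names:
            columns.append(f"  {name} string")
            mapping.append(f"{family}:{name}")
    mapping_str = ",".join(mapping)
    lines = [
        f"CREATE EXTERNAL TABLE {table} (",
        ",\n".join(columns),
        ")",
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'",
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping_str}')",
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');",
    ]
    return "\n".join(lines)


# Assumption: only a few columns per family are listed here for brevity.
families = {
    "cf_01": [f"col_01_{i:03d}" for i in range(1, 4)],
    "cf_02": [f"col_02_{i:03d}" for i in range(1, 5)],
}
ddl = build_hbase_mapping_ddl("hbase_cust_full", "cust_full", "cust_id", families)
print(ddl)
```

The generated statement would be run in Hive, after which Impala typically needs INVALIDATE METADATA hbase_cust_full before it can see the new table.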

2. Verifying Impala-to-HBase write performance

Now try to insert this large table into HBase to simulate writing wide-table data.

[bd-131:21000] > insert into hbase_cust_full select * from cust_full ;
Query: insert into hbase_cust_full select * from cust_full
WARNINGS:
RetriesExhaustedWithDetailsException: Failed 172 actions: RegionTooBusyException: 172 times,
RetriesExhaustedWithDetailsException: Failed 86 actions: RegionTooBusyException: 86 times,
RetriesExhaustedWithDetailsException: Failed 172 actions: RegionTooBusyException: 172 times,
RetriesExhaustedWithDetailsException: Failed 86 actions: RegionTooBusyException: 86 times,

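This problem arises because HBase performs region splits while the data is being loaded, and a split blocks writes to the affected region. It is fairly common in HBase development.

To work around it, try optimizing how the HBase table is created by pre-splitting it into regions: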

disable 'cust_full'
drop 'cust_full'
create 'cust_full',{METHOD => 'table_att', MAX_FILESIZE => '6442450944'}, { NAME => 'cf_01'}, { NAME => 'cf_02'},{SPLITS => ['1001000000','1002000000','1003000000','1004000000','1005000000','1006000000','1007000000','1008000000','1009000000']}
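The nine split points above cut the key space into ten equal ranges. As a rough sketch, they can be derived rather than typed by hand. The key range used here (10-digit cust_id values from 1000000000 up to, but not including, 1010000000) is an assumption inferred from the split points and the 10 million rows; the article does not state it. The helper name even_split_points is hypothetical.

```python
def even_split_points(start, end, regions):
    """Return regions-1 split keys that cut [start, end) into equal-width ranges."""
    if regions < 2:
        return []
    # Keep keys fixed-width so that string order matches numeric order.
    width = len(str(end - 1))
    step = (end - start) // regions
    return [str(start + i * step).zfill(width) for i in range(1, regions)]


# Assumption: keys run from 1000000000 to 1009999999.
splits = even_split_points(1_000_000_000, 1_010_000_000, 10)
print(splits)
```

Equal-width ranges only balance the load if the keys are evenly distributed. With skewed keys, split points taken from a sample of real keys would probably be a better fit.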

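During the insert, the same timeout problem still occurred and the load failed: only about 1.16 million records were actually inserted, after more than 20 minutes. This exposes two risks:

- The accuracy of the pre-split range boundaries

- The instability of data loading

Unless a program controls the timeouts and waits, it is hard to make the load succeed in a single pass.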
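One way to put the load under program control is to break it into key-range batches and retry each batch with an increasing wait. The sketch below only illustrates that idea under assumptions. The execute callable is a hypothetical stand-in for whatever client runs the statement, and it is assumed to raise an exception when HBase actions fail. The transcripts above show such failures under WARNINGS, so a real wrapper would have to check for them. The key boundaries are also assumed, not taken from the article.

```python
import time


def key_range_inserts(boundaries):
    """Yield one INSERT per [low, high) key range so each statement stays small."""
    for low, high in zip(boundaries, boundaries[1:]):
        yield (
            "insert into hbase_cust_full select * from cust_full "
            f"where cust_id >= '{low}' and cust_id < '{high}'"
        )


def run_with_retry(execute, statement, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run one statement, waiting exponentially longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(statement)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a fake executor that fails twice, then succeeds.
calls = []


def flaky_execute(statement):
    calls.append(statement)
    if len(calls) < 3:
        raise RuntimeError("RegionTooBusyException")
    return "ok"


delays = []  # record the waits instead of actually sleeping
statements = list(key_range_inserts(["1000000000", "1001000000", "1002000000"]))
result = run_with_retry(flaky_execute, statements[0], sleep=delays.append)
print(result, delays)
```

Re-running a failed batch should generally be safe here, because writing the same row key and columns again rewrites the same cells instead of creating duplicate rows.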

Try writing only a subset of the fields (1 primary key + 2 fields):

[bd-131:21000] > insert into hbase_cust_full(cust_id,col_01_001,col_01_002) select cust_id,col_01_001,col_01_002 from cust_full ;
Query: insert into hbase_cust_full(cust_id,col_01_001,col_01_002) select cust_id,col_01_001,col_01_002 from cust_full
Inserted 10000000 row(s) in 136.67s
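To put this result next to the failed full-width load, the reported figures can be turned into rates. This is back-of-the-envelope arithmetic, not a measurement. The full-width rate is only an upper bound, because that run took more than 20 minutes. The per-cell figures rest on the assumption that the wide table has 199 non-key columns and that each one is written as one HBase cell.

```python
def rate(count, seconds):
    """Average throughput in items per second."""
    return count / seconds


# Figures reported in this article.
narrow_rows = rate(10_000_000, 136.67)  # 1 key + 2 columns, completed
wide_rows = rate(1_164_460, 20 * 60)    # full-width rows; upper bound (>20 min)

# Assumption: one HBase cell per non-key column, 199 non-key columns when wide.
narrow_cells = narrow_rows * 2
wide_cells = wide_rows * 199

print(f"narrow: {narrow_rows:,.0f} rows/s, {narrow_cells:,.0f} cells/s")
print(f"wide:   under {wide_rows:,.0f} rows/s, under {wide_cells:,.0f} cells/s")
```

Under these assumptions the narrow load is more than 75 times faster per row, while the per-cell rates are of the same order of magnitude. That would be consistent with write cost growing with the number of cells rather than the number of rows.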

Try writing more fields (1 primary key + 10 fields):

[bd-131:21000] > insert into hbase_cust_full(cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010) select cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010 from cust_full ;
Query: insert into hbase_cust_full(cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010) select cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010 from cust_full
WARNINGS: 
RetriesExhaustedWithDetailsException: Failed 1024 actions: IOException: 1024 times, 
RetriesExhaustedWithDetailsException: Failed 1024 actions: IOException: 1024 times, 
RetriesExhaustedWithDetailsException: Failed 1024 actions: IOException: 1024 times,

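3. Verifying HBase-to-Impala export performance

Read-back verification: write the roughly 1.16 million records that were just inserted back into an Impala table.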

[bd-131:21000] > create table tt_1 as select * from hbase_cust_full ;
Query: create table tt_1 as select * from hbase_cust_full
+-------------------------+
| summary                 |
+-------------------------+
| Inserted 1164460 row(s) |
+-------------------------+
Fetched 1 row(s) in 576.32s

This takes about 10 minutes, so the cost of exporting is also high. Full scans have always been a weak point of key/value stores.
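The export figure can be converted into a rate and projected onto the full table. The projection below is a sketch that assumes the rate stays constant as the table grows, which the article does not verify.

```python
# Figures reported in the transcript above.
exported_rows = 1_164_460
export_seconds = 576.32

export_rate = exported_rows / export_seconds

# Assumption: exporting all 10 million rows proceeds at the same average rate.
full_table_rows = 10_000_000
estimated_minutes = full_table_rows / export_rate / 60

print(f"export rate: {export_rate:,.0f} rows/s")
print(f"estimated full export: {estimated_minutes:.0f} minutes")
```

At roughly two thousand rows per second, a full export of the 10-million-row table would take well over an hour under this assumption.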