

一、Overview

https://hbase.apache.org/

HDFS: the Hadoop Distributed File System, suited to storing unstructured data and providing read/write access to it.

Apache HBase is a distributed, column-oriented, non-relational database built on top of HDFS. It is reliable and stable, supports automatic failover, and keeps multiple versions of data.

HBase is essentially an open-source implementation of Google's BigTable and is designed for massive structured data sets (billions of rows, millions of columns).

Use HBase when you need random, real-time read/write access to big data.

Features

  • Large: a table can hold billions of rows and millions of columns, suited to large-scale structured data storage.
  • Column-oriented: storage and access control are organized by column (family), and each column (family) can be retrieved independently.
  • Sparse: columns with NULL values take up no storage space, so tables can be designed to be extremely sparse.
  • Schema-free: every row has a sortable primary key (RowKey) and an arbitrary number of columns; columns can be added dynamically, and different rows in the same table can have completely different columns.
  • Multi-versioned: each cell can hold several versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was written.
  • Single data type: all data is stored as byte[] at the lowest layer; HBase itself has no data types.

Data Model

Column-oriented vs. row-oriented storage

Most relational databases are row-oriented: data is laid out on the physical medium (disk) row by row, appended sequentially. Row stores excel at OLTP (online transaction processing), but when a query touches only a few columns:

  • whole rows may be read anyway, wasting IO;
  • row-store systems usually run on a single server, so their scale is strictly limited by hardware;
  • query latency is relatively high.

HBase is a column(-family)-oriented store that distributes data by column family; it is well suited to OLAP (online analytical processing) and large-scale structured data storage.
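To make the logical model concrete, an HBase table can be pictured as a sparse, sorted, multi-versioned nested map: rowkey → (column family:qualifier → (timestamp → value)). The sketch below is only an illustration of that addressing scheme, not HBase's actual storage format; the table and column names are made up.

import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: a row is a sorted map of columns, each column a map of versions.
public class LogicalModelSketch {
    public static void main(String[] args) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table = new TreeMap<>();

        // analogous to: put 'tt_user','user001','cf1:name','zs' (timestamp assigned at write time)
        table.computeIfAbsent("user001", r -> new TreeMap<>())
             .computeIfAbsent("cf1:name", c -> new TreeMap<>(Collections.reverseOrder()))
             .put(System.currentTimeMillis(), "zs");

        // a missing column simply has no entry -- this is why sparse rows cost no space
        NavigableMap<String, NavigableMap<Long, String>> row = table.get("user001");
        // the newest version comes first because timestamps are sorted descending
        String latestName = row.get("cf1:name").firstEntry().getValue();
        System.out.println(latestName); // zs
    }
}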

二、Environment Setup

Setting up an HBase pseudo-distributed cluster

Prerequisites

  • JDK installed
  • ZooKeeper service running
  • HDFS service running
[root@HadoopNode00 ~]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
[root@HadoopNode00 ~]# cd /home/zk/zookeeper-3.4.6/
[root@HadoopNode00 zookeeper-3.4.6]# bin/zkServer.sh start conf/zk.cfg
JMX enabled by default
Using config: conf/zk.cfg
Starting zookeeper ... STARTED
[root@HadoopNode00 zookeeper-3.4.6]# start-dfs.sh
Starting namenodes on [HadoopNode00]
HadoopNode00: starting namenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-root-namenode-HadoopNode00.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-root-datanode-HadoopNode00.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-HadoopNode00.out
[root@HadoopNode00 zookeeper-3.4.6]# jps
2480 SecondaryNameNode
2113 NameNode
2231 DataNode
2681 Jps
1851 QuorumPeerMain

Installation

  • Upload the installation package
  • Extract and install
    [root@HadoopNode00 ~]# tar -zxf /root/hbase-1.2.4-bin.tar.gz -C /usr
    [root@HadoopNode00 ~]# cd /usr/hbase-1.2.4/
    [root@HadoopNode00 hbase-1.2.4]# ll
    total 308
    drwxr-xr-x. 4 root root 4096 Jan 29 2016 bin # command scripts
    -rw-r--r--. 1 root root 122439 Oct 26 2016 CHANGES.txt
    drwxr-xr-x. 2 root root 4096 Jan 29 2016 conf # configuration files
    drwxr-xr-x. 7 root root 4096 Feb 15 2017 hbase-webapps # HBase web UI webapps
    -rw-r--r--. 1 root root 261 Feb 15 2017 LEGAL
    drwxr-xr-x. 4 root root 4096 Dec 16 22:16 lib # dependency JARs
    -rw-r--r--. 1 root root 122699 Feb 15 2017 LICENSE.txt
    -rw-r--r--. 1 root root 42642 Feb 15 2017 NOTICE.txt
    -rw-r--r--. 1 root root 1477 Dec 27 2015 README.txt

Configuration

hbase-site.xml
[root@HadoopNode00 hbase-1.2.4]# vi conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://HadoopNode00:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>HadoopNode00</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
</configuration>
regionservers
[root@HadoopNode00 hbase-1.2.4]# vi conf/regionservers
HadoopNode00
Environment variables
[root@HadoopNode00 hbase-1.2.4]# vi ~/.bashrc
export HBASE_HOME=/usr/hbase-1.2.4
export JAVA_HOME=/home/java/jdk1.8.0_181
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PROTBUF_HOME=/home/protobuf/protobuf-2.5.0
export FINDBUGS_HOME=/home/findbugs/findbugs-3.0.1
export MAVEN_HOME=/home/maven/apache-maven-3.3.9
export M2_HOME=/home/maven/apache-maven-3.3.9
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:$M2_HOME/bin:$FINDBUGS_HOME/bin:$PROTBUF_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HBASE_HOME/bin
[root@HadoopNode00 hbase-1.2.4]# source ~/.bashrc

Start the service

[root@HadoopNode00 hbase-1.2.4]# start-hbase.sh
HadoopNode00: starting zookeeper, logging to /usr/hbase-1.2.4/logs/hbase-root-zookeeper-HadoopNode00.out
starting master, logging to /usr/hbase-1.2.4/logs/hbase-root-master-HadoopNode00.out
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
HadoopNode00: starting regionserver, logging to /usr/hbase-1.2.4/logs/hbase-root-regionserver-HadoopNode00.out
HadoopNode00: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
HadoopNode00: Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
[root@HadoopNode00 hbase-1.2.4]# jps
2480 SecondaryNameNode
2113 NameNode
8082 HRegionServer  # worker (region server) process
8438 Jps
7927 HMaster        # master process
2231 DataNode
1851 QuorumPeerMain

Access the web UI

http://hadoopnode00:16010/master-status

三、Usage

Shell Commands

Enter the shell
[root@HadoopNode00 hbase-1.2.4]# hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 1.2.4, rUnknown, Wed Feb 15 18:58:00 CST 2017

hbase(main):002:0* help
HBase Shell, version 1.2.4, rUnknown, Wed Feb 15 18:58:00 CST 2017
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

COMMAND GROUPS:
  Group name: general   # general-purpose commands
  Commands: status, table_help, version, whoami

  Group name: ddl       # table operations
  Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters

  Group name: namespace # similar to a database in MySQL; organizes tables
  Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

  Group name: dml       # data operations
  Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

  Group name: tools     # tool commands
  Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_rs, flush, major_compact, merge_region, move, normalize, normalizer_enabled, normalizer_switch, split, trace, unassign, wal_roll, zk_dump

  Group name: replication
  Commands: add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs

  Group name: snapshots
  Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot

  Group name: configuration
  Commands: update_all_config, update_config

  Group name: quotas
  Commands: list_quotas, set_quota

  Group name: security
  Commands: grant, list_security_capabilities, revoke, user_permission

  Group name: procedures
  Commands: abort_procedure, list_procedures

  Group name: visibility labels
  Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility
General commands
status
hbase(main):003:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load
table_help
hbase(main):004:0> table_help
Help for table-reference commands.

You can either create a table via 'create' and then manipulate the table via commands like 'put', 'get', etc.
See the standard help information for how to use each of these commands.

However, as of 0.96, you can also get a reference to a table, on which you can invoke commands.
For instance, you can get create a table and keep around a reference to it via:

   hbase> t = create 't', 'cf'

Or, if you have already created the table, you can get a reference to it:

   hbase> t = get_table 't'

You can do things like call 'put' on the table:

  hbase> t.put 'r', 'cf:q', 'v'

which puts a row 'r' with column family 'cf', qualifier 'q' and value 'v' into table t.

To read the data out, you can scan the table:

  hbase> t.scan

which will read all the rows in table 't'.

Essentially, any command that takes a table name can also be done via table reference.
Other commands include things like: get, delete, deleteall,
get_all_columns, get_counter, count, incr. These functions, along with
the standard JRuby object methods are also available via tab completion.

For more information on how to use each of these commands, you can also just type:

   hbase> t.help 'scan'

which will output more information on how to use that command.

You can also do general admin actions directly on a table; things like enable, disable,
flush and drop just by typing:

   hbase> t.enable
   hbase> t.flush
   hbase> t.disable
   hbase> t.drop

Note that after dropping a table, your reference to it becomes useless and further usage
is undefined (and not recommended).
version & whoami
hbase(main):005:0> version
1.2.4, rUnknown, Wed Feb 15 18:58:00 CST 2017
hbase(main):006:0> whoami
root (auth:SIMPLE)
    groups: root
NameSpace
alter_namespace

Modify a namespace

hbase(main):005:0> alter_namespace 'baizhi2',{METHOD=>'set','k1'=>'v1'}
0 row(s) in 0.0470 seconds
hbase(main):006:0> describe_namespace 'baizhi2'
DESCRIPTION
{NAME => 'baizhi2', AUTHOR => 'gaozhy', k1 => 'v1'}
1 row(s) in 0.0050 seconds
hbase(main):007:0> alter_namespace 'baizhi2',{METHOD=>'unset',NAME=>'k1'}
0 row(s) in 0.0230 seconds
hbase(main):008:0> describe_namespace 'baizhi2'
DESCRIPTION
{NAME => 'baizhi2', AUTHOR => 'gaozhy'}
1 row(s) in 0.0060 seconds
create_namespace

Create a namespace

hbase(main):009:0> create_namespace 'baizhi'
0 row(s) in 0.0620 seconds
hbase(main):010:0> create_namespace 'baizhi2',{'AUTHOR'=>'gaozhy'}
0 row(s) in 0.0200 seconds
describe_namespace

Show the details of a namespace

hbase(main):014:0> describe_namespace 'baizhi2'
DESCRIPTION
{NAME => 'baizhi2', AUTHOR => 'gaozhy'}
1 row(s) in 0.0190 seconds
drop_namespace

Drop a namespace

hbase(main):001:0> drop_namespace 'baizhi'
0 row(s) in 0.2090 seconds
hbase(main):002:0> list_namespace
NAMESPACE
baizhi2
default
hbase
3 row(s) in 0.0260 seconds
list_namespace

List all namespaces

hbase(main):011:0> list_namespace
NAMESPACE
baizhi
baizhi2
default
hbase
4 row(s) in 0.0320 seconds
list_namespace_tables

List the tables in a given namespace

hbase(main):003:0> list_namespace_tables 'hbase'
TABLE
meta
namespace
2 row(s) in 0.0250 seconds
DDL

Operations on HBase tables

alter

Alter a table

hbase(main):028:0> alter 'baizhi2:tt_user',{NAME => 'cf1', IN_MEMORY => true, VERSIONS => 3, TTL=>7200}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9080 seconds
hbase(main):029:0> describe 'baizhi2:tt_user'
Table baizhi2:tt_user is ENABLED
baizhi2:tt_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '7200 SECONDS (2 HOURS)', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0090 seconds
create

Create a table

create 'namespace:table_name','column_family_1','column_family_2', ...

hbase(main):013:0* create 't_user','cf1','cf2'
0 row(s) in 1.2860 seconds

=> Hbase::Table - t_user
hbase(main):014:0> list_namespace
list_namespace          list_namespace_tables
hbase(main):014:0> list_namespace_tables 'default'
TABLE
t_user
1 row(s) in 0.0120 seconds
hbase(main):015:0> create 'baizhi2:tt_user','cf1'
0 row(s) in 1.2210 seconds

=> Hbase::Table - baizhi2:tt_user
hbase(main):016:0> list_namespace_tables 'baizhi2'
TABLE
tt_user
1 row(s) in 0.0070 seconds
describe

Describe a table

describe 'namespace:table_name'

hbase(main):019:0* describe 'baizhi2:tt_user'
Table baizhi2:tt_user is ENABLED   # the table is enabled
baizhi2:tt_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0880 seconds
disable & disable_all
hbase(main):034:0> disable 'default:t_user'
0 row(s) in 2.2470 seconds
drop & drop_all

Note: a table must be disabled before it can be dropped.

hbase(main):033:0> drop 'default:t_user'

ERROR: Table default:t_user is enabled. Disable it first.

Here is some help for this command:
Drop the named table. Table must first be disabled:
  hbase> drop 't1'
  hbase> drop 'ns1:t1'

hbase(main):034:0> disable 'default:t_user'
0 row(s) in 2.2470 seconds
hbase(main):035:0> drop 'default:t_user'
0 row(s) in 1.2630 seconds
enable & enable_all

Enable a table

hbase(main):038:0> enable 'baizhi2:tt_user'
0 row(s) in 1.2460 seconds
exists

Check whether a table exists

hbase(main):039:0> exists 'baizhi2:tt_user'
Table baizhi2:tt_user does exist
0 row(s) in 0.0070 seconds
hbase(main):040:0> exists 'baizhi2:t_user'
Table baizhi2:t_user does not exist
0 row(s) in 0.0090 seconds
get_table

Get a reference to a table

hbase(main):044:0> t1 = get_table 'baizhi2:tt_user'
0 row(s) in 0.0010 seconds

=> Hbase::Table - baizhi2:tt_user




is_disabled & is_enabled
hbase(main):041:0> is_enabled 'baizhi2:tt_user'
true
0 row(s) in 0.0060 seconds
hbase(main):042:0> is_disabled 'baizhi2:tt_user'
false
0 row(s) in 0.0080 seconds
list

List the user-created tables in HBase

hbase(main):001:0> list
TABLE
baizhi2:tt_user
1 row(s) in 0.2270 seconds

DML

append
hbase(main):022:0> put 'baizhi2:tt_user','user001','cf1:name','zs'
0 row(s) in 0.0060 seconds
hbase(main):023:0> append 'baizhi2:tt_user','user001','cf1:sex',false
0 row(s) in 0.0160 seconds
hbase(main):024:0> scan 'baizhi2:tt_user'
ROW                              COLUMN+CELL
 user001                         column=cf1:name, timestamp=1576513629068, value=zs
 user001                         column=cf1:sex, timestamp=1576513670167, value=false
1 row(s) in 0.0100 seconds
hbase(main):025:0> append 'baizhi2:tt_user','user001','cf1:name',zss

NameError: undefined local variable or method `zss' for #<0x36c7cbe1>

hbase(main):026:0> append 'baizhi2:tt_user','user001','cf1:name','zss'
0 row(s) in 0.0150 seconds
hbase(main):027:0> scan 'baizhi2:tt_user'
ROW                              COLUMN+CELL
 user001                         column=cf1:name, timestamp=1576513706032, value=zszss
 user001                         column=cf1:sex, timestamp=1576513670167, value=false
1 row(s) in 0.0100 seconds
count

Return the number of rows in the table

hbase(main):015:0> count 'baizhi2:tt_user'
1 row(s) in 0.0360 seconds

=> 1
delete

Deletes only the contents of a single cell (optionally one specific version)

hbase(main):002:0> get 'baizhi2:tt_user', 'user001', {COLUMN => 'cf1:name',VERSIONS => 3}
COLUMN                           CELL
 cf1:name                        timestamp=1576512530631, value=zs4
 cf1:name                        timestamp=1576512345974, value=zs3
 cf1:name                        timestamp=1576512334603, value=zs2
3 row(s) in 0.2640 seconds
hbase(main):003:0> delete 'baizhi2:tt_user','user001','cf1:name',1576512334603
0 row(s) in 0.0580 seconds
hbase(main):004:0> get 'baizhi2:tt_user', 'user001', {COLUMN => 'cf1:name',VERSIONS => 3}
COLUMN                           CELL
 cf1:name                        timestamp=1576512530631, value=zs4
 cf1:name                        timestamp=1576512345974, value=zs3
2 row(s) in 0.0050 seconds
hbase(main):005:0> delete 'baizhi2:tt_user','user001','cf1:age'
0 row(s) in 0.0100 seconds
hbase(main):006:0> get 'baizhi2:tt_user','user001'
COLUMN                           CELL
 cf1:name                        timestamp=1576512530631, value=zs4
 cf1:sex                         timestamp=1576512020638, value=false
2 row(s) in 0.0050 seconds
deleteall

Delete an entire row

hbase(main):008:0> deleteall 'baizhi2:tt_user','user001'
0 row(s) in 0.0040 seconds
hbase(main):009:0> get 'baizhi2:tt_user','user001'
COLUMN                           CELL
0 row(s) in 0.0020 seconds
get [key command]

Query by RowKey:

  • query an entire row
  • query a single cell
  • query multiple versions of a cell

//-------------------------- get an entire row ------------------------------------
hbase(main):013:0* get 'baizhi2:tt_user','user001'
COLUMN                           CELL
 cf1:age                         timestamp=1576512063588, value=18
 cf1:name                        timestamp=1576512530631, value=zs4
 cf1:sex                         timestamp=1576512020638, value=false
3 row(s) in 0.0170 seconds

//-------------------------- get a single cell ------------------------------------
hbase(main):014:0> get 'baizhi2:tt_user','user001','cf1:name'
COLUMN                           CELL
 cf1:name                        timestamp=1576512530631, value=zs4
1 row(s) in 0.0040 seconds

//-------------------------- multi-version get ------------------------------------
hbase(main):011:0> get 'baizhi2:tt_user', 'user001', {COLUMN => 'cf1:name',VERSIONS => 3}
COLUMN                           CELL
 cf1:name                        timestamp=1576512530631, value=zs4
 cf1:name                        timestamp=1576512345974, value=zs3
 cf1:name                        timestamp=1576512334603, value=zs2
3 row(s) in 0.0110 seconds
put [key command]

Insert or update a record

//-------------------------- put values ------------------------------------
hbase(main):003:0> put 'baizhi2:tt_user','user001','cf1:name','zs'
0 row(s) in 0.1530 seconds
hbase(main):004:0> put 'baizhi2:tt_user','user001','cf1:sex',false
0 row(s) in 0.0090 seconds
hbase(main):001:0> put 'baizhi2:tt_user','user001','cf1:age',18
scan [key command]

Browse (scan) a table

Scan everything:

hbase(main):010:0> put 'baizhi2:tt_user','user001','cf1:name','zs'
0 row(s) in 0.0500 seconds
hbase(main):011:0> put 'baizhi2:tt_user','user002','cf1:name','ls'
0 row(s) in 0.0080 seconds
hbase(main):012:0> put 'baizhi2:tt_user','user003','cf1:name','ww'
0 row(s) in 0.0050 seconds
hbase(main):013:0> scan 'baizhi2:tt_user'
ROW                              COLUMN+CELL
 user001                         column=cf1:name, timestamp=1576513147139, value=zs
 user002                         column=cf1:name, timestamp=1576513158553, value=ls
 user003                         column=cf1:name, timestamp=1576513166882, value=ww
3 row(s) in 0.0260 seconds
truncate

Truncate a table

Deletes all of the table's data; more efficient than removing rows one by one with delete.

hbase(main):016:0> truncate 'baizhi2:tt_user'
Truncating 'baizhi2:tt_user' table (it may take a while):
 - Disabling table...
 - Truncating table...
0 row(s) in 3.4820 seconds
hbase(main):017:0> scan 'baizhi2:tt_user'
ROW                              COLUMN+CELL
0 row(s) in 0.1230 seconds
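The same truncation can also be issued from code through the Admin API; a minimal sketch, assuming an Admin instance obtained the same way as in the Java API section below and the baizhi2:tt_user table from the shell examples:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import java.io.IOException;

// Sketch: truncate a table via the Admin API (as in the shell, it must be disabled first).
public void truncateUserTable(Admin admin) throws IOException {
    TableName tn = TableName.valueOf("baizhi2:tt_user");
    admin.disableTable(tn);
    // preserveSplits = true keeps the existing region boundaries
    admin.truncateTable(tn, true);
}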

Java API operations

Maven dependencies
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-common</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-protocol</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
Application code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/**
 * Java API operations on the HBase NoSQL database
 */
public class HBaseTest {

    // admin object, used for DDL
    private Admin admin = null;
    // connection object, used for DML
    private Connection connection = null;

    /**
     * Initialization
     */
    @Before
    public void doBefore() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // ZK hostname: the HBase cluster entry point is stored in ZooKeeper,
        // so the client connects to ZK to discover it
        conf.set(HConstants.ZOOKEEPER_QUORUM, "HadoopNode00");
        conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181");
        connection = ConnectionFactory.createConnection(conf);
        admin = connection.getAdmin();
    }

    /**
     * Namespace operations
     */
    @Test
    public void test1() {
        NamespaceDescriptor namespaceDescriptor = NamespaceDescriptor.create("ns").build();
        try {
            // create
            admin.createNamespace(namespaceDescriptor);
            // admin.deleteNamespace("ns");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * DDL operations: create a table
     */
    @Test
    public void test2() {
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("ns:t_user"));
        HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
        cf1.setMaxVersions(3);
        HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
        cf2.setInMemory(true);
        cf2.setTimeToLive(3600 * 24 * 7);
        hTableDescriptor.addFamily(cf1);
        hTableDescriptor.addFamily(cf2);
        try {
            admin.createTable(hTableDescriptor);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * DDL operations: drop, alter and describe a table
     */
    @Test
    public void test3() {
        try {
            // admin.deleteTable(TableName.valueOf("ns:t_user"));
            // HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
            // cf1.setMaxVersions(5);
            // admin.modifyColumn(TableName.valueOf("ns:t_user"), cf1);
            HTableDescriptor tableDescriptor = admin.getTableDescriptor(TableName.valueOf("ns:t_user"));
            HColumnDescriptor[] columnFamilies = tableDescriptor.getColumnFamilies();
            for (HColumnDescriptor columnFamily : columnFamilies) {
                System.out.println(new String(columnFamily.getName()) + ", max versions: " + columnFamily.getMaxVersions());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * DML operations [put]
     * put 'ns:t_user','user001','cf1:name','zs'
     */
    @Test
    public void test4() throws IOException {
        Table table = connection.getTable(TableName.valueOf("ns:t_user"));
        // Put p1 = new Put("user001".getBytes());
        // p1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("zs1"));
        // // Note: Bytes is a utility class provided by HBase, mainly for serializing and deserializing objects
        // Put p2 = new Put(Bytes.toBytes("user002"));
        // p2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("ls1"));
        // table.put(Arrays.asList(p1, p2));
        Put p3 = new Put("person001".getBytes());
        p3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("wb"));
        table.put(Arrays.asList(p3));
    }

    /**
     * DML operations [get]
     * get 'ns:t_user','user001'
     * get 'baizhi2:tt_user', 'user001', {COLUMN => 'cf1:name',VERSIONS => 3}
     */
    @Test
    public void test5() throws IOException {
        Table table = connection.getTable(TableName.valueOf("ns:t_user"));
        Get get = new Get(Bytes.toBytes("user001"));
        /*
        Result result = table.get(get);
        String name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")));
        System.out.println("name=" + name);
        */
        get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
        get.setMaxVersions(3);
        Result result = table.get(get);
        List<Cell> cells = result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
        cells.forEach(cell -> {
            String name = Bytes.toString(cell.getValue());
            Long version = cell.getTimestamp();
            System.out.println(name + " " + version);
        });
    }

    /**
     * DML operations [delete]
     * delete 'baizhi2:tt_user','user001','cf1:name',1576512334603
     */
    @Test
    public void test6() throws IOException {
        Table table = connection.getTable(TableName.valueOf("ns:t_user"));
        Delete delete = new Delete(Bytes.toBytes("user002"));
        // deletes a single cell; to delete the whole row, omit the column information
        delete.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), 1576517018101L);
        table.delete(delete);
    }

    /**
     * DML operations [scan]
     * scan 'ns:t_user'
     */
    @Test
    public void test7() throws IOException {
        Table table = connection.getTable(TableName.valueOf("ns:t_user"));
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("user001"));
        scan.setStopRow(Bytes.toBytes("user003"));
        scan.setFilter(new PrefixFilter(Bytes.toBytes("user"))); // all rows whose rowkey starts with "user"
        // result set
        ResultScanner rs = table.getScanner(scan);
        for (Result result : rs) {
            String rowkey = Bytes.toString(result.getRow());
            String name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")));
            // Integer age = Bytes.toInt(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("age")));
            // Boolean sex = Bytes.toBoolean(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("sex")));
            System.out.println(rowkey + " " + name);
        }
    }

    /**
     * Release resources
     */
    @After
    public void doAfter() {
        if (admin != null) {
            try {
                admin.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (connection != null) {
            try {
                connection.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}


四、HBase On MapReduce

What problems does big data address?

  • Collecting massive amounts of data (automated data-collection tools)
  • Storing massive amounts of data (distributed storage clusters: HBase & HDFS)
  • Processing massive amounts of data (distributed computation: MapReduce)

Requirement: aggregate each user's yearly order total (the result RowKey is userId-year).

Insert test data

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.Arrays;

public class OrderDataTest {

    // admin object, used for DDL
    private Admin admin = null;
    // connection object, used for DML
    private Connection connection = null;

    /**
     * Initialization
     */
    @Before
    public void doBefore() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // ZK hostname: the HBase cluster entry point is stored in ZooKeeper,
        // so the client connects to ZK to discover it
        conf.set(HConstants.ZOOKEEPER_QUORUM, "HadoopNode00");
        conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181");
        connection = ConnectionFactory.createConnection(conf);
        admin = connection.getAdmin();
    }

    @Test
    public void test1() throws IOException {
        // expected yearly totals:
        // user001:2019  5000
        // user001:2018  4000
        // user002:2019  11000
        // user003:2019  7000
        Table table = connection.getTable(TableName.valueOf("baizhi2:t_order"));
        Put p1 = new Put(Bytes.toBytes("201911111203111:user001"));
        p1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(2000.0D));
        Put p2 = new Put(Bytes.toBytes("201911111204003:user001"));
        p2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(3000.0D));
        Put p3 = new Put(Bytes.toBytes("201802031204003:user001"));
        p3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(4000.0D));
        Put p4 = new Put(Bytes.toBytes("201911111204003:user002"));
        p4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(5000.0D));
        Put p5 = new Put(Bytes.toBytes("201912121201003:user002"));
        p5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(6000.0D));
        Put p6 = new Put(Bytes.toBytes("201911111204003:user003"));
        p6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(7000.0D));
        table.put(Arrays.asList(p1, p2, p3, p4, p5, p6));
    }

    /**
     * Release resources
     */
    @After
    public void doAfter() {
        if (admin != null) {
            try {
                admin.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (connection != null) {
            try {
                connection.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Develop a MapReduce application to compute the order data

MyMapper
package job;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

import java.io.IOException;

/**
 * Map task: reads rows from HBase and maps them to key/value pairs
 */
public class MyMapper extends TableMapper<Text, DoubleWritable> {

    /**
     * @param key     rowkey
     * @param value   one row from HBase
     */
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
        // e.g. 201911111203111:user001
        String rowkey = Bytes.toString(key.get());
        String[] strs = rowkey.split(":");
        String userId = strs[1];
        String year = strs[0].substring(0, 4);
        String k = userId + "-" + year;  // user001-2019
        Double v = Bytes.toDouble(value.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("total")));
        context.write(new Text(k), new DoubleWritable(v));
    }
}
MyReducer
package job;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

import java.io.IOException;

/**
 * Receives the map task output and writes the aggregated result back to HBase
 */
public class MyReducer extends TableReducer<Text, DoubleWritable, NullWritable> {
    // Note: when writing to HBase the reducer's output key is ignored

    /**
     * @param key    userid-year
     * @param values [total, total, ...]
     */
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        double sum = 0.0D;
        for (DoubleWritable value : values) {
            sum += value.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("yeartotal"), Bytes.toBytes(sum));
        context.write(null, put);
    }
}


MyJob
package job;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

public class MyJob {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(HConstants.ZOOKEEPER_QUORUM, "HadoopNode00");
        conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181");

        Job job = Job.getInstance(conf, "hbase on mapreduce");
        job.setJarByClass(MyJob.class);

        // set the InputFormat and OutputFormat
        job.setInputFormatClass(TableInputFormat.class);
        job.setOutputFormatClass(TableOutputFormat.class);

        // only process orders whose rowkey starts with "2019"
        Scan scan = new Scan();
        scan.setFilter(new PrefixFilter("2019".getBytes()));

        // initialize the map task
        TableMapReduceUtil.initTableMapperJob(
                TableName.valueOf("baizhi2:t_order"),
                scan,
                MyMapper.class,
                Text.class,
                DoubleWritable.class,
                job);

        // initialize the reduce task
        TableMapReduceUtil.initTableReducerJob(
                "baizhi2:t_result",
                MyReducer.class,
                job);

        // submit the job
        job.waitForCompletion(true);
    }
}
View the results
log4j:WARN No appenders could be found for logger (org.apache.hadoop.security.Groups).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
user001-2019    5000.0
user002-2019    11000.0
user003-2019    7000.0

Process finished with exit code 0




五、HBase HA Fully Distributed Cluster

Prerequisites

# 1. The HDFS and ZooKeeper cluster services must be running
[root@HadoopNodeX zookeeper-3.4.6]# bin/zkServer.sh start conf/zook.cfg
[root@HadoopNodeX zookeeper-3.4.6]# bin/zkServer.sh status conf/zook.cfg
JMX enabled by default
Using config: conf/zook.cfg
Mode: follower

# 2. Synchronize the cluster time
[root@HadoopNode0X zookeeper-3.4.6]# date
Fri Dec 13 19:07:49 CST 2019
[root@HadoopNode0X zookeeper-3.4.6]# date -s '2019-12-17 11:33:30'
Tue Dec 17 11:33:30 CST 2019
[root@HadoopNode0X zookeeper-3.4.6]# clock -w

# 3. Start the HDFS cluster
[root@HadoopNode01 zookeeper-3.4.6]# start-dfs.sh

Install HBase

[root@HadoopNode01 zookeeper-3.4.6]# scp ~/hbase-1.2.4-bin.tar.gz root@HadoopNode02:~
[root@HadoopNode01 zookeeper-3.4.6]# scp ~/hbase-1.2.4-bin.tar.gz root@HadoopNode03:~
[root@HadoopNode0X zookeeper-3.4.6]# tar -zxf ~/hbase-1.2.4-bin.tar.gz -C /usr

Configure HBase

hbase-site.xml
[root@HadoopNode0X zookeeper-3.4.6]# vi /usr/hbase-1.2.4/conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://mycluster/hbase2</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>HadoopNode01,HadoopNode02,HadoopNode03</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
</configuration>
regionservers
[root@HadoopNode0X zookeeper-3.4.6]# vi /usr/hbase-1.2.4/conf/regionservers
HadoopNode01
HadoopNode02
HadoopNode03
Configure environment variables
[root@HadoopNode0X zookeeper-3.4.6]# vi ~/.bashrc
export HBASE_HOME=/usr/hbase-1.2.4
export JAVA_HOME=/home/java/jdk1.8.0_181
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PROTBUF_HOME=/home/protobuf/protobuf-2.5.0
export FINDBUGS_HOME=/home/findbugs/findbugs-3.0.1
export MAVEN_HOME=/home/maven/apache-maven-3.3.9
export M2_HOME=/home/maven/apache-maven-3.3.9
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:$M2_HOME/bin:$FINDBUGS_HOME/bin:$PROTBUF_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HBASE_HOME/bin
[root@HadoopNode01 ~]# source ~/.bashrc

Start the HBase HA cluster

HA: high-availability cluster

[root@HadoopNode0X ~]# start-hbase.sh

六、HBase Architecture

HBase uses a Master/Slave architecture and belongs to the Hadoop ecosystem. A cluster consists of the following node types: HMaster nodes, HRegionServer nodes and a ZooKeeper ensemble. Underneath, data is stored in HDFS, so HDFS's NameNode and DataNodes are also involved.

The HMaster node is used to:

  1. Manage HRegionServers and balance HRegion load across them.
  2. Manage and assign HRegions, e.g. assign new HRegions after a split, and migrate HRegions to other HRegionServers when an HRegionServer goes down.
  3. Perform DDL operations (Data Definition Language: creating/dropping/altering namespaces and tables, adding/removing/altering column families, etc.).
  4. Manage namespace and table metadata (the data itself is stored on HDFS).
  5. Enforce access control (ACLs).

Summary:

The HMaster is mainly responsible for cluster management, HRegion load balancing and failure handling, DDL, namespaces and access control.

The HRegionServer node is used to:

  1. Host and manage its local HRegions.
  2. Read and write HDFS and manage the data in tables. [Table data is always stored on the worker nodes.]
  3. Serve client reads and writes directly (the client first uses the metadata to find the HRegion/HRegionServer for a RowKey, then talks to that HRegionServer).

Summary:

The HRegionServer is responsible for the actual data reads and writes, managing its HRegions, and exchanging data with HDFS.

The ZooKeeper ensemble is used to:

  1. Store the metadata of the entire HBase cluster and the cluster's state information.
  2. Implement failover between HMaster nodes.
  3. Handle the dynamic switch between active and standby nodes in an HA cluster.

HRegion

HBase splits a table horizontally into multiple HRegions by RowKey. Each HRegion records its StartKey and EndKey (the first HRegion has an empty StartKey and the last has an empty EndKey). Because RowKeys are sorted, the client can quickly determine which HRegion a given RowKey belongs to. The HMaster assigns each HRegion to an HRegionServer, which is then responsible for starting and managing the HRegion, communicating with clients, and reading the data (from HDFS).

Read path for the command get 'baizhi2:tt_user','user001':

  1. The client first connects to ZooKeeper and obtains the address of the HRegion hosting the metadata table.
  2. By reading the metadata table's HRegion, it locates the Region that holds the RowKey.
  3. It then reads/writes that Region directly to complete the operation.
  4. When reading, the HStore's MemStore is searched first; if the data is not found there, the StoreFiles are consulted.
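A client-side sketch of the lookup described above, using RegionLocator to ask which HRegion/HRegionServer serves a given RowKey (the ZooKeeper/meta lookup and caching happen inside the client library; "connection" is a Connection set up as in the Java API section):

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;

// Sketch: locate the region that serves rowkey "user001" of baizhi2:tt_user.
public void locateRow(Connection connection) throws IOException {
    try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf("baizhi2:tt_user"))) {
        HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("user001"));
        System.out.println("served by: " + loc.getHostname());
        System.out.println("startKey : " + Bytes.toStringBinary(loc.getRegionInfo().getStartKey()));
        System.out.println("endKey   : " + Bytes.toStringBinary(loc.getRegionInfo().getEndKey()));
    }
}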

HMaster

The HMaster is not a single point of failure: several HMasters can be started, and ZooKeeper's master-election mechanism guarantees that only one is Active at a time while the others remain in hot standby. Usually two HMasters are started; the non-Active HMaster periodically talks to the Active HMaster to fetch its latest state so that it stays up to date, which is also why running many HMasters only adds load on the Active one. Its responsibilities fall into two main areas:

  • Managing and coordinating HRegionServers
  • Managing HRegion assignment, including load balancing and re-assigning HRegions during recovery
  • Monitoring the state of all HRegionServers in the cluster (via heartbeats and by watching state in ZooKeeper)
  • Admin duties: namespaces
  • Creating, deleting and altering table definitions (DDL)

ZooKeeper

ZooKeeper provides coordination services for the HBase cluster. It tracks the state of the HMaster and HRegionServers (available/alive, etc.) and notifies the HMaster when one of them goes down, so that the HMaster can fail over to another HMaster or repair the HRegions of the failed HRegionServer (reassigning them to other HRegionServers). The ZooKeeper ensemble itself uses a consensus protocol (ZAB, a Paxos-like protocol) to keep the state of its nodes consistent.

ZooKeeper coordinates the information shared by all nodes in the cluster. When the HMaster and HRegionServers connect to ZooKeeper they create ephemeral nodes and keep them alive through heartbeats; if an ephemeral node expires, the HMaster is notified and reacts accordingly.

In addition, the HMaster watches the ephemeral nodes under /hbase/rs/* (the default) to track HRegionServers joining and dying. The first HMaster to connect to ZooKeeper creates the ephemeral node /hbase/master (the default) to mark itself as the Active HMaster; HMasters that join later watch that node. If the Active HMaster dies, the node disappears and the other HMasters are notified, and one of them becomes the new Active HMaster; before becoming Active, each standby creates its own ephemeral node under /hbase/backup-masters/.

HRegionServer in Detail

An HRegionServer usually runs on the same machine as a DataNode to achieve data locality. It hosts multiple HRegions (on the order of 0~1000) and is made up of the WAL (HLog), the BlockCache, MemStores and HFiles.

WAL

WAL stands for Write Ahead Log; in earlier versions it was called the HLog. It is a file on HDFS, and as the name suggests, every write is first persisted to this log before the MemStore is actually updated. This guarantees that if the HRegionServer crashes, the log can still be read and all operations replayed so that no data is lost. The log is rolled periodically: new files are created and old ones deleted (entries already persisted to HFiles can be removed). WAL files are stored under /hbase/WALs/${HRegionServer_Name} (before 0.94 they were under /hbase/.logs/). Normally an HRegionServer has a single WAL instance, so all WAL writes on that server are serialized (much like log4j's log writes), which can become a performance bottleneck; since HBase 1.0, HBASE-5699 introduced MultiWAL, which writes to multiple WALs in parallel using multiple HDFS pipelines, partitioned per HRegion.
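From the client side, the only direct control over the WAL is the durability level of a mutation. A minimal sketch (the default already writes to the WAL; SKIP_WAL trades safety for speed, and "connection" is a Connection set up as in the Java API section):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;

// Sketch: choose how a Put interacts with the WAL.
public void putWithDurability(Connection connection) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("baizhi2:tt_user"))) {
        Put put = new Put(Bytes.toBytes("user001"));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("zs"));
        // SKIP_WAL: faster, but the edit is lost if the RegionServer crashes before a flush.
        // SYNC_WAL / FSYNC_WAL: the write returns only after the WAL entry is persisted.
        put.setDurability(Durability.SYNC_WAL);
        table.put(put);
    }
}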

BlockCache

The BlockCache is a read cache. It exploits the locality-of-reference principle (also used for CPUs, split into spatial and temporal locality: spatial locality means that if the CPU needs a piece of data at one moment, the data it needs next is very likely nearby; temporal locality means that once a piece of data has been accessed, it is likely to be accessed again soon), prefetching data into memory to improve read performance. HBase ships two BlockCache implementations: the default on-heap LruBlockCache and the BucketCache (usually off-heap). BucketCache generally performs worse than LruBlockCache, but because of GC pressure the latency of LruBlockCache can become unstable, whereas BucketCache manages its own cache memory and is not subject to GC, so its latency is usually more stable; this is why BucketCache is sometimes preferred.

HRegion

An HRegion is the representation, on one HRegionServer, of one Region of a Table. A Table can have one or more Regions, which may sit on the same HRegionServer or be spread across different ones; one HRegionServer hosts multiple HRegions belonging to different Tables. An HRegion is made up of multiple Stores (HStores); each HStore corresponds to one Column Family of the Table within that HRegion, i.e. each Column Family is a separate storage unit. Columns with similar IO characteristics should therefore be placed in the same Column Family for efficient reads (data locality improves the cache hit rate). The HStore is the core of HBase storage: it implements reading and writing against HDFS and consists of one MemStore and zero or more StoreFiles.

  • The MemStore is a write cache (an in-memory sorted buffer). Every write, once its WAL entry has been written, goes into the MemStore, which flushes the data to the underlying HDFS files (HFiles) according to its flush policy. Normally each Column Family of each HRegion has its own MemStore (a flush can also be triggered manually; see the sketch after this list).
  • The HFile (StoreFile) stores HBase's actual data (Cells/KeyValues). Data in an HFile is sorted by RowKey, Column Family and Column; cells that share all three are ordered by timestamp, newest first.
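A flush normally happens automatically when a MemStore reaches its configured size, but it can also be requested explicitly through the Admin API, which is handy when observing the MemStore-to-HFile path; a minimal sketch, with "admin" obtained as in the Java API section:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import java.io.IOException;

// Sketch: force the MemStores of a table to be flushed to HFiles on HDFS.
public void flushTable(Admin admin) throws IOException {
    admin.flush(TableName.valueOf("baizhi2:tt_user"));
    // a major compaction would then merge the resulting StoreFiles:
    // admin.majorCompact(TableName.valueOf("baizhi2:tt_user"));
}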

Further reading (in English): https://mapr.com/blog/in-depth-look-hbase-architecture/

七、HBase Pre-splitting

The cause of HBase hot-spotting: a large share of read/write requests hits one or a few RegionServers in the cluster, driving their load up sharply; performance degrades and, in the worst case, the service goes down.

Pre-splitting (as in BigTable)

t_user ---> r1 [null ~ user100]  r2 [user101 ~ user1000]  ...  rn [... ~ null]

Syntax

create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']

Creates 5 pre-split regions:
r1: null ~ 10
r2: 10 ~ 20
r3: 20 ~ 30
r4: 30 ~ 40
r5: 40 ~ null

rowkey 8   --> r1
rowkey 33  --> r4
Example

Create tt_user with SPLITS => ['user1000','user2000','user3000']:

hbase(main):003:0> create 'default:tt_user','cf1',SPLITS => ['user1000','user2000','user3000']
0 row(s) in 1.2960 seconds

=> Hbase::Table - tt_user

r1: null ~ user1000       hadoopnode03
r2: user1000 ~ user2000   hadoopnode01
r3: user2000 ~ user3000   hadoopnode02
r4: user3000 ~ null       hadoopnode01
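The same pre-split table can be created from code by passing explicit split keys to the Admin API; a sketch assuming the admin setup from the Java API section:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;

// Sketch: create 'default:tt_user' with 4 pre-split regions, equivalent to
// create 'default:tt_user','cf1',SPLITS => ['user1000','user2000','user3000']
public void createPreSplitTable(Admin admin) throws IOException {
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("default:tt_user"));
    desc.addFamily(new HColumnDescriptor("cf1"));
    byte[][] splitKeys = {
            Bytes.toBytes("user1000"),
            Bytes.toBytes("user2000"),
            Bytes.toBytes("user3000")
    };
    admin.createTable(desc, splitKeys);
}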

八、RowKey Design

Design principles: uniqueness, ordering, length, and hashing (dispersion).

Uniqueness

The RowKey uniquely identifies a row in HBase; it must be unique and never repeated.

Ordering

RowKeys are automatically sorted in lexicographic (dictionary) order. For example, for live-stream chat messages (danmu) the RowKey can be designed as roomId:timestamp so that each room's messages are stored together in time order.

Length

HBase allows a RowKey of up to 64 KB, but it is recommended to keep it within 16 bytes.

50-byte RowKeys × 100 million rows ≈ 5 GB (about 4.7 GiB) of RowKey data alone:

  • it wastes memory resources
  • it reduces the MemStore's effective storage space

Hashing (dispersion)

Spread the data across multiple HBase RegionServers to avoid hot-spotting. Common techniques:

  • RowKey reversal — reverse the rowkey so the fast-changing suffix leads. Phone numbers, for example:
    158xxxx0000  158xxxx0001  158xxxx1002  158xxxx3003
    -------reverse-------
    0000xxxx851  1000xxxx851  2001xxxx851  ...
  • RowKey + salt — prefix the rowkey with a bucket value so that rows land in different regions:
    abc01  abc02  abc03  abc04
    -------salt------- (table pre-split into four regions a, b, c, d)
    a:abc01  b:abc02  c:abc03  d:abc04
    (or with pre-split user buckets: user1000:abc01  user2000:abc02  user3000:abc03)
  • RowKey + MD5 — prefix the rowkey with the leading characters of its MD5 digest:
    MD5(abc01) ---> 32-character hex string, keep the first 4 characters, e.g. 3def
    MD5(abc02) ---> 7d3a
    3def:abc01  7d3a:abc02
    (a Java sketch of all three techniques follows this list)
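A sketch of the three techniques above as plain Java helpers; the 4-bucket salt and the 4-character MD5 prefix are illustrative choices, not fixed rules:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: three ways to spread otherwise-sequential rowkeys across regions.
public class RowKeyUtil {

    // 1. Reverse: 158xxxx0000 -> 0000xxxx851, so the fast-changing suffix leads.
    public static String reverse(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }

    // 2. Salt: prefix with a bucket derived from the key, e.g. 4 buckets a~d.
    public static String salt(String rowkey, int buckets) {
        char bucket = (char) ('a' + (Math.abs(rowkey.hashCode()) % buckets));
        return bucket + ":" + rowkey;
    }

    // 3. MD5 prefix: keep the first 4 hex characters of the MD5 digest as the prefix.
    public static String md5Prefix(String rowkey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(rowkey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 2; i++) {                  // 2 bytes -> 4 hex chars
                hex.append(String.format("%02x", digest[i]));
            }
            return hex + ":" + rowkey;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reverse("158xxxx0000"));   // 0000xxxx851
        System.out.println(salt("abc01", 4));         // e.g. c:abc01 (bucket depends on the hash)
        System.out.println(md5Prefix("abc01"));       // prefix depends on the digest
    }
}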

