This post belongs to the Hadoop environment setup series. Tencent Cloud and Baidu Cloud both offer ready-made environments you can use directly, but building one yourself will teach you quite a bit more.
Contents
(1) Software environment preparation
(2) ZooKeeper installation
(3) HBase installation and configuration
(4) HBase usage test
(1) Software environment preparation
Hadoop runtime environment: a machine on which Hadoop is already up and running. See my previous post:
A detailed guide to installing and configuring Hadoop 3.1.2 in standalone, pseudo-distributed, and fully distributed modes
HBase package: version 2.2.0 is used below; it can be fetched from an Apache mirror such as http://mirror.bit.edu.cn/apache/hadoop/common/ (note this link points at the Hadoop directory; HBase releases live under the mirror's hbase directory)
ZooKeeper package: version 3.4.6 is used
(2) ZooKeeper installation
First install ZooKeeper. Extract the tar package from the soft directory into modules under the home directory, alongside the Hadoop installation:
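The unpack step can be sketched as follows. The archive name and the soft/modules layout are taken from this article; the block builds a throwaway archive first so it can be run anywhere, while on the real machine only the single tar command in the comment is needed.

```shell
# Sketch of the unpack step. On the real machine the command is simply:
#   tar -zxvf ~/soft/zookeeper-3.4.6.tar.gz -C ~/modules/
# A throwaway archive is created here so the demonstration is self-contained.
WORK="$(mktemp -d)"
mkdir -p "$WORK/soft/zookeeper-3.4.6/bin" "$WORK/modules"
tar -czf "$WORK/soft/zookeeper-3.4.6.tar.gz" -C "$WORK/soft" zookeeper-3.4.6
tar -zxvf "$WORK/soft/zookeeper-3.4.6.tar.gz" -C "$WORK/modules"
ls "$WORK/modules"
```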
After extraction, update the environment variables and the configuration file. First add ZooKeeper to the environment: as with the components installed earlier, append its path to the .bashrc file.
#setting for zookeeper
export ZOOKEEPER_HOME=/home/hadoop/modules/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin
After saving, run the source command to make the settings take effect.
The ZooKeeper configuration files live in the conf folder under the installation directory.
Rename zoo_sample.cfg to zoo.cfg, then set the data and log directories (dataDir and dataLogDir below), and register the server node and its ports. This walkthrough is pseudo-distributed, so there is only one node.
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/hadoop/tmp/zookeeper/data
dataLogDir=/home/hadoop/tmp/zookeeper/logs
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1=big01:2888:3888
After editing, save zoo.cfg.
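One step zoo.cfg alone does not cover: when server.N entries are present, ZooKeeper expects a matching myid file in dataDir (with a single entry some versions fall back to standalone mode, but creating the file is harmless and avoids a startup error in quorum mode). A minimal sketch, where the demo path is a stand-in for the article's /home/hadoop/tmp/zookeeper/data:

```shell
# Create the data directory and the myid file matching the server.N entry.
# ZK_DATA_DIR is a stand-in; on the real machine it is /home/hadoop/tmp/zookeeper/data.
ZK_DATA_DIR="${ZK_DATA_DIR:-/tmp/zk-demo/data}"
mkdir -p "$ZK_DATA_DIR"
echo 1 > "$ZK_DATA_DIR/myid"   # must match the N in server.1=big01:2888:3888
cat "$ZK_DATA_DIR/myid"
```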
Then the ZooKeeper service can be started. Go to the bin folder under the installation directory and run:
./zkServer.sh start
You can also run ./zkServer.sh status to check the state of the ZooKeeper process:
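As a small illustration, the Mode line of the status output can be extracted with sed; in this single-node setup the expected value is standalone. The captured text below is a stand-in for real zkServer.sh output:

```shell
# Extract the "Mode:" line from `zkServer.sh status` output.
# The quoted text stands in for real output on the server.
zk_mode() { printf '%s\n' "$1" | sed -n 's/^Mode: //p'; }
zk_mode "ZooKeeper JMX enabled by default
Using config: /home/hadoop/modules/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: standalone"
```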
(3) HBase installation and configuration
1. Extract the package:
[hadoop@master ~]$ tar -zxvf hbase-2.2.0-bin.tar.gz
2. Set the environment variables:
[root@master ~]# vi /etc/profile
#setting for hbase
export HBASE_HOME=/home/hadoop/hbase-2.2.0
export PATH=$HBASE_HOME/bin:$PATH
After saving, run source /etc/profile to make the settings take effect.
3. Edit the configuration files. Go to the conf folder under the HBase installation directory; the files to modify are hbase-env.sh, hbase-site.xml, and regionservers.
[hadoop@master]$ cd hbase-2.2.0/conf
[hadoop@master]$ vi hbase-env.sh
# The java implementation to use. Java 1.8+ required.
export JAVA_HOME=/home/hadoop/jdk1.8.0_11
# Extra Java CLASSPATH elements. Optional.
export HBASE_CLASSPATH=/home/hadoop/hbase-2.2.0/conf
Next, edit hbase-site.xml:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
<description>The ZooKeeper quorum hosts.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.master.info.port</name>
<value>16010</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
</configuration>
Edit the regionservers file to set the RegionServer nodes.
Open the file with vi; the default entry is localhost. Change it to this machine's hostname, which in this walkthrough is master.
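That edit can also be done non-interactively. A sketch, where CONF_DIR is a stand-in for the real /home/hadoop/hbase-2.2.0/conf:

```shell
# Write this machine's hostname into the regionservers file,
# replacing the default "localhost". CONF_DIR is a stand-in path;
# on the real machine it is /home/hadoop/hbase-2.2.0/conf.
CONF_DIR="${CONF_DIR:-/tmp/hbase-conf-demo}"
mkdir -p "$CONF_DIR"
hostname > "$CONF_DIR/regionservers"
cat "$CONF_DIR/regionservers"
```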
4. Start HBase by running ./start-hbase.sh in the bin directory:
[hadoop@master bin]$ ./start-hbase.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hbase-2.2.0/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hbase-2.2.0/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
localhost: running zookeeper, logging to /home/hadoop/hbase-2.2.0/bin/../logs/hbase-hadoop-zookeeper-master.out
running master, logging to /home/hadoop/hbase-2.2.0/bin/../logs/hbase-hadoop-master-master.out
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hbase-2.2.0/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
: running regionserver, logging to /home/hadoop/hbase-2.2.0/bin/../logs/hbase-hadoop-regionserver-master.out
Checking the processes gives:
[hadoop@master bin]$ jps
10595 NodeManager
19224 HMaster
10473 ResourceManager
10090 DataNode
19642 Jps
9947 NameNode
19371 HRegionServer
19167 HQuorumPeer
HMaster, HRegionServer, and HQuorumPeer are all HBase processes, so HBase has started normally.
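The check can be scripted. The helper below scans a jps listing for the three expected HBase process names; the sample string is a stand-in for live jps output:

```shell
# Report any of the three expected HBase processes missing from a jps listing.
check_hbase_procs() {
  out="$1"; missing=""
  for p in HMaster HRegionServer HQuorumPeer; do
    case "$out" in *"$p"*) ;; *) missing="$missing $p" ;; esac
  done
  if [ -z "$missing" ]; then echo "ok"; else echo "missing:$missing"; fi
}
# Sample listing (stand-in for `jps` on the running cluster):
check_hbase_procs "19224 HMaster
19371 HRegionServer
19167 HQuorumPeer"
```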
5. Status can also be checked from the web UI. hbase-site.xml set the port to 16010, so open the machine's IP address with that port in an external browser.
Common errors:
1. If startup fails with:
java.lang.IllegalStateException: The procedure WAL relies on the ability to hsync for proper operation during component failures, but the underlying filesystem does not support doing so. Please check the config value of 'hbase.procedure.store.wal.use.hsync' to set the desired level of robustness and ensure the config value of 'hbase.wal.dir' points to a FileSystem mount that can provide it.
then add the following to hbase-site.xml:
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
2. If running hbase shell reports: ERROR: KeeperErrorCode = NoNode for /hbase/meta-region-server
then ZooKeeper is not running properly.
3. If running HBase reports: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper
then add the following to hbase-site.xml:
<property>
<name>hbase.wal.provider</name>
<value>filesystem</value>
</property>
(4) HBase usage test
HBase is a typical NoSQL database. Like Redis and MongoDB, it has no rigid schema definition. Unlike a relational database, where the model is fully specified and normalized and data is stored and read row by row, HBase stores and reads by column: each cell has a column name and a value, plus a timestamped version number, so a single column can keep several versions of its value. HBase is built for distributed storage of genuinely large data sets, so unless your data volume is big enough to play to its strengths, using it for ordinary amounts of data is overkill.
1. Enter hbase shell from the current user's home directory to start working with HBase:
[hadoop@master ~]$ hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hbase-2.2.0/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.0, rUnknown, Tue Jun 11 04:30:30 UTC 2019
Took 0.0016 seconds
hbase(main):001:0> exit
Once the hbase(main):001:0> prompt appears, you can type commands after it.
Type help to see the available commands:
hbase(main):003:0> help
HBase Shell, version 2.2.0, rUnknown, Tue Jun 11 04:30:30 UTC 2019
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.
COMMAND GROUPS:
Group name: general
Commands: processlist, status, table_help, version, whoami
Group name: ddl
Commands: alter, alter_async, alter_status, clone_table_schema, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, list_regions, locate_region, show_filters
Group name: namespace
Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Group name: dml
Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve
2. Create a namespace. In HBase the notion of a database is replaced by a namespace, which you can think of as a collection for one business domain or project. Create one with create_namespace, list the existing ones with list_namespace, and delete one with drop_namespace.
hbase(main):004:0> list_namespace
NAMESPACE
default
hbase
stuinfo
3 row(s)
Took 0.0254 seconds
hbase(main):005:0> create_namespace 'sinaWeiboData'
Took 0.4083 seconds
hbase(main):006:0> list_namespace
NAMESPACE
default
hbase
sinaWeiboData
stuinfo
4 row(s)
Took 0.0186 seconds
3. Table operations within a namespace. With a namespace such as sinaWeiboData in place, you can add tables of records to it. Because of the column-oriented storage model, what you declare at creation time are the column families.
Use create 'namespace:table', 'family1', 'family2'. After creation, the structure can be inspected with describe:
hbase(main):010:0> create 'sinaWeiboData:logs','user','record'
Created table sinaWeiboData:logs
Took 2.4328 seconds
=> Hbase::Table - sinaWeiboData:logs
hbase(main):011:0> list
TABLE
sinaWeiboData:logs
user
2 row(s)
Took 0.0075 seconds
=> ["sinaWeiboData:logs", "user"]
hbase(main):013:0> describe 'sinaWeiboData:logs'
Table sinaWeiboData:logs is ENABLED
sinaWeiboData:logs
COLUMN FAMILIES DESCRIPTION
{NAME => 'record', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WR
ITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_W
RITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'tru
e', BLOCKSIZE => '65536'}
{NAME => 'user', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRIT
E => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRI
TE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true'
, BLOCKSIZE => '65536'}
2 row(s)
QUOTAS
0 row(s)
Took 0.3043 seconds
4. Insert records. HBase adds data with the put command. Note that one put writes a single cell: one column of one row of one table. Inserting data cell by cell from the shell is therefore very inefficient; in practice, data is loaded programmatically.
For example, first put a value into row 1001 under the user column family of the logs table, then put one under the record family of the same row:
hbase(main):002:0> put 'sinaWeiboData:logs','1001','user','caojianhua'
Took 0.2171 seconds
hbase(main):003:0> put 'sinaWeiboData:logs','1001','record','visiting all the news and be focused by 333 fans'
Took 0.0106 seconds
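Since each put writes a single cell, bulk loading from the shell means generating many put statements. A sketch that emits such a batch, which could be piped into hbase shell's non-interactive mode (hbase shell -n); the 1002 row is a hypothetical extra record:

```shell
# Emit one put statement per (rowkey, column-family, value) triple.
# The output could be piped into: hbase shell -n
gen_puts() {
  table="$1"; shift
  while [ "$#" -ge 3 ]; do
    printf "put '%s','%s','%s','%s'\n" "$table" "$1" "$2" "$3"
    shift 3
  done
}
gen_puts 'sinaWeiboData:logs' \
  1001 user caojianhua \
  1002 user another_user
```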
5. Read records with the get command. Format: get 'table', 'rowkey' (optionally followed by a column).
hbase(main):004:0> get 'sinaWeiboData:logs','1001'
COLUMN CELL
record: timestamp=1581038283677, value=visiting all the news and be focused by 333 fans
user: timestamp=1581038245344, value=caojianhua
1 row(s)
Took 0.0481 seconds
You can also fetch data with a scan, though this gets expensive when the table is large:
hbase(main):005:0> scan 'sinaWeiboData:logs'
ROW COLUMN+CELL
1001 column=record:, timestamp=1581038283677, value=visiting all the news and be focused by 333 fans
1001 column=user:, timestamp=1581038245344, value=caojianhua
1 row(s)
Took 0.0463 seconds
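Scans can be bounded so they stay cheap on a large table; LIMIT is a standard scan option that caps the number of rows returned. A small sketch that emits such a command for non-interactive use:

```shell
# Emit a bounded scan command; LIMIT caps the number of rows returned.
bounded_scan() { printf "scan '%s', {LIMIT => %d}\n" "$1" "$2"; }
bounded_scan 'sinaWeiboData:logs' 10
# On the live cluster: bounded_scan 'sinaWeiboData:logs' 10 | hbase shell -n
```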