HBase and HDFS small files: HBase shell and file tools

Reposted

mob64ca1403c772 2023-08-18 21:55:33


1. Basic operations

# Start and stop
Start HBase: ./bin/start-hbase.sh
Stop HBase: ./bin/stop-hbase.sh
Open the HBase shell: ./bin/hbase shell
List available commands: help
Show the current user and its groups: whoami
# Namespaces
Create a namespace: create_namespace 'ns1', {'PROPERTY_NAME'=>'PROPERTY_VALUE'}
Describe a namespace: describe_namespace 'ns1'
List all namespaces: list_namespace
List all tables in a namespace: list_namespace_tables 'ns1'
Drop a namespace: drop_namespace 'ns1' (the namespace must be empty)
List all tables: list
# Create a table
Create a table: create 'test', {NAME => 'cf', VERSIONS => 3, COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW'}, {SPLITS => ['1','2','3','4','5','6','7','8','9']}
Show the table schema: describe 'test'
# Alter
Alter the table schema (add or update column family cf, delete cf1): alter 'test', {NAME => 'cf'}, {NAME => 'cf1', METHOD => 'delete'}
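# Insert
Insert data: put 'test', 'row1', 'cf:a', 'value1'
# Query (LIMIT must be uppercase)
Scan table data: scan 'test', {LIMIT => 3}
Get a single cell: get 'test', 'row1', 'cf:a'
# count is slow on large tables; 100 million rows can take about half an hour
# INTERVAL is how often (in rows) the running count and the current rowkey are printed, default 1000; CACHE is the scanner cache size per fetch, default 10. Raising CACHE can speed up the count.
Count rows: count 'test', {INTERVAL => 1000000, CACHE => 100000}
# Faster: RowCounter runs as a MapReduce job (run it from the OS command line, not inside the HBase shell). A coprocessor-based count is faster still.
Count rows: hbase org.apache.hadoop.hbase.mapreduce.RowCounter '<table name>'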
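# Delete and related commands
Delete a cell: delete 'test', 'row1', 'cf:a', timestamp
Delete a whole row: deleteall 'test', 'row1'
Truncate the table: truncate 'test'
Disable a table: disable 'test'
Enable a table: enable 'test'
Drop a table: drop 'test' (the table must be disabled first)
Check whether a table exists: exists 'test'
Exit the shell: exit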
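
The commands above can also be collected in a file and run without typing them interactively. The following is a minimal sketch, assuming a POSIX shell; the helper name write_demo_commands and the output path are made up for illustration, and the generated file only reuses commands from the list above.

```shell
# Write a small HBase shell command file to the path given as $1.
# write_demo_commands is a hypothetical helper name, not part of HBase.
write_demo_commands() {
  cat > "$1" <<'EOF'
create 'test', {NAME => 'cf', VERSIONS => 3}
put 'test', 'row1', 'cf:a', 'value1'
scan 'test', {LIMIT => 3}
get 'test', 'row1', 'cf:a'
exit
EOF
}

# Generate the file; actually running it needs a working HBase install.
write_demo_commands "${TMPDIR:-/tmp}/hbase_demo_commands.txt"
# ./bin/hbase shell "${TMPDIR:-/tmp}/hbase_demo_commands.txt"
```

Passing a command file to ./bin/hbase shell runs the commands one per line; the last line exits the shell so that the session does not stay open.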

2. Snapshot operations

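# Snapshots
Create a snapshot: snapshot 'test', 'snapshot_test'
List all snapshots: list_snapshots
List the snapshots of a table: list_table_snapshots 'test'
Delete a snapshot: delete_snapshot 'snapshot_test'
Delete all snapshots of a table: delete_table_snapshots 'test'
Delete all snapshots matching a regular expression: delete_all_snapshot '.*'
Restore a snapshot: restore_snapshot 'snapshot_test' (the original table must be disabled first)
Create a new table from a snapshot: clone_snapshot 'snapshot_test', 'test_new' (no data is moved; the table schema is copied and the new table references the existing data files)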
Use ExportSnapshot to migrate snapshot data from cluster A to cluster B (-bandwidth is in MB/s):
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot snapshot_test -copy-from hdfs://server1:8082/hbase \
    -copy-to hdfs://server2:50070/hbase -mappers 16 -bandwidth 2048

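To merge several HBase tables, clone a new table from a snapshot, export the data of the other tables with Export, and then load it into the new table with Import.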
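
The snapshot and clone step of that workflow can be scripted by generating the HBase shell commands. This is only a sketch under stated assumptions: snapshot_clone_commands is a hypothetical helper, and the naming scheme <table>_snapshot_<date> is an illustration, not an HBase convention.

```shell
# Print the HBase shell commands that snapshot table $1 and clone it to $3.
# $2 is a date stamp used in the snapshot name (naming scheme is an assumption).
snapshot_clone_commands() {
  printf "snapshot '%s', '%s_snapshot_%s'\n" "$1" "$1" "$2"
  printf "clone_snapshot '%s_snapshot_%s', '%s'\n" "$1" "$2" "$3"
  printf "list_table_snapshots '%s'\n" "$1"
}

# Print the commands for table test; pipe them into ./bin/hbase shell to run them.
snapshot_clone_commands test "$(date +%Y%m%d)" test_new
```

The printed lines can be reviewed first and then fed to the HBase shell, which keeps the snapshot name and the clone target consistent.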

3. Export and Import operations

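Export HBase table data:
hbase org.apache.hadoop.hbase.mapreduce.Export \
    -Dhbase.export.scanner.batch=2000 \
    -Dmapreduce.output.fileoutputformat.compress=true \
    test \
    /hbase/test
Parameter notes:
    -Dhbase.export.scanner.batch=2000                     scanner batch size
    -Dmapreduce.output.fileoutputformat.compress=true     enable output compression
    test                                                  table name
    /hbase/test                                           output path (prefix it with file:// to export to the local filesystem)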

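Import HDFS data into an HBase table:
hbase org.apache.hadoop.hbase.mapreduce.Driver import \
    -Dimport.bulk.output=./test/outPut \
    -Dmapreduce.map.speculative=false \
    -Dmapreduce.reduce.speculative=false \
    test \
    /hbase/test/*
Here -Dimport.bulk.output is the temporary output path for bulk load, and /hbase/test/* is the input path on HDFS.
Note: with large data volumes the import can fail with RegionTooBusyException. Adding -Dimport.bulk.output=./test/outPut switches from the default put-based loading to bulk load: the job writes HFiles to that path, and they are then loaded into the table with completebulkload.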


# Export options in detail:
hbase org.apache.hadoop.hbase.mapreduce.Export
Output compression:
   -D mapreduce.output.fileoutputformat.compress=true   enable compression
   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec  compression codec
   -D mapreduce.output.fileoutputformat.compress.type=BLOCK    compression type
Scan options:
   -D hbase.mapreduce.scan.column.family=<family1>,<family2>, ...   column families to export
   -D hbase.mapreduce.include.deleted.rows=true   include deleted rows
   -D hbase.mapreduce.scan.row.start=<ROWSTART>   start rowkey
   -D hbase.mapreduce.scan.row.stop=<ROWSTOP>     stop rowkey
   -D hbase.client.scanner.caching=100            scanner caching
   -D hbase.export.visibility.labels=<labels>     visibility labels
Batching options (for tables with very wide rows):
   -D hbase.export.scanner.batch=10
   -D hbase.export.scanner.caching=100
Job name:
   -D mapreduce.job.name=jobName    use the specified MapReduce job name for the export
Speculative execution (disable it for better performance):
   -D mapreduce.map.speculative=false
   -D mapreduce.reduce.speculative=false
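
To show how these options combine, here is a rough sketch that assembles an Export command for a single rowkey range with gzip-compressed output. It only prints the command and does not run it; export_range_cmd is a hypothetical helper, and the table name, path and rowkeys in the example call are placeholders.

```shell
# Print (do not run) an Export command for one rowkey range with gzip output.
# Arguments: $1 table, $2 output dir, $3 start rowkey, $4 stop rowkey.
# export_range_cmd is a hypothetical helper name, not part of HBase.
export_range_cmd() {
  printf '%s ' \
    hbase org.apache.hadoop.hbase.mapreduce.Export \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    "-Dhbase.mapreduce.scan.row.start=$3" \
    "-Dhbase.mapreduce.scan.row.stop=$4" \
    -Dmapreduce.map.speculative=false
  # Positional arguments go last, after all -D options.
  printf '%s %s\n' "$1" "$2"
}

export_range_cmd test /hbase/test_export row1 row5
```

All -D options are placed before the table name and output path, matching the order used in the commands above.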

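# Import options in detail:
hbase org.apache.hadoop.hbase.mapreduce.Driver import test /hbase/test/*
Bulk load (write HFiles instead of issuing puts):
  -Dimport.bulk.output=/path/for/output
Large results (a result with very many cells can cause an OOM during the in-memory sort in the reducer):
  -Dimport.bulk.hasLargeResult=true
Filter the input:
  -Dimport.filter.class=<name of filter class>
  -Dimport.filter.args=<comma separated list of args for filter>
To import data exported from HBase 0.94:
  -Dhbase.import.version=0.94
Job name:
  -D mapreduce.job.name=jobName - use the specified MapReduce job name for the import
Performance tuning:
  -Dmapreduce.map.speculative=false
  -Dmapreduce.reduce.speculative=false
  -Dimport.wal.durability=<durability>   WAL durability used when writing to HBase, for example SKIP_WAL, ASYNC_WAL or SYNC_WAL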

4. CopyTable operations

hbase org.apache.hadoop.hbase.mapreduce.CopyTable
CopyTable copies part or all of a table, either within the same cluster or to another cluster.
Usage:
CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

Example:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
--starttime=1265875194289 \
--endtime=1265878794289 \
--peer.adr=server1,server2,server3:2181:/hbase \
--families=myOldCf:myNewCf,cf2,cf3 \
TestTable 
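Options:
 rs.class     hbase.regionserver.class of the peer cluster; specify it only if it differs from the current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     start rowkey
 stoprow      stop rowkey
 starttime    start of the time range (timestamp in milliseconds); without endtime, the range runs from starttime onward with no upper bound
 endtime      end of the time range; ignored if starttime is not specified
 versions     number of cell versions to copy
 new.name     name of the new table
 peer.adr     ZooKeeper address of the peer cluster, in the form hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     column families to copy; to copy from cf1 to cf2 give sourceCfName:destCfName, and if the name stays the same just give CfName
 all.cells    also copy delete markers and deleted cells
 bulkload     write the input into HFiles and bulk load them into the destination table
 tablename    name of the table to copy
-Dhbase.client.scanner.caching=100      recommended value 100; higher values use more memory but reduce round trips to the server and improve performance.
-Dmapreduce.map.speculative=false       recommended value false, so that data is not written twice, which would produce inaccurate results.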
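
The --starttime and --endtime values are timestamps in milliseconds, so a time window is plain arithmetic. The sketch below, assuming a shell with 64-bit integer arithmetic, computes the end of a window from a start timestamp and a length in hours, then prints the matching CopyTable command without running it. The helper names window_end_ms and copy_window_cmd are hypothetical.

```shell
# Return the end timestamp (ms) of a window that starts at $1 and lasts $2 hours.
window_end_ms() {
  echo $(( $1 + $2 * 3600 * 1000 ))
}

# Print (do not run) a CopyTable command for that window.
# Arguments: $1 start ms, $2 hours, $3 peer address, $4 table name.
copy_window_cmd() {
  printf 'hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=%s --endtime=%s --peer.adr=%s %s\n' \
    "$1" "$(window_end_ms "$1" "$2")" "$3" "$4"
}

# One-hour window starting at 1265875194289; the end is 1265878794289.
copy_window_cmd 1265875194289 1 server1,server2,server3:2181:/hbase TestTable
```

With a one-hour window this reproduces the --starttime and --endtime pair used in the example above.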

5. Importing TSV files with ImportTsv

ImportTsv is a tool that loads TSV data into HBase. It has two distinct uses: loading TSV data from HDFS into HBase via Puts, and preparing StoreFiles to be loaded into HBase via completebulkload.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.bulk.output=/path/for/output \
-Dimporttsv.columns=a,b,c \
<tablename> \
<inputdir> 

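The column names of the TSV data must be specified with -Dimporttsv.columns as a comma-separated list. Each entry is either a bare column family name or has the form columnfamily:qualifier.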
-Dimporttsv.columns=a,b,c <tablename> <inputdir>

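HBASE_TS_KEY            marks the column to use as the timestamp of each record (optional).
HBASE_ROW_KEY           marks the column to use as the row key of each imported record. Exactly one column must be given as the row key.
HBASE_CELL_TTL          marks the column to use as the cell time-to-live (TTL) attribute.
HBASE_CELL_VISIBILITY   marks the column that contains the visibility label expression.
HBASE_ATTRIBUTES_KEY    can be used to specify operation attributes per record, written as key=>value with -1 as the separator.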

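Enable bulk load for batch loading:
  -Dimporttsv.bulk.output=/path/for/output
Other optional parameters:
  -Dimporttsv.dry.run=true - dry run mode; no data is written to the table. If the table does not exist it is created, then deleted at the end
  -Dimporttsv.skip.bad.lines=false - fail if an invalid line is encountered
  -Dimporttsv.log.bad.lines=true - log invalid lines to stderr
  -Dimporttsv.skip.empty.columns=false - if set to true, empty columns are skipped during bulk import
  '-Dimporttsv.separator=|' - use a pipe as the separator instead of a tab (quote the option so the OS shell does not interpret the |)
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - a user-defined Mapper class
  -Dmapreduce.job.name=jobName - use the specified MapReduce job name for the import
  -Dcreate.table=no - do not create the table; with no, the target table must already exist
  -Dno.strict=true - skip the column family check against the HBase table; the default is false
Performance tuning:
  -Dmapreduce.map.speculative=false      speculative execution of map tasks
  -Dmapreduce.reduce.speculative=false   speculative execution of reduce tasks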
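
A common mistake is a -Dimporttsv.columns mapping whose number of entries differs from the number of fields in the TSV file. The following sketch, assuming a POSIX shell with awk, writes a tiny sample file and checks it against a mapping before any import is attempted. The helper names, the sample values and the mapping HBASE_ROW_KEY,cf:a,cf:b are illustrative assumptions.

```shell
# Write a two-row, three-column TSV sample to the file given as $1.
make_sample_tsv() {
  printf 'row1\tvalue-a1\tvalue-b1\n' >  "$1"
  printf 'row2\tvalue-a2\tvalue-b2\n' >> "$1"
}

# Count the tab-separated fields on the first line of file $1.
tsv_field_count() {
  awk -F '\t' 'NR == 1 { print NF }' "$1"
}

# Count the comma-separated entries in the columns mapping string $1.
mapping_count() {
  printf '%s\n' "$1" | awk -F ',' '{ print NF }'
}

sample="${TMPDIR:-/tmp}/importtsv_sample.tsv"
columns="HBASE_ROW_KEY,cf:a,cf:b"
make_sample_tsv "$sample"
if [ "$(tsv_field_count "$sample")" = "$(mapping_count "$columns")" ]; then
  echo "mapping matches: -Dimporttsv.columns=$columns"
else
  echo "mapping does not match the file" >&2
fi
```

The check only compares counts on the first line; it does not validate that the column families exist in the target table.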

6. LoadIncrementalHFiles operations

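The completebulkload tool moves generated StoreFiles into an HBase table. It is used together with ImportTsv: when ImportTsv writes its output with -Dimporttsv.bulk.output, use this tool to load the files into the HBase table:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
Optional parameters (pass them before the path):
 -Dcreate.table=no - do not create the table
 -Dignore.unmatched.families=yes - ignore column families that do not match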
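
The two steps, ImportTsv with bulk output followed by LoadIncrementalHFiles, share the HFile directory and the table name. As a sketch only, the hypothetical helper bulk_load_steps below prints both commands from one set of arguments so the two stay consistent; it does not run anything, and the paths in the example call are placeholders.

```shell
# Print (do not run) the two commands of a bulk load.
# Arguments: $1 table, $2 TSV input dir, $3 HFile output dir, $4 columns mapping.
# bulk_load_steps is a hypothetical helper name, not part of HBase.
bulk_load_steps() {
  # Step 1: ImportTsv writes HFiles to $3 instead of issuing Puts.
  printf 'hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.bulk.output=%s -Dimporttsv.columns=%s %s %s\n' \
    "$3" "$4" "$1" "$2"
  # Step 2: LoadIncrementalHFiles moves those HFiles into the table.
  printf 'hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles %s %s\n' "$3" "$1"
}

bulk_load_steps test /input/tsv /tmp/hfiles HBASE_ROW_KEY,cf:a,cf:b
```

The second command should only be run after the first job has finished successfully, since it consumes the files that job produces.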
