The native write path
Incoming data --> HBase memory (MemStore) --> flushed to StoreFile (HFile on HDFS) --> Regions split into more Regions as they grow
Problems with the native write path
- Write throughput is relatively low
- With a large volume of data, the writes occupy HBase's bandwidth for a long time; if there are also heavy reads at the same time, those reads can become extremely slow because little bandwidth is left for reading data
- The load on HBase rises sharply: constant flushing, constant compaction, constant region splitting
HBase BulkLoad
When to use it
Suited to scenarios where a large amount of data needs to be written in one batch
Workflow:
- Convert the batch of data into the file format HBase can read natively (HFile), typically with a MapReduce job
- Place the generated HFiles directly into the table's data directory on HDFS (see the path sketch just below)
Advantages:
The load bypasses the normal HBase write path, so it puts almost no pressure on HBase, does not consume HBase's bandwidth, and is considerably more efficient
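What "the table's data directory on HDFS" looks like concretely: with the default hbase.rootdir of /hbase, a table's files live in a per-region, per-column-family directory, so after a successful bulk load the HFiles end up under a path of roughly this shape (the encoded region name differs per cluster and is shown here only as a placeholder):
/hbase/data/default/TRANSFER_RECORD/<encoded-region-name>/C1/<hfile>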
Case study:
Requirement:
Load the bank's transfer records into HBase; since a large amount of data is loaded in one go, BulkLoad is required
Preparation:
Test data file
Link: https://pan.baidu.com/s/1OG0dMwg3ATQY2sk7FuCdUQ?pwd=wzat
Extraction code: wzat
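The exact layout of the file is not spelled out here, but judging from the Mapper below, each line is expected to be a comma-separated record with 13 fields, the first one being the rowkey (the field names simply mirror the column qualifiers used in the Mapper):
rowkey,code,rec_account,rec_bank_name,rec_name,pay_account,pay_name,pay_comments,pay_channel,pay_way,status,timestamp,money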
- Create the target table in HBase
create 'TRANSFER_RECORD','C1'
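Optionally, for a truly large one-time load, the table can be pre-split at creation time: HFileOutputFormat2.configureIncrementalLoad (used in the driver below) creates one reduce task, and therefore one set of HFiles, per region, so a single-region table means a single reducer. A hedged example; the split keys here are made up and should be chosen to match the real rowkey distribution:
create 'TRANSFER_RECORD','C1', SPLITS => ['2','4','6','8']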
- Create a project in IDEA and add the required dependencies with Maven
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <releases><enabled>true</enabled></releases>
        <snapshots>
            <enabled>false</enabled>
            <updatePolicy>never</updatePolicy>
        </snapshots>
    </repository>
</repositories>
<dependencies>
    <!-- HBase dependencies -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-mapreduce</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- Hadoop dependencies -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-auth</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.5</version>
    </dependency>
</dependencies>
- Write the Mapper class
package com.itheima.hbase.bulkload;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    ImmutableBytesWritable k2 = new ImmutableBytesWritable();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (!"".equals(line)) {
            String[] fields = line.split(",");
            // The first field is the rowkey; it is used both as the output key and as the Put's row
            k2.set(fields[0].getBytes());
            // Key step: convert each record into a Put object
            Put v2 = new Put(fields[0].getBytes());
            v2.addColumn("C1".getBytes(), "code".getBytes(), fields[1].getBytes());
            v2.addColumn("C1".getBytes(), "rec_account".getBytes(), fields[2].getBytes());
            v2.addColumn("C1".getBytes(), "rec_bank_name".getBytes(), fields[3].getBytes());
            v2.addColumn("C1".getBytes(), "rec_name".getBytes(), fields[4].getBytes());
            v2.addColumn("C1".getBytes(), "pay_account".getBytes(), fields[5].getBytes());
            v2.addColumn("C1".getBytes(), "pay_name".getBytes(), fields[6].getBytes());
            v2.addColumn("C1".getBytes(), "pay_comments".getBytes(), fields[7].getBytes());
            v2.addColumn("C1".getBytes(), "pay_channel".getBytes(), fields[8].getBytes());
            v2.addColumn("C1".getBytes(), "pay_way".getBytes(), fields[9].getBytes());
            v2.addColumn("C1".getBytes(), "status".getBytes(), fields[10].getBytes());
            v2.addColumn("C1".getBytes(), "timestamp".getBytes(), fields[11].getBytes());
            v2.addColumn("C1".getBytes(), "money".getBytes(), fields[12].getBytes());
            context.write(k2, v2);
        }
    }
}
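The Mapper above assumes every line is well formed. If the input file may contain empty or truncated records, a hedged variant of the same Mapper (the class name SafeBulkLoadMapper and the counter names are illustrative, not part of the original code) can guard on the field count and use org.apache.hadoop.hbase.util.Bytes for the byte conversion:
package com.itheima.hbase.bulkload;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class SafeBulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] CF = Bytes.toBytes("C1");
    // Column qualifiers in the same order as the fields that follow the rowkey in each record
    private static final String[] QUALIFIERS = {
            "code", "rec_account", "rec_bank_name", "rec_name", "pay_account", "pay_name",
            "pay_comments", "pay_channel", "pay_way", "status", "timestamp", "money"};
    private final ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            return;
        }
        String[] fields = line.split(",");
        // Skip and count records that do not contain a rowkey plus 12 columns
        if (fields.length < QUALIFIERS.length + 1) {
            context.getCounter("BulkLoad", "MALFORMED_LINES").increment(1);
            return;
        }
        rowKey.set(Bytes.toBytes(fields[0]));
        Put put = new Put(Bytes.toBytes(fields[0]));
        for (int i = 0; i < QUALIFIERS.length; i++) {
            put.addColumn(CF, Bytes.toBytes(QUALIFIERS[i]), Bytes.toBytes(fields[i + 1]));
        }
        context.write(rowKey, put);
    }
}
Functionally it emits the same key/value pairs as BulkLoadMapper; malformed lines are skipped and counted instead of failing the task.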
- Write the Driver class
package com.itheima.hbase.bulkload;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        System.out.println("Hello BulkLoad!");
        // Configure the ZooKeeper quorum of the HBase cluster
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        Job job = Job.getInstance(conf, "BulkLoad");
        job.setJarByClass(BulkLoadDriver.class);
        // Configure the input path
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://node1:8020/hbase/input/bank_record.csv"));
        // Configure the Mapper class and its output types
        job.setMapperClass(BulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        // No custom reducer is written here; configureIncrementalLoad below installs its own
        // sorting reducer and sets one reduce task per region of the target table
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        // Configure the output format: write HFiles instead of sending data through HBase
        job.setOutputFormatClass(HFileOutputFormat2.class);
        // Give HFileOutputFormat2 the table and region information so the generated HFiles
        // match the table's column families and region boundaries
        Connection hbaseConn = ConnectionFactory.createConnection(conf);
        TableName tableName = TableName.valueOf("TRANSFER_RECORD");
        Table table = hbaseConn.getTable(tableName);
        HFileOutputFormat2.configureIncrementalLoad(job, table, hbaseConn.getRegionLocator(tableName));
        // Configure the output path for the HFiles
        HFileOutputFormat2.setOutputPath(job, new Path("hdfs://node1:8020/hbase/output"));
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}
- Run the program and check on HDFS that HFiles were produced (HFileOutputFormat2 writes one subdirectory per column family under the output path, e.g. /hbase/output/C1)
- Load the HFile files into HBase (the same step can also be triggered from Java; see the sketch at the end of this section)
- Syntax:
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <MR output path> <HBase table name>
- Example:
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles hdfs://node1.itcast.cn:8020/hbase/output TRANSFER_RECORD
- Verify with a query
hbase shell
count 'TRANSFER_RECORD'
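Besides the command-line tool, the completed HFiles can also be loaded from Java. A minimal sketch, assuming the HBase 2.1 client API; the class name BulkLoadHFilesDriver is illustrative, and the ZooKeeper quorum and output path simply reuse the values from the driver above:
package com.itheima.hbase.bulkload;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;
public class BulkLoadHFilesDriver {
    public static void main(String[] args) throws Exception {
        // Same ZooKeeper quorum as in BulkLoadDriver
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        TableName tableName = TableName.valueOf("TRANSFER_RECORD");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {
            // The directory written by HFileOutputFormat2 in the MapReduce job
            Path hfileDir = new Path("hdfs://node1:8020/hbase/output");
            // Moves the HFiles into the regions of TRANSFER_RECORD, same effect as the shell command above
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}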