Table of Contents
- Background
- Test Conditions
- Conclusion
- Code
- 1.PutList
- 2.saveAsNewAPIHadoopDataset
- 3.BulkLoad
- Problems Encountered During Testing
- 1.Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
- 2.java.io.IOException: java.io.IOException: Wrong FS:
- 3.Call exception, tries=10, retries=35, started=48972 ms ago, cancelled=false, msg=row 'xxxxx…
Background
A recent project needed real-time writes into HBase, and I got stuck choosing a write method. Everyone says BulkLoad is the best way to write, but that advice targets very large data volumes; is it still fast when the data is small? On top of that, some articles online claim PutList is faster than saveAsNewAPIHadoopDataset, while others claim the opposite. What you learn on paper is always shallow; to really know, you have to try it yourself. So I did.
Test Conditions
1. Two data sizes were tested: 100,000 rows and 1,000,000 rows
2. Each method was run three times
3. All methods write to the same HBase table (a table-creation sketch follows this list)
4. HBase 1.2.0; Hadoop 2.6.0; Spark 2.3.0
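For completeness, here is a minimal sketch of how the shared test table could be created with the HBase 1.x Admin API. The table name (test_speed) and column family (cf) are taken from the code below; the ZooKeeper quorum is a placeholder.

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CreateTestTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val conn = ConnectionFactory.createConnection(conf)
    val admin = conn.getAdmin
    val desc = new HTableDescriptor(TableName.valueOf("test_speed"))
    desc.addFamily(new HColumnDescriptor("cf"))
    // Only create the table if it does not already exist
    if (!admin.tableExists(desc.getTableName)) admin.createTable(desc)
    admin.close()
    conn.close()
  }
}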
Conclusion
The results first:
At 100,000 rows, PutList is the fastest method, followed by BulkLoad, with saveAsNewAPIHadoopDataset last.
At 1,000,000 rows, BulkLoad's advantage clearly shows, and saveAsNewAPIHadoopDataset also overtakes PutList. In other words, saveAsNewAPIHadoopDataset was much slower than PutList at 100,000 rows, but pulls ahead of it at 1,000,000 rows.
So:
1. For large data volumes, use BulkLoad;
2. For small data volumes, Put/PutList gives the best return on effort;
3. For data volumes in between, pick whichever method fits your business scenario and is easiest to implement.
Code
1.PutList
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client._
import java.util.{ArrayList, List, Random}
import java.text.DecimalFormat

object PutList {
  val df2: DecimalFormat = new DecimalFormat("00")

  def main(args: Array[String]): Unit = {
    val tableName: String = "test_speed"
    val conn = getConn()
    val putList = getPutList()
    val start: Long = System.currentTimeMillis
    insertBatchData(conn, tableName, putList)
    val end: Long = System.currentTimeMillis
    System.out.println("Elapsed: " + (end - start) + " ms")
  }

  // Build an HBase Connection from the ZooKeeper quorum (placeholder address)
  def getConn(): Connection = {
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    ConnectionFactory.createConnection(conf)
  }

  // Write the whole list of Puts with a single batched put() call
  def insertBatchData(conn: Connection, tableName: String, puts: List[Put]) = try {
    val tableNameObj = TableName.valueOf(tableName)
    val table = conn.getTable(tableNameObj)
    table.put(puts)
    table.close()
  } catch {
    case e: Exception =>
      e.printStackTrace()
  }

  // Generate the test data: rowkey = two random digits + the sequence number
  def getPutList(): List[Put] = {
    val random: Random = new Random
    val putlist = new ArrayList[Put]()
    for (i <- 0 until 1000000) {
      val rowkey: String = df2.format(random.nextInt(99)) + i
      val put: Put = new Put(rowkey.getBytes)
      put.addColumn("cf".getBytes, "field".getBytes, "a".getBytes)
      putlist.add(put)
    }
    putlist
  }
}
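Note that insertBatchData pushes all one million Puts in a single table.put call, which is what was measured above. If client memory or RPC size ever becomes a concern, the same list can be flushed in smaller chunks; a minimal sketch (the batchSize parameter is an arbitrary choice, not something tested here):

// Variant of insertBatchData that writes the Put list in fixed-size batches
def insertInBatches(conn: Connection, tableName: String, puts: List[Put], batchSize: Int): Unit = {
  val table = conn.getTable(TableName.valueOf(tableName))
  try {
    var i = 0
    while (i < puts.size()) {
      val end = math.min(i + batchSize, puts.size())
      table.put(puts.subList(i, end)) // subList is a view, so no extra copy is made
      i = end
    }
  } finally {
    table.close()
  }
}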
2.saveAsNewAPIHadoopDataset
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.client._
import java.text.DecimalFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.io.Text
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer

object newApi {
  val tableName = "test_speed"
  val cf = "cf"
  val num = 1000000
  val df2 = new DecimalFormat("00000000")

  def main(args: Array[String]): Unit = {
    val sc = getSparkSession().sparkContext
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    // Target table for TableOutputFormat
    conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    val jobConf = new Configuration(conf)
    // Set the output format class so saveAsNewAPIHadoopDataset writes through TableOutputFormat
    jobConf.set("mapreduce.job.outputformat.class", classOf[TableOutputFormat[Text]].getName)

    var list = ListBuffer[Put]()
    println("Preparing data...")
    for (i <- 0 to num) {
      val put = new Put(df2.format(i).getBytes())
      put.addColumn(cf.getBytes(), "field".getBytes(), "abc".getBytes())
      list.append(put)
    }
    println("Data ready!")

    val data = sc.makeRDD(list.toList).map(x => {
      (new ImmutableBytesWritable, x)
    })
    val start = System.currentTimeMillis()
    data.saveAsNewAPIHadoopDataset(jobConf)
    val end = System.currentTimeMillis()
    println("Write time: " + (end - start) + " ms")
    sc.stop()
  }

  def getSparkSession(): SparkSession = {
    SparkSession.builder()
      .appName("SparkToHbase")
      //.master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
  }
}
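For reference, "mapreduce.job.outputformat.class" is the property name Hadoop's Job class uses internally for the output format, so the same configuration can be built through the Job API instead of setting the string key by hand. A sketch, reusing conf and data from the code above:

import org.apache.hadoop.mapreduce.Job

// Equivalent setup via the Job API; setOutputFormatClass writes the same
// "mapreduce.job.outputformat.class" property under the hood
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
data.saveAsNewAPIHadoopDataset(job.getConfiguration)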
3.BulkLoad
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable, Table}
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object bulkLoad {
  def main(args: Array[String]): Unit = {
    val sc = getSparkSession().sparkContext
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val file1 = "/user/xxx/xxxx"
    // The source file has some bad records that need to be filtered out,
    // but the total row count is kept the same as in the other tests
    val source = sc.textFile(file1).filter { x =>
      // keep only lines that split into the four expected fields: rowkey,cf,column,value
      val splited = x.trim.split(",")
      try {
        val rowkey = splited(0)
        val cf = splited(1)
        val clomn = splited(2)
        val value = splited(3)
        true
      } catch {
        case e: Throwable =>
          e.printStackTrace()
          false
      }
    }
    val source1 = source.map(x => {
      val splited = x.trim.split(",")
      val rowkey = splited(0)
      val cf = splited(1)
      val clomn = splited(2)
      val value = splited(3)
      (rowkey, cf, clomn, value)
    })
    val rdd = source1.map(x => {
      // Convert the tuples into the form HFileOutputFormat2 expects: the key is an
      // ImmutableBytesWritable wrapping the rowkey and the value is a KeyValue.
      // Note that HFiles must be written in lexicographic rowkey order.
      val rowKey = x._1
      val family = x._2
      val colum = x._3
      val value = x._4
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
        new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes(family), Bytes.toBytes(colum), Bytes.toBytes(value)))
    })
    // Temporary path where the generated HFiles are written
    val stagingFolder = "/user/xxxx/temp"
    // Loader that will import the generated HFiles into HBase; everything below is plain HBase API
    val load = new LoadIncrementalHFiles(conf)
    // HBase table name
    val tableName = "test_speed"
    // Create the HBase connection; the configuration determines which cluster it talks to
    val conn = ConnectionFactory.createConnection(conf)
    // Get the table by name
    val table: Table = conn.getTable(TableName.valueOf(tableName))
    try {
      // Create a Hadoop MapReduce job
      val job = Job.getInstance(conf)
      // Set the job name
      job.setJobName("DumpFile")
      // This part matters: because we are generating HFiles,
      // the map output key must be ImmutableBytesWritable
      job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
      // and the map output value must be KeyValue
      job.setMapOutputValueClass(classOf[KeyValue])
      // Configure HFileOutputFormat2 for the target table
      HFileOutputFormat2.configureIncrementalLoadMap(job, table)
      // Write the HFiles to the staging directory, using the job's configuration
      // so the settings from configureIncrementalLoadMap are picked up
      rdd.saveAsNewAPIHadoopFile(stagingFolder,
        classOf[ImmutableBytesWritable],
        classOf[KeyValue],
        classOf[HFileOutputFormat2],
        job.getConfiguration)
      // After this step the generated HFiles are in stagingFolder.
      // Start the bulk load.
      val start = System.currentTimeMillis()
      load.doBulkLoad(new Path(stagingFolder), table.asInstanceOf[HTable])
      val end = System.currentTimeMillis()
      println("Elapsed: " + (end - start) + " ms!")
    } finally {
      table.close()
      conn.close()
    }
  }

  def getSparkSession(): SparkSession = {
    SparkSession.builder()
      .appName("SparkToHbase")
      //.master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
  }
}
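One more note on the BulkLoad path: HFileOutputFormat2 writes HFiles, and HFiles require the records in each output file to be in lexicographic rowkey order, otherwise the write fails with an error like "Added a key not lexically larger than previous". If the source data is not already sorted by rowkey, a sortBy before building the KeyValues takes care of it. A minimal sketch, reusing source1 from the code above:

// Sort by rowkey first, since HFiles must be written in lexicographic key order
val sortedRdd = source1.sortBy(_._1).map { case (rowkey, cf, clomn, value) =>
  (new ImmutableBytesWritable(Bytes.toBytes(rowkey)),
    new KeyValue(Bytes.toBytes(rowkey), Bytes.toBytes(cf), Bytes.toBytes(clomn), Bytes.toBytes(value)))
}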
Problems Encountered During Testing
1.Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
Cause: the source file used for the BulkLoad test was not clean and contained empty lines, so saveAsNewAPIHadoopFile complained about an empty string.
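Filtering the empty lines out of the source RDD before anything else touches them avoids this; a one-line sketch:

// Drop empty or whitespace-only lines from the input before the split/filter step
val source = sc.textFile(file1).filter(line => line.trim.nonEmpty)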
2.java.io.IOException: java.io.IOException: Wrong FS:
Cause: this error appears when running the BulkLoad job on Windows; see:
http://blog.cheyo.net/99.html
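I have not reproduced the fix from that post, but a Wrong FS error generally means the path's filesystem does not match fs.defaultFS. A hypothetical workaround is to make the filesystem explicit (the namenode host and port below are placeholders, not values from this test):

// Point the job at HDFS explicitly so a local default filesystem on Windows
// does not clash with the HDFS staging path (placeholder namenode address)
conf.set("fs.defaultFS", "hdfs://namenode:8020")
val stagingFolder = "hdfs://namenode:8020/user/xxxx/temp"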
3.Call exception, tries=10, retries=35, started=48972 ms ago, cancelled=false, msg=row 'xxxxx…