Table of Contents

  • Background
  • Test Conditions
  • Conclusion
  • Code
  • 1.PutList
  • 2.saveAsNewAPIHadoopDataset
  • 3.BulkLoad
  • Summary of Problems Encountered During Testing
  • 1.Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
  • 2.java.io.IOException: java.io.IOException: Wrong FS:
  • 3.Call exception, tries=10, retries=35, started=48972 ms ago, cancelled=false, msg=row 'xxxxx…


Background

Our project recently needed real-time writes to HBase, and I got stuck choosing a write method: which one should I use? Everyone says BulkLoad is the best way to write, but that claim targets very large data volumes; is it still fast when the data volume is small? On top of that, some articles online say PutList is faster than saveAsNewAPIHadoopDataset, while others say the opposite. Since what you learn on paper always feels shallow and real understanding only comes from doing, I decided to test it myself.

Test Conditions

1. Test data volumes of 100,000 (10W) and 1,000,000 (100W) rows
2. Each method was tested three times
3. All writes go to the same HBase table (a table-creation sketch follows this list)
4. HBase 1.2.0; Hadoop 2.6.0; Spark 2.3.0
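
All three approaches write to the same table with a single column family cf. For reference only, a minimal sketch of creating such a table with the HBase 1.2 Admin API might look like the following (the ZooKeeper quorum is a placeholder, just like in the test code below):

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CreateTestTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")                // placeholder quorum
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val conn = ConnectionFactory.createConnection(conf)
    val admin = conn.getAdmin
    val table = TableName.valueOf("test_speed")
    if (!admin.tableExists(table)) {
      // one column family "cf", matching the writers below
      val desc = new HTableDescriptor(table)
      desc.addFamily(new HColumnDescriptor("cf"))
      admin.createTable(desc)
    }
    admin.close()
    conn.close()
  }
}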

Conclusion

The conclusions first:

[Charts: HBase batch write performance comparison results for the 100,000-row and 1,000,000-row runs]


We can see that with 100,000 rows, PutList is the fastest method, followed by BulkLoad, with saveAsNewAPIHadoopDataset last.

With 1,000,000 rows, BulkLoad's advantage clearly shows, and saveAsNewAPIHadoopDataset also pulls ahead of PutList. At 100,000 rows saveAsNewAPIHadoopDataset was much slower than PutList, but at 1,000,000 rows it overtakes it.

So:
1. For large data volumes, use BulkLoad;
2. For small data volumes, Put/PutList offers the better cost/benefit;
3. For anything in between, choose the method that fits your business scenario and is easiest to implement.

Code

1.PutList

import java.text.DecimalFormat
import java.util.{ArrayList, List, Random}

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client._

object PutList {
  val df2: DecimalFormat = new DecimalFormat("00")

  def main(args: Array[String]): Unit = {
    val tableName: String = "test_speed"
    val conn = getConn()
    val putList = getPutList()
    // time only the write itself, not the data preparation
    val start: Long = System.currentTimeMillis
    insertBatchData(conn, tableName, putList)
    val end: Long = System.currentTimeMillis
    System.out.println("Elapsed time: " + (end - start) + " ms")
    conn.close()
  }

  def getConn(): Connection = {
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    ConnectionFactory.createConnection(conf)
  }

  def insertBatchData(conn: Connection, tableName: String, puts: List[Put]) = try {
    val tableNameObj = TableName.valueOf(tableName)
    val table = conn.getTable(tableNameObj)
    // a single table.put(list) call sends the whole batch of Puts
    table.put(puts)
    table.close()
  } catch {
    case e: Exception =>
      e.printStackTrace()
  }


  def getPutList(): List[Put] = {
    val random: Random = new Random
    val putlist = new ArrayList[Put]()
    // 1,000,000 rows here; use 100,000 for the smaller test run
    for (i <- 0 until 1000000) {
      // random two-digit prefix to spread the rowkeys
      val rowkey: String = df2.format(random.nextInt(99)) + i
      val put: Put = new Put(rowkey.getBytes)
      put.addColumn("cf".getBytes, "field".getBytes, "a".getBytes)
      putlist.add(put)
    }
    putlist
  }
}
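
One caveat with the PutList approach: sending all the Puts in a single table.put call keeps everything in client memory and pushes it out as one huge batch. A hedged variant (not part of the timed test above; the batch size of 10,000 is only illustrative) that writes the same list in fixed-size chunks:

import java.util.{List => JList}
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put}
import scala.collection.JavaConverters._

object ChunkedPut {
  // write the puts in chunks of `batchSize` instead of one giant batch
  def insertInChunks(conn: Connection, tableName: String, puts: JList[Put], batchSize: Int = 10000): Unit = {
    val table = conn.getTable(TableName.valueOf(tableName))
    try {
      puts.asScala.grouped(batchSize).foreach { chunk =>
        table.put(chunk.asJava)   // each chunk is one put RPC batch
      }
    } finally {
      table.close()
    }
  }
}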

2.saveAsNewAPIHadoopDataset

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.client._
import java.text.DecimalFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.io.Text
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer

object newApi {
  val tableName = "test_speed"
  val cf = "cf"
  val num=1000000
  val df2 = new DecimalFormat("00000000")
  def main(args: Array[String]) = {
    val sc = getSparkSession().sparkContext
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    // target table name
    conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

    val jobConf = new Configuration(conf)
    // the new-API OutputFormat class that saveAsNewAPIHadoopDataset will instantiate
    jobConf.set("mapreduce.job.outputformat.class", classOf[TableOutputFormat[Text]].getName)

    var list = ListBuffer[Put]()
    println("Preparing data...")
    // generate exactly `num` rows
    for (i <- 0 until num) {
      val put = new Put(df2.format(i).getBytes())
      put.addColumn(cf.getBytes(), "field".getBytes(), "abc".getBytes())
      list.append(put)
    }
    println("Data preparation finished!")

    // TableOutputFormat ignores the key, so an empty ImmutableBytesWritable is enough
    val data = sc.makeRDD(list.toList).map(x => {
      (new ImmutableBytesWritable, x)
    })
    val start = System.currentTimeMillis()

    data.saveAsNewAPIHadoopDataset(jobConf)
    val end = System.currentTimeMillis()
    println("Write time: " + (end - start) + " ms")
    sc.stop()

  }

  def getSparkSession(): SparkSession = {
    SparkSession.builder().
      appName("SparkToHbase").
      //master("local[*]").
      config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      getOrCreate()
  }
}
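
Setting the output format through the raw "mapreduce.job.outputformat.class" key works, but a more commonly seen pattern is to configure a Hadoop Job and pass its configuration to saveAsNewAPIHadoopDataset. A sketch of that variant, assuming the same placeholder quorum and test_speed table as above:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

object NewApiJobConfig {
  // configure the output through a Hadoop Job instead of raw config keys
  def save(data: RDD[(ImmutableBytesWritable, Put)]): Unit = {
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")                // placeholder
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set(TableOutputFormat.OUTPUT_TABLE, "test_speed")

    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Put])

    data.saveAsNewAPIHadoopDataset(job.getConfiguration)
  }
}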

3.BulkLoad

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable, Table}
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object bulkLoad {
  def main(args: Array[String]): Unit = {

    val sc = getSparkSession().sparkContext
    val conf = HBaseConfiguration.create
    conf.set("hbase.zookeeper.quorum", "xxx")
    conf.set("hbase.zookeeper.property.clientPort", "2181")

    val file1 = "/user/xxx/xxxx"

    // the source file contains some malformed lines, so filter them out;
    // the number of valid rows stays the same as in the other tests
    val source = sc.textFile(file1).filter { x =>
      val splited = x.trim.split(",")
      try {
        // accessing all four fields throws for a malformed line
        val rowkey = splited(0)
        val cf = splited(1)
        val column = splited(2)
        val value = splited(3)
        true
      } catch {
        case e: Throwable =>
          e.printStackTrace()
          false
      }
    }

    val source1 = source.map(x => {
      val splited = x.trim.split(",")
      val rowkey = splited(0)
      val cf = splited(1)
      val column = splited(2)
      val value = splited(3)
      (rowkey, cf, column, value)
    })

    val rdd = source1.map(x => {
      // convert the RDD into the shape HFile generation expects:
      // the key is an ImmutableBytesWritable wrapping the rowkey,
      // the value is a KeyValue(rowkey, family, qualifier, value)
      val rowKey = x._1
      val family = x._2
      val column = x._3
      val value = x._4
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes(family), Bytes.toBytes(column), Bytes.toBytes(value)))
    })
    // temporary HDFS directory where the generated HFiles are staged
    val stagingFolder = "/user/xxxx/temp"
    // loader that moves the staged HFiles into HBase; from here on it is all HBase API
    val load = new LoadIncrementalHFiles(conf)
    // HBase table name
    val tableName = "test_speed"
    // create the HBase connection; the default configuration resolves the HBase master address
    val conn = ConnectionFactory.createConnection(conf)
    // get the table by name
    val table: Table = conn.getTable(TableName.valueOf(tableName))
    try {
      // create a Hadoop MapReduce job
      val job = Job.getInstance(conf)
      // set the job name
      job.setJobName("DumpFile")
      // this part matters most: since we are generating HFiles,
      // the map output key must be ImmutableBytesWritable
      job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
      // and the map output value must be KeyValue
      job.setMapOutputValueClass(classOf[KeyValue])
      // configure HFileOutputFormat2 against the target table
      HFileOutputFormat2.configureIncrementalLoadMap(job, table)
      // write the HFiles to the staging directory
      // (note: HFile generation requires keys in sorted order, so the rowkeys
      //  in the source file are assumed to be pre-sorted)
      rdd.saveAsNewAPIHadoopFile(stagingFolder,
        classOf[ImmutableBytesWritable],
        classOf[KeyValue],
        classOf[HFileOutputFormat2],
        conf)
      // at this point the staging folder holds the generated HFiles; start the bulk load
      val start = System.currentTimeMillis()
      load.doBulkLoad(new Path(stagingFolder), table.asInstanceOf[HTable])
      val end = System.currentTimeMillis()
      println("Elapsed time: " + (end - start) + " ms")
    } finally {
      table.close()
      conn.close()
    }
  }
  def getSparkSession(): SparkSession = {
    SparkSession.builder().
      appName("SparkToHbase").
      //master("local[*]").
      config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      getOrCreate()
  }
}
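
One point worth calling out for the BulkLoad path: if the input rowkeys are not already sorted, the HFile writer typically fails with an error about a key not being lexically larger than the previous key. A hedged sketch of the extra sort step (operating on the same (rowkey, cf, column, value) tuples as source1 above) before saveAsNewAPIHadoopFile:

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object SortBeforeHFile {
  // sort the tuples by rowkey, then build the ImmutableBytesWritable/KeyValue
  // pairs that the HFile output step expects
  def toSortedHFileRdd(source: RDD[(String, String, String, String)]): RDD[(ImmutableBytesWritable, KeyValue)] = {
    source
      .sortBy(_._1)   // HFiles must be written in rowkey order
      .map { case (rowKey, family, qualifier, value) =>
        (new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
          new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value)))
      }
  }
}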

Summary of Problems Encountered During Testing

1.Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string

Cause: the source file used for the BulkLoad test had issues: it contained blank lines, so the saveAsNewAPIHadoopFile step complained about an empty string.
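
A simple guard (not in the original test code) is to drop blank or incomplete lines before parsing; a sketch:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SourceFileGuard {
  // hypothetical helper: read the source file and drop blank or incomplete lines
  // before the BulkLoad parsing step
  def readValidLines(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)
      .filter(_.trim.nonEmpty)                  // skip blank lines entirely
      .filter(_.trim.split(",").length >= 4)    // require rowkey, cf, column, value
}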

2.java.io.IOException: java.io.IOException: Wrong FS:

Cause: this error appears when running BulkLoad on Windows; see:
http://blog.cheyo.net/99.html
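
As I understand it, the usual workaround is to make the filesystem in the path match the one Hadoop expects, either by setting fs.defaultFS explicitly or by using a fully qualified HDFS URI; in the fragment below, conf is the HBaseConfiguration from the BulkLoad code above and hdfs://namenode:8020 is a placeholder for your own namenode:

// placeholder namenode address; replace with your cluster's fs.defaultFS value
conf.set("fs.defaultFS", "hdfs://namenode:8020")

// alternatively, reference the staging directory with a fully qualified HDFS URI
val stagingFolder = "hdfs://namenode:8020/user/xxxx/temp"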

3.Call exception, tries=10, retries=35, started=48972 ms ago, cancelled=false, msg=row 'xxxxx…