Importing MySQL Data into HBase with Spark

Background: storing large-scale enterprise data in HBase.

The project needs to initialize data into a table frequently and quickly, so how to load the data, and how to speed the load up, are the problems to solve.

Generally speaking, a data storage system splits into two parts: retrieval and storage.

Retrieval exposes the data query interfaces.

Storage must, on the one hand, persist data to a storage medium (disk, memory, etc.) according to fixed rules; on the other hand, it should expose a bulk-load tool. DB2's db2load utility, for example, improves write speed significantly once logging is turned off.
HBase provides a bulk-import facility for loading data into a table:

1. HBase's LoadIncrementalHFiles supports writing HFile files into HBase.
2. The HFiles to be written must be sorted (by rowKey, column family, column).
3. The key point is bypassing the HBase RegionServer and writing HBase's files directly.
4. Spark RDD's repartitionAndSortWithinPartitions method can partition and sort efficiently.
5. Java's util.TreeMap (like the Scala immutable TreeMap used below) is a red-black tree implementation and meets the sorting requirement well.
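As a small illustration of the last point (a sketch, not part of the original project; the object name is made up), Scala's immutable TreeMap, which the code below actually uses, always iterates its keys in sorted order, which is exactly the qualifier ordering HFiles require:

```scala
import scala.collection.immutable.TreeMap

object TreeMapOrderDemo {
  def main(args: Array[String]): Unit = {
    // keys are inserted in no particular order
    val qualifiers = TreeMap(
      ("log_time", 10),
      ("id", 0),
      ("remote_ip", 2)
    )
    // iteration always yields keys in lexicographic order,
    // which is the qualifier order HFiles require
    println(qualifiers.keys.mkString(","))  // id,log_time,remote_ip
  }
}
```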
The application development workflow is as follows:

1. Build a util.TreeMap keyed by the data's key values, so the data matches HBase's sort order.
2. Convert the data to an RDD and partition and sort it by key with repartitionAndSortWithinPartitions (the example below uses sortByKey, which also yields a total order).
3. Call the RDD's saveAsNewAPIHadoopFile operator to generate the HFile files.
4. Call HBase's LoadIncrementalHFiles to load the HFiles into the HBase table.
1. Wrap the field information of the different HBase tables in an object:
package demo01
import scala.collection.immutable.TreeMap
/**
 * Field names of each HBase table, stored in TreeMaps
 */
object TableFieldNames{
// TODO: use TreeMap so the qualifiers are sorted lexicographically
// a. fields of the behavior-log table
val LOG_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
("id", 0),
("log_id", 1),
("remote_ip", 2),
("site_global_ticket", 3),
("site_global_session", 4),
("global_user_id", 5),
("cookie_text", 6),
("user_agent", 7),
("ref_url", 8),
("loc_url", 9),
("log_time", 10)
)
// b. fields of the goods table
val GOODS_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
("id", 0),
("siteid", 1),
("istest", 2),
("hasread", 3),
("supportonedaylimit", 4),
("orderid", 5),
("cordersn", 6),
("isbook", 7),
("cpaymentstatus", 8),
("cpaytime", 9),
("producttype", 10),
("productid", 11),
("productname", 12),
("sku", 13),
("price", 14),
("number", 15),
("lockednumber", 16),
("unlockednumber", 17),
("productamount", 18),
("balanceamount", 19),
("couponamount", 20),
("esamount", 21),
("giftcardnumberid", 22),
("usedgiftcardamount", 23),
("couponlogid", 24),
("activityprice", 25),
("activityid", 26),
("cateid", 27),
("brandid", 28),
("netpointid", 29),
("shippingfee", 30),
("settlementstatus", 31),
("receiptorrejecttime", 32),
("iswmssku", 33),
("scode", 34),
("tscode", 35),
("tsshippingtime", 36),
("status", 37),
("productsn", 38),
("invoicenumber", 39),
("expressname", 40),
("invoiceexpressnumber", 41),
("postman", 42),
("postmanphone", 43),
("isnotice", 44),
("noticetype", 45),
("noticeremark", 46),
("noticetime", 47),
("shippingtime", 48),
("lessordersn", 49),
("waitgetlesshippinginfo", 50),
("getlesshippingcount", 51),
("outping", 52),
("lessshiptime", 53),
("closetime", 54),
("isreceipt", 55),
("ismakereceipt", 56),
("receiptnum", 57),
("receiptaddtime", 58),
("makereceipttype", 59),
("shippingmode", 60),
("lasttimeforshippingmode", 61),
("lasteditorforshippingmode", 62),
("systemremark", 63),
("tongshuaiworkid", 64),
("orderpromotionid", 65),
("orderpromotionamount", 66),
("externalsalesettingid", 67),
("recommendationid", 68),
("hassendalertnum", 69),
("isnolimitstockproduct", 70),
("hpregisterdate", 71),
("hpfaildate", 72),
("hpfinishdate", 73),
("hpreservationdate", 74),
("shippingopporunity", 75),
("istimeoutfree", 76),
("itemshareamount", 77),
("lessshiptintime", 78),
("lessshiptouttime", 79),
("cbsseccode", 80),
("points", 81),
("modified", 82),
("splitflag", 83),
("splitrelatecordersn", 84),
("channelid", 85),
("activityid2", 86),
("pdorderstatus", 87),
("omsordersn", 88),
("couponcode", 89),
("couponcodevalue", 90),
("storeid", 91),
("storetype", 92),
("stocktype", 93),
("o2otype", 94),
("brokeragetype", 95),
("ogcolor", 96)
)
// c. fields of the user table
val USER_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
("id", 0),
("siteid", 1),
("avatarimagefileid", 2),
("email", 3),
("username", 4),
("password", 5),
("salt", 6),
("registertime", 7),
("lastlogintime", 8),
("lastloginip", 9),
("memberrankid", 10),
("bigcustomerid", 11),
("lastaddressid", 12),
("lastpaymentcode", 13),
("gender", 14),
("birthday", 15),
("qq", 16),
("job", 17),
("mobile", 18),
("politicalface", 19),
("nationality", 20),
("validatecode", 21),
("pwderrcount", 22),
("source", 23),
("marriage", 24),
("money", 25),
("moneypwd", 26),
("isemailverify", 27),
("issmsverify", 28),
("smsverifycode", 29),
("emailverifycode", 30),
("verifysendcoupon", 31),
("canreceiveemail", 32),
("modified", 33),
("channelid", 34),
("grade_id", 35),
("nick_name", 36),
("is_blacklist", 37)
)
// d. fields of the order table
val ORDER_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
("id", 0),
("siteid", 1),
("istest", 2),
("hassync", 3),
("isbackend", 4),
("isbook", 5),
("iscod", 6),
("notautoconfirm", 7),
("ispackage", 8),
("packageid", 9),
("ordersn", 10),
("relationordersn", 11),
("memberid", 12),
("predictid", 13),
("memberemail", 14),
("addtime", 15),
("synctime", 16),
("orderstatus", 17),
("paytime", 18),
("paymentstatus", 19),
("receiptconsignee", 20),
("receiptaddress", 21),
("receiptzipcode", 22),
("receiptmobile", 23),
("productamount", 24),
("orderamount", 25),
("paidbalance", 26),
("giftcardamount", 27),
("paidamount", 28),
("shippingamount", 29),
("totalesamount", 30),
("usedcustomerbalanceamount", 31),
("customerid", 32),
("bestshippingtime", 33),
("paymentcode", 34),
("paybankcode", 35),
("paymentname", 36),
("consignee", 37),
("originregionname", 38),
("originaddress", 39),
("province", 40),
("city", 41),
("region", 42),
("street", 43),
("markbuilding", 44),
("poiid", 45),
("poiname", 46),
("regionname", 47),
("address", 48),
("zipcode", 49),
("mobile", 50),
("phone", 51),
("receiptinfo", 52),
("delayshiptime", 53),
("remark", 54),
("bankcode", 55),
("agent", 56),
("confirmtime", 57),
("firstconfirmtime", 58),
("firstconfirmperson", 59),
("finishtime", 60),
("tradesn", 61),
("signcode", 62),
("source", 63),
("sourceordersn", 64),
("onedaylimit", 65),
("logisticsmanner", 66),
("aftersalemanner", 67),
("personmanner", 68),
("visitremark", 69),
("visittime", 70),
("visitperson", 71),
("sellpeople", 72),
("sellpeoplemanner", 73),
("ordertype", 74),
("hasreadtaobaoordercomment", 75),
("memberinvoiceid", 76),
("taobaogroupid", 77),
("tradetype", 78),
("steptradestatus", 79),
("steppaidfee", 80),
("depositamount", 81),
("balanceamount", 82),
("autocanceldays", 83),
("isnolimitstockorder", 84),
("ccborderreceivedlogid", 85),
("ip", 86),
("isgiftcardorder", 87),
("giftcarddownloadpassword", 88),
("giftcardfindmobile", 89),
("autoconfirmnum", 90),
("codconfirmperson", 91),
("codconfirmtime", 92),
("codconfirmremark", 93),
("codconfirmstate", 94),
("paymentnoticeurl", 95),
("addresslon", 96),
("addresslat", 97),
("smconfirmstatus", 98),
("smconfirmtime", 99),
("smmanualtime", 100),
("smmanualremark", 101),
("istogether", 102),
("isnotconfirm", 103),
("tailpaytime", 104),
("points", 105),
("modified", 106),
("channelid", 107),
("isproducedaily", 108),
("couponcode", 109),
("couponcodevalue", 110),
("ckcode", 111)
)
}
2. Bulk-load the data with HBaseBulkLoader; the code is as follows:
package demo01
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.immutable.TreeMap
/**
 * Convert text data files into HFile files and load them into an HBase table
 */
object HBaseBulkLoader {
def main(args: Array[String]): Unit = {
// The application takes 5 arguments: data type, HBase table name, column family, input path, and output path
/*
args = Array("1", "tbl_tag_logs", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_logs", "/datas/output_hfile/tbl_tag_logs")
args = Array("2", "tbl_tag_goods", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_goods", "/datas/output_hfile/tbl_tag_goods")
args = Array("3", "tbl_tag_users", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_users", "/datas/output_hfile/tbl_tag_users")
args = Array("4", "tbl_tag_orders", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_orders", "/datas/output_hfile/tbl_tag_orders")
*/
if(args.length != 5){
println("Usage: required params: <DataType> <HBaseTable> <Family> <InputDir> <OutputDir>")
System.exit(-1)
}
// Assign the arguments to variables; data types: 1=Log, 2=Goods, 3=User, 4=Order
val Array(dataType, tableName, family, inputDir, outputDir) = args
// Pick the table schema according to the data type
val fieldNames = dataType.toInt match {
case 1 => TableFieldNames.LOG_FIELD_NAMES
case 2 => TableFieldNames.GOODS_FIELD_NAMES
case 3 => TableFieldNames.USER_FIELD_NAMES
case 4 => TableFieldNames.ORDER_FIELD_NAMES
// without a default case an unknown type throws an uninformative MatchError
case _ => throw new IllegalArgumentException(s"Unknown data type: $dataType")
}
// 1. Build the SparkContext instance
val sc: SparkContext = {
// a. create SparkConf and set the application settings
val sparkConf = new SparkConf()
.setMaster("local[2]")
.setAppName(this.getClass.getSimpleName.stripSuffix("$"))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// b. create (or reuse) the SparkContext from it
SparkContext.getOrCreate(sparkConf)
}
// 2. Read the text file and convert the format
val keyValuesRDD: RDD[(ImmutableBytesWritable, KeyValue)] = sc
.textFile(inputDir)
// filter out null lines
.filter(line => null != line)
// extract the fields and build (RowKey, KeyValue) tuples
/*
Key: rowkey + cf + column + version(timestamp)
Value: column value
*/
.flatMap{line => getLineToData(line, family, fieldNames)}
// TODO: sort the data lexicographically
.sortByKey()
// TODO: build the Job and set the related configuration, mainly the output format
// a. load the configuration
val conf: Configuration = HBaseConfiguration.create()
// b. delete the output directory if it already exists
val dfs = FileSystem.get(conf)
val outputPath: Path = new Path(outputDir)
if(dfs.exists(outputPath)){
dfs.delete(outputPath, true)
}
// note: do not close the FileSystem here -- FileSystem.get returns a cached
// instance that the output job below still needs
// TODO: c. configure the HFileOutputFormat2 output
val conn = ConnectionFactory.createConnection(conf)
val htableName = TableName.valueOf(tableName)
val table: Table = conn.getTable(htableName)
val job = Job.getInstance(conf)
HFileOutputFormat2.configureIncrementalLoad(
job,
table,
conn.getRegionLocator(htableName)
)
// TODO: 3. save the data as HFile files; pass the Job's configuration so the
// settings made by configureIncrementalLoad actually take effect
keyValuesRDD.saveAsNewAPIHadoopFile(
outputDir,
classOf[ImmutableBytesWritable],
classOf[KeyValue],
classOf[HFileOutputFormat2],
job.getConfiguration
)
// TODO: 4. load the generated HFiles into the HBase table
val load = new LoadIncrementalHFiles(conf)
load.doBulkLoad(outputPath, conn.getAdmin, table, conn.getRegionLocator(htableName))
// application finished, release resources
conn.close()
sc.stop()
}
/**
 * Extract the fields from one line of a table's data file and wrap them in KeyValue objects
 */
def getLineToData(line: String, family: String,
fieldNames: TreeMap[String, Int]): List[(ImmutableBytesWritable, KeyValue)] = {
val length = fieldNames.size
// split the line on tabs
val fieldValues: Array[String] = line.split("\\t", -1)
if(null == fieldValues || fieldValues.length != length) return Nil
// take the id and build the RowKey
val id: String = fieldValues(0)
val rowKey = Bytes.toBytes(id)
val ibw: ImmutableBytesWritable = new ImmutableBytesWritable(rowKey)
// column family
val columnFamily: Array[Byte] = Bytes.toBytes(family)
// build the KeyValue objects
fieldNames.toList.map{ case (fieldName, fieldIndex) =>
// one KeyValue per column
val keyValue = new KeyValue(
rowKey, //
columnFamily, //
Bytes.toBytes(fieldName), //
Bytes.toBytes(fieldValues(fieldIndex)) //
)
// return the (rowkey, KeyValue) pair
(ibw, keyValue)
}
}
}
To run the application, pass the five arguments: data type, HBase table name, column family, input path, and output path. For testing in IDEA, the program arguments are:

// a. behavior logs
1 tbl_tag_logs detail /user/hive/warehouse/tags_dat2.db/tbl_logs /datas/output_hfile/tbl_logs
// b. goods data
2 tbl_tag_goods detail /user/hive/warehouse/tags_dat2.db/tbl_goods /datas/output_hfile/tbl_goods
// c. user data
3 tbl_tag_users detail /user/hive/warehouse/tags_dat2.db/tbl_users /datas/output_hfile/tbl_users
// d. order data
4 tbl_tag_orders detail /user/hive/warehouse/tags_dat2.db/tbl_orders /datas/output_hfile/tbl_orders
Common Spark bulkload errors:

1. saveAsNewAPIHadoopFile throws "Added a key not lexically larger than previous": the data being written is not sorted correctly.
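One common trigger (an illustrative note, not from the original text): HFile keys are compared as raw bytes, so string rowkeys sort lexicographically, not numerically. A job that pre-sorts or partitions numeric string ids by their numeric value will emit keys out of byte order. A minimal sketch of the difference:

```scala
object LexicalOrderDemo {
  def main(args: Array[String]): Unit = {
    val ids = List("2", "10", "9", "100")
    // numeric order: 2, 9, 10, 100
    val numeric = ids.sortBy(_.toInt)
    // byte/lexicographic order, which is what HFiles require: 10, 100, 2, 9
    val lexical = ids.sorted
    println(numeric.mkString(","))  // 2,9,10,100
    println(lexical.mkString(","))  // 10,100,2,9
  }
}
```

A common workaround is to zero-pad numeric rowkeys (e.g. "002", "010") so that byte order and numeric order coincide.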
2. ImmutableBytesWritable cannot be serialized. Set the following (shown in Java; the Scala equivalent works too):
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.registerKryoClasses(new Class[]{org.apache.hadoop.hbase.io.ImmutableBytesWritable.class});
3. The comparator cannot be serialized. Make the comparator implement the Serializable interface.
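For example (a sketch, not from the original code; the class name is made up): a custom ordering over raw rowkey bytes, as you would pass to repartitionAndSortWithinPartitions, is shipped to executors via Java serialization, so it must be Serializable. Scala's Ordering trait already extends Serializable, but a plain java.util.Comparator does not, which is where this error usually comes from. Mixing in Serializable is enough:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// an ordering over raw rowkey bytes; Serializable so Spark can ship it to executors
class ByteArrayOrdering extends Ordering[Array[Byte]] with Serializable {
  override def compare(a: Array[Byte], b: Array[Byte]): Int = {
    // unsigned lexicographic comparison, the same order HBase uses for rowkeys
    val len = math.min(a.length, b.length)
    var i = 0
    while (i < len) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff)
      if (cmp != 0) return cmp
      i += 1
    }
    a.length - b.length
  }
}

object SerializableOrderingDemo {
  def main(args: Array[String]): Unit = {
    val ord = new ByteArrayOrdering
    // round-trip through Java serialization, as Spark's closure shipping does
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(ord)
    val restored = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
      .readObject().asInstanceOf[ByteArrayOrdering]
    println(restored.compare("10".getBytes, "2".getBytes) < 0)  // true
  }
}
```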
4. Objects initialized in the driver cannot be accessed inside RDD actions or transformations; broadcast them first.
5. Overall job efficiency can be improved by sizing the number of executors dynamically.
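For the last point, a hedged sketch of what enabling dynamic executor allocation might look like when submitting this job to a YARN cluster (the jar name, executor bounds, and paths are illustrative, not from the original text):

```shell
# dynamic allocation requires the external shuffle service on YARN
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class demo01.HBaseBulkLoader \
  hbase-bulkload.jar \
  1 tbl_tag_logs detail /user/hive/warehouse/tags_dat2.db/tbl_logs /datas/output_hfile/tbl_tag_logs
```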