Contents

  • I. Data Source
  • II. ETL with Spark SQL
  • III. Writing Data to HBase
  • IV. Reading from HBase for Statistical Analysis
  • V. Writing Analysis Results to MySQL
  • 1. Writing to MySQL with RDD
  • 2. Writing to MySQL with DataFrame
  • VI. Tuning Summary
  • VII. Extensions
  • 1. Custom HBase Data Source
  • 2. Refactoring with Kudu


I. Data Source

User behavior logs:

110.85.18.234 - - [30/Jan/2019:00:00:21 +0800] "GET /course/list?c=cb HTTP/1.1" 200 12800 "www.imooc.com" "https://www.imooc.com/course/list?c=data" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.16.243:80 200 0.172 0.172
218.74.48.154 - - [30/Jan/2019:00:00:22 +0800] "GET /.well-known/apple-app-site-association HTTP/1.1" 200 165 "www.imooc.com" "-" - "swcd (unknown version) CFNetwork/974.2.1 Darwin/18.0.0" "-" 10.100.135.47:80 200 0.001 0.001
113.77.139.245 - - [30/Jan/2019:00:00:22 +0800] "GET /static/img/common/new.png HTTP/1.1" 200 1020 "www.imooc.com" "https://www.imooc.com/" - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3642.0 Safari/537.36" "-" 10.100.16.241:80 200 0.001 0.001
113.77.139.245 - - [30/Jan/2019:00:00:22 +0800] "GET /static/img/menu_icon.png HTTP/1.1" 200 4816 "www.imooc.com" "https://www.imooc.com/" - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3642.0 Safari/537.36" "-" 10.100.16.243:80 200 0.001 0.001
106.38.241.68 - - [30/Jan/2019:00:00:22 +0800] "GET /wenda/detail/430191 HTTP/1.1" 200 8702 "www.imooc.com" "-" - "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-" 10.100.137.42:80 200 0.216 0.219
111.197.20.121 - - [30/Jan/2019:00:00:22 +0800] "GET /.well-known/apple-app-site-association HTTP/1.1" 200 165 "www.imooc.com" "-" - "swcd (unknown version) CFNetwork/893.14.2 Darwin/17.3.0" "-" 10.100.136.65:80 200 0.001 0.001
110.85.18.234 - - [30/Jan/2019:00:00:22 +0800] "GET /u/card%20?jsonpcallback=jQuery19106008894766558066_1548777623367&_=1548777623368 HTTP/1.1" 200 382 "www.imooc.com" "https://www.imooc.com/course/list?c=cb" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.16.241:80 200 0.059 0.059
110.85.18.234 - - [30/Jan/2019:00:00:22 +0800] "GET /activity/newcomer HTTP/1.1" 200 444 "www.imooc.com" "https://www.imooc.com/course/list?c=cb" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.136.64:80 200 0.103 0.103
194.54.83.182 - - [30/Jan/2019:00:00:22 +0800] "GET / HTTP/1.1" 301 178 "imooc.com" "-" - "Go-http-client/1.1" "-" - - - 0.000
110.85.18.234 - - [30/Jan/2019:00:00:22 +0800] "GET /u/loading HTTP/1.1" 200 64 "www.imooc.com" "https://www.imooc.com/course/list?c=cb" - "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0" "-" 10.100.16.243:80 200 0.015 0.015

Since the downstream business logic only needs a few of these fields, we extract just those fields for convenience.

import spark.implicits._
val logRDD: RDD[String] = spark.sparkContext.textFile("/Users/rocky/IdeaProjects/imooc-workspace/spark-project-train/src/data/test-access.log")

var logDF: DataFrame = logRDD.map(x => {
  // ...val splits: Array[String] = x.split("\"")...  the log format is fairly complex and has to be split several times
  // ...val splits2: Array[String] = x.split(" ")...

  val ip: String = splits(0)
  val time: String = splits(3)
  val url: String = splits(4)
  val referer: String = splits(8)
  val ua: String = splits(10)
  // ...the User Agent parsing can also be done right here at the start...
  (ip, time, url, referer, ua, browsername, browserversion, osname, osversion)
}).toDF("ip", "time", "url", "referer", "ua", "browsername", "browserversion", "osname", "osversion")

IP mapping data:

1.0.1.0|1.0.3.255|16777472|16778239|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
val ipRowRDD: RDD[String] = spark.sparkContext.textFile("file:///Users/rocky/IdeaProjects/imooc-workspace/sparksql-train/data/ip.txt")

val ipRuleDF: DataFrame = ipRowRDD.map(x => {
  val splits: Array[String] = x.split("\\|")
  val startIP: Long = splits(2).toLong
  val endIP: Long = splits(3).toLong
  val province: String = splits(6)
  val city: String = splits(7)
  val isp: String = splits(9)
  (startIP, endIP, province, city, isp)
}).toDF("start_ip", "end_ip", "province", "city", "isp")

II. ETL with Spark SQL

To use UDFs, the following import is required:

import org.apache.spark.sql.functions._

1. Use a UDF to add an ip_long column that converts the IP string into a Long, so it can later be joined with ipRuleDF.

def getLongIp() = udf((ip: String) => {
  val splits: Array[String] = ip.split("[.]")
  var ipNum = 0L

  for (i <- 0 until splits.length) {
    ipNum = splits(i).toLong | ipNum << 8L
  }
  ipNum
})

logDF = logDF.withColumn("ip_long", getLongIp()($"ip"))

2. Use a UDF to normalize the timestamp into the format yyyyMMddHHmm.

def formatTime() = udf((time:String) =>{
  FastDateFormat.getInstance("yyyyMMddHHmm").format(
    new Date(FastDateFormat.getInstance("dd/MMM/yyyy:HH:mm:ss Z",Locale.ENGLISH)
      .parse(time.substring(time.indexOf("[")+1, time.lastIndexOf("]"))).getTime
    ))
})

logDF = logDF.withColumn("formattime", formatTime()(logDF("time")))

3. This step does not use a UDF; the User Agent field is parsed at the RDD stage, before the DataFrame is created, adding the four columns browsername, browserversion, osname and osversion.
A ready-made parsing library found on GitHub speeds up development:

<dependency>
  <groupId>cz.mallat.uasparser</groupId>
  <artifactId>uasparser</artifactId>
  <version>0.6.2</version>
</dependency>
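
How the four UA columns can be derived with this library (a minimal sketch; the getter names below are from memory and should be checked against the uasparser documentation):

import cz.mallat.uasparser.{OnlineUpdater, UASparser}

// build the parser once per executor/JVM, not once per record
val uaParser = new UASparser(OnlineUpdater.getVendoredInputStream())

val info = uaParser.parse(ua)                    // ua is the raw User Agent string
val browsername    = info.getUaFamily            // e.g. "Chrome"
val browserversion = info.getBrowserVersionInfo  // e.g. "58.0.3029.110"
val osname         = info.getOsFamily            // e.g. "Windows"
val osversion      = info.getOsName              // e.g. "Windows 10"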

4. Join logDF with ipRuleDF to add the three columns province, city and isp.

logDF.createOrReplaceTempView("logs")
ipRuleDF.createOrReplaceTempView("ips")

val sql = SQLUtils.SQL
val result: DataFrame = spark.sql(sql)

object SQLUtils {

  lazy val SQL = "select " +
    "logs.ip ," +
    "logs.formattime," +
    "logs.url," +
    "logs.referer," +
    "logs.browsername," +
    "logs.browserversion," +
    "logs.osname," +
    "logs.osversion," +
    "logs.ua" +
    ",ips.province as provincename" +
    ",ips.city as cityname" +
    ",ips.isp as isp" +
    "from logs left join " +
    "ips on logs.ip_long between ips.start_ip and ips.end_ip "
}

III. Writing Data to HBase

1. The rowkey is designed as day + crc32(referer+url+ip+ua); every row of the DataFrame gets its own rowkey.

val rowkey = getRowKey(day, referer+url+ip+ua)

def getRowKey(time:String, info:String) = {

  val builder = new StringBuilder(time)      // interview topic: StringBuilder vs StringBuffer
  builder.append("_")                        // interview tip: for string concatenation, avoid using + where possible

  val crc32 = new CRC32()
  crc32.reset()
  if(StringUtils.isNotEmpty(info)){
    crc32.update(Bytes.toBytes(info))
  }
  builder.append(crc32.getValue)

  builder.toString()
}

2. Convert RDD[sql.Row] => RDD[(ImmutableBytesWritable, Put)].
Each DataFrame row has its own rowkey, each rowkey corresponds to one Put object, and every (cf, column, value) of that row is written into the Put.

val hbaseInfoRDD = result.rdd.map(x => {     // result: the joined DataFrame produced by the ETL step
  val ip = x.getAs[String]("ip")
  val formattime = x.getAs[String]("formattime")
  val province = x.getAs[String]("provincename")
  val city = x.getAs[String]("cityname")
  val url = x.getAs[String]("url")
  val referer = x.getAs[String]("referer")
  val browsername = x.getAs[String]("browsername")
  val browserversion = x.getAs[String]("browserversion")
  val osname = x.getAs[String]("osname")
  val osversion = x.getAs[String]("osversion")
  val ua = x.getAs[String]("ua")

  val columns = scala.collection.mutable.HashMap[String,String]()
  columns.put("ip",ip)
  columns.put("province",province)
  columns.put("city",city)
  columns.put("formattime",formattime)
  columns.put("url",url)
  columns.put("referer",referer)
  columns.put("browsername",browsername)
  columns.put("browserversion",browserversion)
  columns.put("osname",osname)
  columns.put("osversion",osversion)

  val rowkey = getRowKey(day, referer+url+ip+ua) // HBase rowkey
  val put = new Put(Bytes.toBytes(rowkey)) // Put object to be written to HBase

  // all columns of column family "o" for this rowkey
  for((k,v) <- columns) {
    put.addColumn(Bytes.toBytes("o"), Bytes.toBytes(k), Bytes.toBytes(v))
  }

  (new ImmutableBytesWritable(rowkey.getBytes), put)
})

3. Write the data into HBase, creating one table per day. If the job fails and must be rerun, the existing table should be dropped first and then rewritten.

val day = args(0)                      // the day is passed in from the outside
val input = s"hdfs://hadoop000:8020/access/$day/*"


val conf = new Configuration()
conf.set("hbase.rootdir","hdfs://hadoop000:8020/hbase")
conf.set("hbase.zookeeper.quorum","hadoop000:2181")

val tableName = createTable(day, conf)

// set which table the data is written to
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

// save the data
hbaseInfoRDD.saveAsNewAPIHadoopFile(
  "hdfs://hadoop000:8020/etl/access/hbase",
  classOf[ImmutableBytesWritable],       //kClass
  classOf[Put],                          //vClass
  classOf[TableOutputFormat[ImmutableBytesWritable]],   //outputFormatClass
  conf
)
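
The createTable helper referenced above is not shown; a minimal sketch using the HBase Admin API (the access_<day> table name and the single column family "o" are taken from the rest of the article, and the drop-and-recreate behaviour follows the paragraph above):

def createTable(day: String, conf: Configuration): String = {
  val table = "access_" + day
  val connection = ConnectionFactory.createConnection(conf)
  val admin = connection.getAdmin
  try {
    val tableName = TableName.valueOf(table)
    // rerun support: drop the table if it already exists
    if (admin.tableExists(tableName)) {
      admin.disableTable(tableName)
      admin.deleteTable(tableName)
    }
    val desc = new HTableDescriptor(tableName)
    desc.addFamily(new HColumnDescriptor("o"))
    admin.createTable(desc)
    table
  } finally {
    admin.close()
    connection.close()
  }
}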

IV. Reading from HBase for Statistical Analysis

1. Scan the HBase table, adding the columns to be read to the Scan, then read the data through Spark's newAPIHadoopRDD.

// connect to HBase
val conf = new Configuration()
conf.set("hbase.rootdir", "hdfs://hadoop000:8020/hbase")
conf.set("hbase.zookeeper.quorum", "hadoop000:2181")

val tableName = "access_" + day
conf.set(TableInputFormat.INPUT_TABLE, tableName) // which table to read the data from

val scan = new Scan()

// set the column family to query
scan.addFamily(Bytes.toBytes("o"))

// set the columns to query
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("country"))
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("province"))
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("browsername"))

// set the serialized Scan on the conf
conf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

// read the data through Spark's newAPIHadoopRDD
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],          //inputFormatClass
  classOf[ImmutableBytesWritable],    //kClass
  classOf[Result]                     //vClass
)

2. Three ways to analyze the RDD that was read: count the totals per browser and sort them in descending order.
cache() is the most commonly used optimization here.

hbaseRDD.cache()      // the most common optimization: caching

// statistics with Spark Core
hbaseRDD.map(x => {
  val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
  (browsername, 1)
}).reduceByKey(_ + _)
  .map(x => (x._2, x._1)).sortByKey(false)
  .map(x => (x._2, x._1)).foreach(println)

// statistics with the DataFrame API
hbaseRDD.map(x => {
  val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
  Browser(browsername)
}).toDF().select("browsername").groupBy("browsername").count().orderBy(desc("count")).show(false)

// statistics with Spark SQL
hbaseRDD.map(x => {
  val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
  Browser(browsername)
}).toDF().createOrReplaceTempView("tmp")
spark.sql("select browsername,count(1) cnt from tmp group by browsername order by cnt desc").show(false)

case class Browser(browsername: String)

V. Writing Analysis Results to MySQL

The day parameter (typically yesterday's date) can be generated by the scheduler with:

date -d"1 day ago" +"%Y%m%d"

Create the MySQL table:

create table if not exists browser_stat(
day varchar(10) not null,                -- no primary key needed: on a rerun, the old rows for the day are simply deleted first
browser varchar(100) not null,
cnt int
) engine=innodb default charset=utf8;

1. Writing to MySQL with RDD
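
resultRDD below is assumed to be the (browsername, count) pairs produced by the Spark Core aggregation in section IV, e.g.:

val resultRDD = hbaseRDD.map(x => {
  val browsername = Bytes.toString(x._2.getValue("o".getBytes, "browsername".getBytes))
  (browsername, 1)
}).reduceByKey(_ + _)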

resultRDD.coalesce(1).foreachPartition(part => {
  Try{
    // TODO... write the statistics to MySQL
    val connection = {
      Class.forName("com.mysql.jdbc.Driver")
      val url = "jdbc:mysql://hadoop000:3306/spark?characterEncoding=UTF-8"
      val user = "root"
      val password = "root"
      DriverManager.getConnection(url, user, password)
    }

    val preAutoCommit = connection.getAutoCommit
    connection.setAutoCommit(false)

    val sql = "insert into browser_stat (day,browser,cnt) values(?,?,?)"
    val pstmt = connection.prepareStatement(sql)                 // precompile the SQL statement
    pstmt.addBatch(s"delete from browser_stat where day='$day'") // queue the delete of any existing rows for this day first

    part.foreach(x => {
      pstmt.setString(1, day)
      pstmt.setString(2, x._1)
      pstmt.setInt(3,  x._2)

      pstmt.addBatch()                         // add this row to the batch
    })

    pstmt.executeBatch()
    connection.commit()

    (connection, preAutoCommit)
  } match {
    case Success((connection, preAutoCommit)) => {
      connection.setAutoCommit(preAutoCommit)
      if(null != connection) connection.close()
    }
    case Failure(e) => throw e
  }
})

2. Writing to MySQL with DataFrame

There are several ways for Spark to write to MySQL; one of them is sketched below.
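
A minimal sketch of the DataFrame JDBC approach, reusing the connection settings from the RDD version above (the table and columns follow the browser_stat DDL; the resultDF name is an assumption):

import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "root")
props.put("driver", "com.mysql.jdbc.Driver")

// resultDF is assumed to hold the columns day, browser and cnt
resultDF.coalesce(1)
  .write
  .mode(SaveMode.Append)   // Append, because the old rows for the day are deleted separately
  .jdbc("jdbc:mysql://hadoop000:3306/spark?characterEncoding=UTF-8", "browser_stat", props)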

VI. Tuning Summary

  1. Cache with cache().
  2. When writing to MySQL and the result set is small, use coalesce to reduce the number of partitions.
  3. If the MySQL write fails midway, simply delete that day's rows before rewriting; there is no need to constrain day as a primary key.
  4. Skip the WAL and manually flush the MemStore to persist the data:
put.setDurability(Durability.SKIP_WAL) // disable WAL


def flushTable(table:String, conf:Configuration): Unit = {

  var connection:Connection = null
  var admin:Admin = null
  try {
    connection = ConnectionFactory.createConnection(conf)
    admin = connection.getAdmin

    admin.flush(TableName.valueOf(table))           //MemStore==>StoreFile
  } catch {
    case e:Exception => e.printStackTrace()
  } finally {
    if(null != admin) {
      admin.close()
    }

    if(null != connection) {
      connection.close()
    }
  }
}
  5. Use Spark to generate HFile files directly from the DF/RDD and bulk-load the data into the HBase table:
val hbaseInfoRDD = logDF.rdd.mapPartitions(partition => {

  partition.flatMap(x => {
    // ...extraction of x's fields omitted...

    val rowkey = getRowKey(day, referer+url+ip+ua)  // HBase rowkey
    val rk = Bytes.toBytes(rowkey)

    val list = new ListBuffer[((String,String),KeyValue)]()
    // one KeyValue per column of the cf for this rowkey
    for((k,v) <- columns) {
      val keyValue = new KeyValue(rk, "o".getBytes, Bytes.toBytes(k), Bytes.toBytes(v))
      list += (rowkey,k) -> keyValue
    }

    list.toList
  })
}).sortByKey()         // must be sorted by (rowkey, column)
  .map(x => (new ImmutableBytesWritable(Bytes.toBytes(x._1._1)), x._2))

When writing to HBase directly, rows are sorted by rowkey and the cells within a row are sorted by column. Generating HFile files directly therefore requires the data to be sorted by (rowkey, column), which is why the (rowkey, k) -> keyValue structure is needed.

val job = NewAPIHadoopJob.getInstance(conf)
val table = new HTable(conf, tableName)
// configure the MapReduce job for an incremental load into the given table
HFileOutputFormat2.configureIncrementalLoad(job,table.getTableDescriptor,table.getRegionLocator)     

val output = "hdfs://hadoop000:8020/etl/access3/hbase"
val outputPath = new Path(output)
hbaseInfoRDD.saveAsNewAPIHadoopFile(
  output,
  classOf[ImmutableBytesWritable],     //keyClass
  classOf[KeyValue],                   //valueClass
  classOf[HFileOutputFormat2],         //outputFormatClass
  job.getConfiguration
)

if(FileSystem.get(conf).exists(outputPath)) {
  val load = new LoadIncrementalHFiles(job.getConfiguration)
  load.doBulkLoad(outputPath, table)

  FileSystem.get(conf).delete(outputPath, true)
}

VII. Extensions

1. Custom HBase Data Source

As the interfaces show, a DataFrame is essentially RDD[Row] + schema.

package org.apache.spark.sql.sources

abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
}

@InterfaceStability.Stable
trait TableScan {
  def buildScan(): RDD[Row]
}

@InterfaceStability.Stable
trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

@InterfaceStability.Stable
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

@InterfaceStability.Stable
trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}

Define a DefaultSource class; parameters holds the options passed in via option(). Then implement HBaseRelation.

class DefaultSource extends RelationProvider{
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    HBaseRelation(parameters)(sqlContext)
  }
}

HBaseRelation needs to provide schema and buildScan, i.e. schema + RDD[Row].

The schema defined by option("spark.table.name", "(age int,name string,sex string)") is converted into Spark data types:

"age int"  => SparkSchema("age", "int") => StructField("age", IntegerType)
"name string" => SparkSchema("name", "string") => StructField("name", StringType)
"sex string" => SparkSchema("sex", "string") => StructField("sex", StringType)

HBase itself has no data types (every value is stored as bytes), so each value read from hbaseRDD must be converted according to the Spark schema above.

The value of one hbaseRDD record looks like:
keyvalues={0001/o:age/1645116727964/Put/vlen=2/seqid=0, 
		   0001/o:name/1645116727901/Put/vlen=3/seqid=0, 
		   0001/o:sex/1645116728008/Put/vlen=3/seqid=0}

age_value => SparkSchema("age", "int") => Integer.parseInt(new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("age"))))
name_value => SparkSchema("name", "string") => new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("name")))
sex_value => SparkSchema("sex", "string") => new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("sex")))

The three values are then assembled into a Row:

case class HBaseRelation(@transient val parameters: Map[String, String])
                        (@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan{

  override def schema: StructType = {
    val schema = sparkFields.map(field => {
      val structField = field.fieldType.toLowerCase match {
        case "string" => StructField(field.fieldName, StringType)
        case "int" => StructField(field.fieldName, IntegerType)
      }
      structField
    })
    new StructType(schema)
  }

  override def buildScan(): RDD[Row] = {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", zookeeperAddress)
    hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTable)
    //hbaseConf.set(TableInputFormat.SCAN_COLUMNS, "")
    val hbaseRDD = sqlContext.sparkContext.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )

    hbaseRDD.map(_._2).map(result => {             // result = 11,zhu,nan
      val buffer = new ArrayBuffer[Any]()
      sparkFields.foreach(field => {
        field.fieldType.toLowerCase match {
          case "string" => {
            buffer += new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes(field.fieldName)))
          }
          case "int" => {
            buffer += Integer.parseInt(new String(result.getValue(Bytes.toBytes("o"), Bytes.toBytes(field.fieldName))))
          }
        }
      })
      Row.fromSeq(buffer)
    })
  }
}

As written, this implementation can only scan the whole table; hbaseConf.set(TableInputFormat.SCAN_COLUMNS, "...") can be added to fetch just a few columns.
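
For example (a sketch; SCAN_COLUMNS takes a space-separated list of cf:qualifier entries):

// restrict the scan to the age and name columns of column family "o"
hbaseConf.set(TableInputFormat.SCAN_COLUMNS, "o:age o:name")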

val df = spark.read.format("com.chengyanban.hbase")
  .option("zookeeper.address","hadoop000:2181")
  .option("hbase.table.name","user")
  .option("spark.table.name","(age int,name string,sex string)")
  .load()
df.printSchema()
df.show()

2. Refactoring with Kudu


Kudu is a columnar data store: it stores data in strongly typed columns. With a proper design, this brings several advantages for analytical or data-warehouse workloads.

  1. For analytical queries, you can read a single column, or a portion of that column, while ignoring the other columns. This means a query can be satisfied while reading a minimum number of blocks from disk. With row-based storage, you need to read the entire row even if you only return values from a few columns.
  2. Because a given column contains only one type of data, pattern-based compression is orders of magnitude more efficient than compressing the mixed data types used in row-based solutions. Combined with the efficiency of reading data per column, compression lets you satisfy a query while reading even fewer blocks from disk.

Table: in Kudu, a table is where your data is stored. A table has a schema and a totally ordered primary key. A table is split into segments called tablets.

Tablet: a tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any point in time one of these replicas is considered the leader tablet. Any replica can service reads, while writes require consensus among the set of tablet servers serving the tablet.

Tablet Server: a tablet server stores tablets and serves them to clients. For a given tablet, one tablet server acts as its leader and the others act as follower replicas. Only the leader services write requests, while either the leader or a follower services read requests. Leaders are elected using the Raft Consensus Algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.

Master: the master keeps track of all tablets, tablet servers, the Catalog Table, and other metadata related to the cluster. At a given point in time there can only be one acting master (the leader). If the current leader disappears, a new master is elected using the Raft Consensus Algorithm.
The master also coordinates metadata operations for clients. For example, when a new table is created, the client internally sends the request to the master. The master writes the metadata for the new table into the Catalog Table and coordinates the process of creating tablets on the tablet servers.
All of the master's data is stored in a tablet, which can be replicated to all the other candidate masters.
Tablet servers heartbeat to the master at a set interval (the default is once per second).

Catalog Table: the Catalog Table is the central location for metadata in Kudu. It stores information about tables and tablets. The Catalog Table may not be read or written directly; it is accessible only via metadata operations exposed in the client API.
It stores two categories of metadata:

  • Tables: table schemas, locations, and states
  • Tablets: the list of existing tablets, which tablet servers have replicas of each tablet, the tablet's current state, and start and end keys.

No custom data source is needed, which makes this very convenient:

val odsDF = spark.read.format("org.apache.kudu.spark.kudu")
  .option("kudu.master", masterAdress)
  .option("kudu.table", sourceTableName)
  .load()

data.write.mode(SaveMode.Append).format("org.apache.kudu.spark.kudu")
  .option("kudu.master", master)
  .option("kudu.table", tableName)
  .save()

When creating the table, you need to set the number of replicas, the column to hash-partition on, and the number of buckets:

val options: CreateTableOptions = new CreateTableOptions()
options.setNumReplicas(1)

val parcols: util.LinkedList[String] = new util.LinkedList[String]()
parcols.add(partitionID)
options.addHashPartitions(parcols,3)

client.createTable(tableName, schema, options)
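
The client above is a KuduClient; a minimal sketch of how it can be obtained (the masterAddress value is an assumption matching the kudu.master option used earlier):

import org.apache.kudu.client.KuduClient

// synchronous Kudu client pointed at the Kudu master(s)
val client: KuduClient = new KuduClient.KuduClientBuilder(masterAddress).build()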

Define the schema (key columns must be specified):

lazy val ProvinceCitySchema: Schema = {
  val columns = List(
    new ColumnSchemaBuilder("provincename",Type.STRING).nullable(false).key(true).build(),
    new ColumnSchemaBuilder("cityname",Type.STRING).nullable(false).key(true).build(),
    new ColumnSchemaBuilder("cnt",Type.INT64).nullable(false).key(true).build()
  ).asJava
  new Schema(columns)
}
