Java Spark Repartitioning: Spark JDBC Partitioning

Reposted

mob6454cc7796a7 2023-10-26 14:17:32


Summary

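This article analyzes how Spark SQL determines the number of partitions, and the range of data each partition loads, when it reads data over JDBC, for example from MySQL. With this analysis in hand, we can roughly understand what happens under the hood when Spark reads JDBC data, and avoid a few pitfalls.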

The Spark DataFrame jdbc interface

/**
   * Construct a `DataFrame` representing the database table accessible via JDBC URL
   * url named table. Partitions of the table will be retrieved in parallel based on the parameters
   * passed to this function.
   *
   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
   * your external database systems.
   *
   * @param url JDBC database url of the form `jdbc:subprotocol:subname`.
   * @param table Name of the table in the external database.
   * @param columnName the name of a column of numeric, date, or timestamp type
   *                   that will be used for partitioning.
   * @param lowerBound the minimum value of `columnName` used to decide partition stride.
   * @param upperBound the maximum value of `columnName` used to decide partition stride.
   * @param numPartitions the number of partitions. This, along with `lowerBound` (inclusive),
   *                      `upperBound` (exclusive), form partition strides for generated WHERE
   *                      clause expressions used to split the column `columnName` evenly. When
   *                      the input is less than 1, the number is set to 1.
   * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
   *                             tag/value. Normally at least a "user" and "password" property
   *                             should be included. "fetchsize" can be used to control the
   *                             number of rows per fetch and "queryTimeout" can be used to wait
   *                             for a Statement object to execute to the given number of seconds.
   * @since 1.4.0
   */
  def jdbc(
      url: String,
      table: String,
      columnName: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      connectionProperties: Properties)

The pitfall of too many partitions

The documentation of the method above contains one very important sentence:

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

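In other words, do not create too many parallel partitions when loading JDBC data, or Spark may bring down your JDBC data source, such as your MySQL instance. Consider this scenario: on a large Spark cluster the parallelism can reach hundreds or thousands of tasks, and we load data from a small MySQL database while also specifying hundreds or thousands of partitions. When such a job runs, hundreds or thousands of connections hit MySQL at the same time and load large amounts of data in parallel. Under that kind of load, MySQL is very likely to collapse.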

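This really happened to me in production. I was using a Spark job to load MySQL data owned by another department. The job's parallelism was not high, only a dozen or so, but their MySQL was a single node with just a simple master-slave setup. Less than 2 days after they granted us access, their MySQL started showing slow queries frequently. In the end they refused to let us read their database with Spark, and we had to fall back to reading the data we needed in small rolling batches with Python. Fortunately the incident did not turn into a serious production outage, but it was certainly enough to get our attention.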
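One way to reason about the overload described in this section is to estimate how many connections a partitioned read can open at once and to cap the partition count accordingly. The sketch below is only an illustration: it assumes each running partition task holds one JDBC connection, and the names `peakConnections`, `cappedPartitions`, `taskSlots`, and `maxDbConnections` are hypothetical helpers, not part of Spark.

```scala
object ConnectionBudget {
  // Assumption: each running partition task holds one JDBC connection, and no more
  // tasks can run at once than the cluster has task slots (executors * cores).
  def peakConnections(numPartitions: Int, taskSlots: Int): Int =
    math.min(math.max(numPartitions, 1), taskSlots)

  // Cap the requested partition count to a connection budget agreed with the
  // database owner (hypothetical parameter), never going below one partition.
  def cappedPartitions(requested: Int, maxDbConnections: Int): Int =
    math.max(1, math.min(requested, maxDbConnections))
}
```

For instance, a budget of 8 connections turns a request for 1000 partitions into 8. As the incident above shows, even a dozen or so parallel readers can be too many for a small database, so the budget itself has to come from what the source can actually sustain.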

Partition-related parameters

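Having covered the pitfall of too many JDBC partitions, this section focuses on how the partitions of a JDBC load are determined. The jdbc method above has these four parameters: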

columnName: String,
lowerBound: Long,
upperBound: Long,
numPartitions: Int

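These are the parameters tied to the partitioning logic. columnName is the partition column, that is, the column whose values are used to split the data into partitions. lowerBound is the lower bound of the partition column's values (inclusive) and upperBound is the upper bound (exclusive). numPartitions is the easiest to understand: the number of partitions we would like the JDBC data to be loaded in. This value is only a request and does not necessarily take effect, for reasons analyzed below.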
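The same four values can also be supplied as string options of the JDBC data source, which is the form in which they reach the `parameters` map of the read path. The following is a minimal sketch of building that map; the helper name `partitionedReadOptions` and the example URL and table name are made up for illustration, while the option keys (`url`, `dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are the ones documented for Spark's JDBC source.

```scala
object JdbcReadOptions {
  // Builds the option map for a partitioned JDBC read.
  // All values are strings because data source options are passed as Map[String, String].
  def partitionedReadOptions(
      url: String,
      table: String,
      columnName: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int): Map[String, String] =
    Map(
      "url" -> url,
      "dbtable" -> table,
      "partitionColumn" -> columnName,
      "lowerBound" -> lowerBound.toString,
      "upperBound" -> upperBound.toString,
      "numPartitions" -> numPartitions.toString
    )

  // Hypothetical connection details, used only to show the shape of the map.
  val example: Map[String, String] =
    partitionedReadOptions("jdbc:mysql://db.example.com:3306/demo", "funnel", "id", 2L, 5L, 3)
}
```

With a Spark session available, such a map could be handed to `spark.read.format("jdbc").options(JdbcReadOptions.example).load()`, together with the usual `user` and `password` options.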

Two important classes

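Spark's logic for reading and writing JDBC data is implemented through the JdbcRelationProvider class. JdbcRelationProvider mainly provides the interface methods for reading and writing JDBC data, while the more detailed logic lives in the JDBCRelation class. The definition of JdbcRelationProvider and its JDBC read and write methods are as follows: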

class JdbcRelationProvider extends CreatableRelationProvider
  with RelationProvider with DataSourceRegister {

  override def shortName(): String = "jdbc"

  // Method for reading JDBC data
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val jdbcOptions = new JDBCOptions(parameters)
    val resolver = sqlContext.conf.resolver
    val timeZoneId = sqlContext.conf.sessionLocalTimeZone
    val schema = JDBCRelation.getSchema(resolver, jdbcOptions)
    val parts = JDBCRelation.columnPartition(schema, resolver, timeZoneId, jdbcOptions)
    JDBCRelation(schema, parts, jdbcOptions)(sqlContext.sparkSession)
  }

  // Method for writing JDBC data
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame)

This article focuses on the JDBC read logic.

Determining the number of partitions

In the JDBC read method above, one line stands out:

val parts = JDBCRelation.columnPartition(schema, resolver, timeZoneId, jdbcOptions)

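This is the call that determines the partitions. Let's step into its implementation. The method is fairly long, so only the key code is analyzed here.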

      // Take the user-supplied parameters as the starting values
      val partitionColumn = jdbcOptions.partitionColumn
      val lowerBound = jdbcOptions.lowerBound
      val upperBound = jdbcOptions.upperBound
      val numPartitions = jdbcOptions.numPartitions

      ....
      ....
      // The logic that actually decides the number of partitions
      val numPartitions =
        if ((upperBound - lowerBound) >= partitioning.numPartitions || /* check for overflow */
          (upperBound - lowerBound) < 0) {
          partitioning.numPartitions
        } else {
          logWarning("log message omitted here")
          upperBound - lowerBound
        }

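The code makes it clear that the number of partitions is determined by comparing the difference between the upper and lower bounds of the partition column with the requested number of partitions. If the difference is greater than or equal to the requested number (or the subtraction overflows and turns negative), the requested number becomes the final number of partitions; otherwise the difference itself becomes the number of partitions. Why do it this way?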

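It is easy to understand. Suppose the bounds span only 5 values, that is, at most 5 rows with distinct integer keys, but we pass 10 as the partition count. How many partitions do we end up with? 5! Even if every partition loads just 1 row, only 5 partitions are needed; the other 5 would do no work, so there is no point in creating them. The partition count parameter is only meaningful when the span between the bounds is greater than or equal to it.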
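The rule just described can be reproduced as a small standalone function. This is a sketch that mirrors only the excerpt shown above, not the full Spark method; the name `effectivePartitions` is made up for illustration.

```scala
object PartitionCount {
  // Returns the requested partition count unless the span between the bounds is
  // smaller, in which case the span itself is used. A negative span is treated as
  // an overflow of the subtraction, and the requested count is kept.
  def effectivePartitions(lowerBound: Long, upperBound: Long, requested: Int): Long = {
    val span = upperBound - lowerBound
    if (span >= requested || span < 0) requested.toLong else span
  }
}
```

Asking for 10 partitions over the bounds 0 and 5 yields 5, while asking for 3 partitions over the bounds 2 and 5 keeps all 3.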

Determining the data range each partition loads

    // Stride: the width of the value range each partition is responsible for
    val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

    var i: Int = 0
    val column = partitioning.column
    var currentValue = lowerBound
    val ans = new ArrayBuffer[Partition]()
    while (i < numPartitions) {
      val lBoundValue = boundValueToString(currentValue)
      val lBound = if (i != 0) s"$column >= $lBoundValue" else null
      currentValue += stride
      val uBoundValue = boundValueToString(currentValue)
      val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else null
      val whereClause =
        if (uBound == null) {
          lBound
        } else if (lBound == null) {
          s"$uBound or $column is null"
        } else {
          s"$lBound AND $uBound"
        }
      ans += JDBCPartition(whereClause, i)
      i = i + 1
    }
    val partitions = ans.toArray

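The bounds and the number of partitions first determine stride, the width of the value range assigned to each partition (this equals the number of rows per partition only when the column values are consecutive integers). Apart from partition 0 and the last partition, which are special, every partition i loads the rows whose partition column falls in its own half-open range from lowerBound + stride * i (inclusive) to lowerBound + stride * (i + 1) (exclusive). Partition 0 loads every row whose partition column is less than lowerBound + stride or is null. The last partition, with partition id i, loads every row whose partition column is greater than or equal to lowerBound + stride * i.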
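To make the generated predicates concrete, here is a standalone sketch of the loop above that returns the WHERE clause of each partition as a string. It assumes a numeric partition column and leaves out Spark's date and timestamp formatting; `whereClauses` is an illustrative name, and the `1 = 1` placeholder for the single-partition case is this sketch's own stand-in, not Spark's behavior.

```scala
object WhereClauses {
  // Generates one WHERE clause per partition, following the excerpt above:
  // the first partition has no lower bound and also takes NULLs,
  // the last partition has no upper bound.
  def whereClauses(
      column: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int): Seq[String] = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    (0 until numPartitions).map { i =>
      val lo = lowerBound + stride * i
      val hi = lo + stride
      val lBound = if (i != 0) Some(s"$column >= $lo") else None
      val uBound = if (i != numPartitions - 1) Some(s"$column < $hi") else None
      (lBound, uBound) match {
        case (Some(l), Some(u)) => s"$l AND $u"
        case (None, Some(u))    => s"$u or $column is null"
        case (Some(l), None)    => l
        case (None, None)       => "1 = 1" // single partition: no filtering at all
      }
    }
  }
}
```

Note that lowerBound and upperBound only shape the stride and the interior boundaries; neither ever appears as a filter on the first or the last partition.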

Verifying the logic with a test

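To verify the partitioning logic analyzed above with a program, we set the partition column to id, the lower bound to 2 (inclusive), the upper bound to 5 (exclusive), and the number of partitions to 3, using the following test program: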

val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties())
        .selectExpr("id","appkey","funnel_name")
    data.show(100, false)

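We expected 3 partitions, each loading 1 row. The partitions that actually appear in the log, however, look like this: partition 0 ignores the lower bound and the last partition ignores the upper bound, which matches the conclusion drawn from the source code above.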

20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` >= 4

Loaded data:

+---+---------------+----------------+
|id |appkey         |funnel_name     |
+---+---------------+----------------+
|0  |donews         |付款漏斗         |
|2  |donews         |提交订单漏斗     |
|3  |donews         |选择漏斗         |
|4  |donews         |222hhh           |
|5  |test_user_value|付款漏斗         |
|6  |test_user_value|提交订单漏斗      |
|7  |yanming        |测试             |
|8  |xingkong       |测试漏斗20161121 |
|9  |xingkong       |网站测试         |
|12 |xingkong       |登录新鲜事漏斗   |
+---+---------------+----------------+

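It loaded all of the data. In other words, the rows loaded by partition 0 and the last partition are not restricted to the range between the bounds. Without analyzing the underlying logic, it is very easy to fall into this trap.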
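If the intent really is to load only the rows between the bounds, one possible workaround is to push the range filter into the table expression itself, since Spark's JDBC source accepts a parenthesized subquery wherever a table name is expected. The sketch below only builds that expression and checks the filter against the ids from the result above; the helper names, the table name `funnel`, and the `_bounded` alias are assumptions made for illustration.

```scala
object BoundedRead {
  // Builds a subquery that keeps only rows with lowerBound <= column < upperBound.
  // Rows whose column is NULL are dropped by this filter as well.
  def boundedTable(table: String, column: String, lowerBound: Long, upperBound: Long): String =
    s"(SELECT * FROM $table WHERE $column >= $lowerBound AND $column < $upperBound) AS ${table}_bounded"

  // The same predicate applied in memory, to show which ids would survive.
  def inRange(ids: Seq[Long], lowerBound: Long, upperBound: Long): Seq[Long] =
    ids.filter(id => id >= lowerBound && id < upperBound)
}
```

The resulting string would replace the plain table name in the call, for example `spark.read.jdbc(url, BoundedRead.boundedTable("funnel", "id", 2, 5), "id", 2, 5, 3, buildProperties())`, so that the partition predicates split an already bounded result.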