Failure Background and Error Log
This post walks through reproducing and resolving a DataFrame OOM failure.

The task at hand: after joining several small tables with one large table, the newly produced table contains many null values, and window functions are used to fill those nulls group by group.
The job aborted partway through with an OOM error.
The key part of the error log is excerpted below:

19/05/16 10:11:39 WARN TaskMemoryManager: leak 32.0 KB memory from org.apache.spark.shuffle.sort.ShuffleExternalSorter@567bac86
19/05/16 10:11:39 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@f338fda in task 4214
19/05/16 10:11:39 ERROR Executor: Exception in task 58.0 in stage 24.0 (TID 4214)
java.lang.OutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:111)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:153)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:120)
	at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextPartition(WindowExec.scala:339)
	at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.next(WindowExec.scala:390)
	at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.next(WindowExec.scala:289)

Log analysis:
The error mainly involves the following three parts:

  1. The MemoryConsumer in the memory-management system could not acquire any more memory, causing ShuffleExternalSorter to leak memory.
  2. UnsafeExternalSorter spilled its sorted data to disk to relieve memory pressure, but the executor still exceeded its memory.
  3. WindowExec exceeded memory while shuffling and sorting.

From the log alone, the immediate cause is that the executor simply did not have enough memory. Next, the actual business code and the source of the relevant operations are examined together to analyse how the error arises.
Failing Code
import sys

from pyspark.sql import Window
from pyspark.sql.functions import last, desc, when


def fill_null(df):
    """
    Forward-fill first, then fill the remaining nulls with the standard Y-class price.
    :param df:
    :return:
    """
    ## Forward-fill the nulls in bulkmask and mileage
    # Define the window frame
    window_before = Window.partitionBy(["adept", "adest", "ptdLabel", "schdate"]) \
        .orderBy('generated_date') \
        .rowsBetween(-sys.maxsize, 0)
    # Define the forward-filled columns
    full_bulkmask = last(df['bulkmask'], ignorenulls=True).over(window_before)
    filled_mileage = last(df['mileage'], ignorenulls=True).over(window_before)
    filled_price_Y = last(df['price_Y'], ignorenulls=True).over(window_before)
    # Fill
    df = df.withColumn('full_bulkmask', full_bulkmask) \
        .withColumn('filled_mileage', filled_mileage) \
        .withColumn('filled_price_Y', filled_price_Y)

    ## Backward-fill the values that are still missing
    # Define the window frame
    window_after = Window.partitionBy(["adept", "adest", "ptdLabel", "schdate"]) \
        .orderBy(desc('generated_date')) \
        .rowsBetween(-sys.maxsize, 0)
    # Define the backward-filled columns
    full_mileage = last(df['filled_mileage'], ignorenulls=True).over(window_after)
    full_price_Y = last(df['filled_price_Y'], ignorenulls=True).over(window_after)
    # Fill
    df = df.withColumn('full_price_Y', full_price_Y) \
        .withColumn('full_mileage', full_mileage)
    # For rows with an early generated_date whose full_bulkmask is still null, fall back to full_price_Y
    df = df.withColumn("full_bulkmask",
                       when(df["full_bulkmask"].isNull(), df["full_price_Y"]).otherwise(df["full_bulkmask"]))

    # Drop the intermediate columns and keep only the final ones
    _columns1 = [c for c in df.columns if
                 c not in {"bulkmask", "mileage", "price_Y", "filled_price_Y", "filled_mileage"}]
    df = df.select(_columns1)
    return df

Locating the failing code: the OOM error is thrown when the custom fill function is invoked and its window computations run.
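For context, here is a minimal sketch of how the function is invoked; the session, input path and output path are hypothetical, and only the fill_null call mirrors the actual job.

# Hypothetical reproduction sketch: the paths and app name are assumptions;
# only the fill_null call mirrors the actual job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fill-null-repro").getOrCreate()

# Result of joining the small tables with the large table (path is hypothetical)
joined_df = spark.read.parquet("/data/joined_fares")

filled_df = fill_null(joined_df)   # the OOM is thrown while the window functions run here
filled_df.write.mode("overwrite").parquet("/data/filled_fares")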

Failure Analysis

Judging from the symptoms, the error is triggered directly by the MemoryConsumer in the executor failing to acquire memory. Below is the code in MemoryConsumer.java that throws the OOM:

private void throwOom(final MemoryBlock page, final long required) {
    long got = 0;
    if (page != null) {
      got = page.size();
      taskMemoryManager.freePage(page, this);
    }
    taskMemoryManager.showMemoryUsage();
    // checkstyle.off: RegexpSinglelineJava
    throw new SparkOutOfMemoryError("Unable to acquire " + required + " bytes of memory, got " +
      got);
    // checkstyle.on: RegexpSinglelineJava
  }

The smallest unit in which ShuffleExternalSorter stores data is the MemoryBlock, and each MemoryBlock is exactly the 32 KB that TaskMemoryManager reports as leaked; the leak warning is really a signal that no more memory can be squeezed out of the executor it manages. Under unified memory management (for Spark memory management, see my earlier blog post), when execution or storage memory grows too large, the sorted data is spilled to disk (see UnsafeExternalSorter.java) and the corresponding memory is freed (see the freeMemory function). This is what lets a limited amount of memory process a much larger volume of data. However, allocation and release of on-heap objects is handled by the JVM, while Spark only estimates memory usage by sampling, so with a large data volume and inaccurate sampling the spill may not happen in time, leading to OOM. A spill that comes too late, or frees too little, is therefore one suspected cause of the memory overrun.

Approaching it from the angle of giving the executor more memory, increasing the spark.memory.fraction setting (the default 0.6 is usually best; raising it increases GC pressure and can cause GC errors) enlarges the memory the executor can effectively use to some extent, but it made no difference to the outcome in this case. The remaining inference is that it is the sort performed inside WindowExec while executing the window functions that leaves the executor far short of memory, so the window function source code is analysed below.
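Before that, for reference, a hedged sketch of how these memory settings would be supplied; the values are illustrative, not the job's actual configuration.

# Illustrative memory settings only; they must be supplied before the SparkContext is created.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("window-fill")
         .config("spark.memory.fraction", "0.6")         # default; raising it increases GC pressure
         .config("spark.memory.storageFraction", "0.5")  # storage share inside the unified execution/storage pool
         .getOrCreate())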
UnsafeExternalSorter.java

/**
   * Sort and spill the current records in response to memory pressure.
   */
@Override
public long spill(long size, MemoryConsumer trigger) throws IOException {
    if (trigger != this) {
      if (readingIterator != null) {
        return readingIterator.spill();
      }
      return 0L; // this should throw exception
    }

    if (inMemSorter == null || inMemSorter.numRecords() <= 0) {
      return 0L;
    }

    logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
      Thread.currentThread().getId(),
      Utils.bytesToString(getMemoryUsage()),
      spillWriters.size(),
      spillWriters.size() > 1 ? " times" : " time");

    ShuffleWriteMetrics writeMetrics = new ShuffleWriteMetrics();

    final UnsafeSorterSpillWriter spillWriter =
      new UnsafeSorterSpillWriter(blockManager, fileBufferSizeBytes, writeMetrics,
        inMemSorter.numRecords());
    spillWriters.add(spillWriter);
    spillIterator(inMemSorter.getSortedIterator(), spillWriter);

    final long spillSize = freeMemory();
    // Note that this is more-or-less going to be a multiple of the page size, so wasted space in
    // pages will currently be counted as memory spilled even though that space isn't actually
    // written to disk. This also counts the space needed to store the sorter's pointer array.
    inMemSorter.reset();
    // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
    // records. Otherwise, if the task is over allocated memory, then without freeing the memory
    // pages, we might not be able to get memory for the pointer array.

    taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
    taskContext.taskMetrics().incDiskBytesSpilled(writeMetrics.bytesWritten());
    totalSpillBytes += spillSize;
    return spillSize;
  }

Freeing the memory of the corresponding blocks:

/**
   * Free this sorter's data pages.
   *
   * @return the number of bytes freed.
   */
  private long freeMemory() {
    updatePeakMemoryUsed();
    long memoryFreed = 0;
    for (MemoryBlock block : allocatedPages) {
      memoryFreed += block.size();
      freePage(block);
    }
    allocatedPages.clear();
    currentPage = null;
    pageCursor = 0;
    return memoryFreed;
  }
The Window Class in Spark

Before diagnosing and fixing the fault, let's first dissect the Window class; Window.scala defines how a Window is constructed.
Window.scala

object Window {
/**
   * Creates a [[WindowSpec]] with the partitioning defined.
   * @since 1.4.0
   */
  @scala.annotation.varargs
  def partitionBy(cols: Column*): WindowSpec = {
    spec.partitionBy(cols : _*)
  }
/**
   * Creates a [[WindowSpec]] with the ordering defined.
   * @since 1.4.0
   */
  @scala.annotation.varargs
  def orderBy(cols: Column*): WindowSpec = {
    spec.orderBy(cols : _*)
  }

   /* @param start boundary start, inclusive. The frame is unbounded if this is
   *              the minimum long value (`Window.unboundedPreceding`).
   * @param end boundary end, inclusive. The frame is unbounded if this is the
   *            maximum long value (`Window.unboundedFollowing`).
   * @since 2.1.0
   */
   // Note: when updating the doc for this method, also update WindowSpec.rowsBetween.
  def rowsBetween(start: Long, end: Long): WindowSpec = {
    spec.rowsBetween(start, end)
  }
  
   /* @param start boundary start, inclusive. The frame is unbounded if this is
   *              the minimum long value (`Window.unboundedPreceding`).
   * @param end boundary end, inclusive. The frame is unbounded if this is the
   *            maximum long value (`Window.unboundedFollowing`).
   * @since 2.1.0
   */
  // Note: when updating the doc for this method, also update WindowSpec.rangeBetween.
  def rangeBetween(start: Long, end: Long): WindowSpec = {
    spec.rangeBetween(start, end)
  }

As the source shows, the Window class initially provided the partitionBy and orderBy methods, and version 2.1 added rowsBetween and rangeBetween. Although these constructors are simple and easy to follow, they are what allow many practical window functions to be defined over DataFrames.
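In PySpark the same constructors are used to build a WindowSpec; the small sketch below assumes the partition and order columns from fill_null.

# Building a WindowSpec with the constructors above; column names follow fill_null.
from pyspark.sql import Window

w = (Window.partitionBy("adept", "adest", "ptdLabel", "schdate")
     .orderBy("generated_date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# Window.unboundedPreceding / Window.currentRow are the boundary constants added in 2.1;
# the -sys.maxsize lower bound used in fill_null is effectively treated as unbounded as well.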

/**
 * A window function calculates the results of a number of window functions for a window frame.
 * Before use a frame must be prepared by passing it all the rows in the current partition. After
 * preparation the update method can be called to fill the output rows.
 */
abstract class WindowFunctionFrame {
  /**
   * Prepare the frame for calculating the results for a partition.
   *
   * @param rows to calculate the frame results for.
   */
  def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit

  /**
   * Write the current results to the target row.
   */
  def write(index: Int, current: InternalRow): Unit
  }

A window function computes over a window frame as its basic unit: once the frame has been prepared with all the rows of the current partition, the results can be appended as new columns to the right of the original rows. Be aware that window is a fairly expensive operation: all rows of a group must be placed into the same sorted partition, although different frames can be processed in parallel. See doExecute below for details.

/**
 * This class calculates and outputs (windowed) aggregates over the rows in a single (sorted)
 * partition. The aggregates are calculated for each row in the group. Special processing
 * instructions, frames, are used to calculate these aggregates. Frames are processed in the order
 * specified in the window specification (the ORDER BY ... clause). 
 * This is quite an expensive operator because every row for a single group must be in the same
 * partition and partitions must be sorted according to the grouping and sort order. The operator
 * requires the planner to take care of the partitioning and sorting.
 */
 protected override def doExecute(): RDD[InternalRow] = {
    // Unwrap the expressions and factories from the map.
    val expressions = windowFrameExpressionFactoryPairs.flatMap(_._1)
    val factories = windowFrameExpressionFactoryPairs.map(_._2).toArray
    val inMemoryThreshold = sqlContext.conf.windowExecBufferInMemoryThreshold
    val spillThreshold = sqlContext.conf.windowExecBufferSpillThreshold

    // Start processing.
    child.execute().mapPartitions { stream =>
      new Iterator[InternalRow] {

        // Get all relevant projections.
        val result = createResultProjection(expressions)
        val grouping = UnsafeProjection.create(partitionSpec, child.output)
        ...
        ...
        fetchNextRow()
        
        // Manage the current partition.
        val buffer: ExternalAppendOnlyUnsafeRowArray =
          new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
        var bufferIterator: Iterator[UnsafeRow] = _
        val windowFunctionResult = new SpecificInternalRow(expressions.map(_.dataType))
        val frames = factories.map(_(windowFunctionResult))
        val numFrames = frames.length
        private[this] def fetchNextPartition() {
          // Collect all the rows in the current partition.
          // Before we start to fetch new input rows, make a copy of nextGroup.
          val currentGroup = nextGroup.copy()

          // clear last partition
          buffer.clear()
          while (nextRowAvailable && nextGroup == currentGroup) {
            buffer.add(nextRow)
            fetchNextRow()
          }
		...
		...
        // 'Merge' the input row with the window function result
        join(current, windowFunctionResult)
        rowIndex += 1

        // Return the projection.
        result(join)
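The shuffle and sort that doExecute relies on can also be observed in a job's own physical plan; a hedged sketch follows (joined_df is the hypothetical input from the reproduction sketch above, and operator names vary slightly across Spark versions).

# Inspect the physical plan of the filled DataFrame; with this window spec one should see
# a Window operator sitting on top of Sort and Exchange (shuffle) nodes for the
# partitionBy / orderBy keys.
filled_df = fill_null(joined_df)
filled_df.explain(True)   # prints the parsed, analyzed, optimized and physical plans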

From this analysis of the window function a preliminary conclusion can be drawn: sorting the rows of each grouped frame requires repartitioning so that all rows of a group, whether they start on the same node or on different ones, land in the same partition. That operation inevitably involves a shuffle, and shuffles are expensive in memory (nine out of ten executor OOM errors trace back to a shuffle, and shuffle is also the main performance bottleneck in Spark jobs). Under the old static memory management mode, raising spark.shuffle.memoryFraction would lift the cap on the memory execution can actually use, but the Spark 2.0+ versions in use here default to unified (dynamic) memory management, where execution memory can in theory grow to the whole executor memory allocated to the JVM, so tuning that parameter is pointless. Based on this analysis, the following solutions are available.

Solutions

Option 1: The simplest approach, provided the cluster has resources to spare, is to raise the executor memory appropriately; enough headroom prevents the executor from running out of memory. Increasing the number of partitions also helps guard against memory errors caused by data skew.
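A sketch of option 1; the numbers are illustrative, not tuned recommendations.

# Option 1 sketch: more executor memory and more, smaller partitions (illustrative values).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("window-fill")
         .config("spark.executor.memory", "12g")         # larger executor heap
         .config("spark.sql.shuffle.partitions", "800")  # more shuffle partitions than the default 200
         .getOrCreate())

# Repartitioning by the grouping keys before the window step spreads the groups out
# and reduces the chance that one skewed partition blows the memory budget.
joined_df = joined_df.repartition(800, "adept", "adest", "ptdLabel", "schdate")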

Option 2: In this case the point of implementing the fill with window functions was to keep the original columns alongside the filled values, and doExecute does guarantee that. However, the data entering fill_null already has as many as 23 columns, while only 9 of them actually take part in the sort and fill; every column is dragged into the default frame and through the sort, and the extra columns add a real computational burden. To get the job done, split the data into a "columns involved in the fill" part and an "extra columns" part before calling fill_null, and merge the two parts back together once the fill is complete. In practice, with the same configuration, reducing the columns involved resolved the executor OOM.
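A sketch of option 2; the row_id key, the monotonically_increasing_id trick and the exact fill_cols list are assumptions for illustration rather than the original job's code.

# Option 2 sketch: push only the columns the fill needs through the window computation
# and join the rest back afterwards.
from pyspark.sql.functions import monotonically_increasing_id

fill_cols = ["adept", "adest", "ptdLabel", "schdate", "generated_date",
             "bulkmask", "mileage", "price_Y"]   # columns referenced inside fill_null (assumed set)

# cache so both branches see the same generated row ids
df = joined_df.withColumn("row_id", monotonically_increasing_id()).cache()

narrow_df = df.select(["row_id"] + fill_cols)                    # the part that is window-filled
extra_cols = [c for c in df.columns if c not in set(fill_cols)]  # the remaining columns, incl. row_id
extra_df = df.select(extra_cols)

filled_narrow = fill_null(narrow_df)                             # far fewer columns enter the sort
result = filled_narrow.join(extra_df, on="row_id", how="inner").drop("row_id")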

Option 3: From version 2.2 onward, the spark.maxRemoteBlockSizeFetchToMem parameter can be lowered so that shuffle blocks above the threshold are fetched to disk rather than into executor memory, trading more disk traffic for less memory pressure.
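A sketch of option 3; the 200m value is illustrative, and since this is a core (non-SQL) setting it has to be supplied when the SparkContext is created, for example via spark-submit --conf.

# Option 3 sketch: shuffle blocks larger than the threshold are fetched to disk instead of
# being held in executor memory (the value is illustrative only).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.maxRemoteBlockSizeFetchToMem", "200m")
         .getOrCreate())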

Conclusion

When a Spark job works on a large amount of data, window functions call for careful handling of how the data is partitioned and sorted; where the task allows, consider alternative operators that can do the same job.