Failure background and error log
Sharing the reproduction and resolution of a DataFrame failure.
The task at hand: after joining several small tables with a large table, the resulting table contains many null values, and window functions are used to fill those nulls group by group.
The job aborted partway through, throwing an OOM error.
The key part of the error log is excerpted below:
19/05/16 10:11:39 WARN TaskMemoryManager: leak 32.0 KB memory from org.apache.spark.shuffle.sort.ShuffleExternalSorter@567bac86
19/05/16 10:11:39 WARN TaskMemoryManager: leak a page: org.apache.spark.unsafe.memory.MemoryBlock@f338fda in task 4214
19/05/16 10:11:39 ERROR Executor: Exception in task 58.0 in stage 24.0 (TID 4214)
java.lang.OutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:111)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:153)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:120)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextPartition(WindowExec.scala:339)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.next(WindowExec.scala:390)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.next(WindowExec.scala:289)
Log analysis:
The error mainly involves three parts:
- The MemoryConsumer in the memory management system could not acquire any more memory, causing ShuffleExternalSorter to leak memory
- UnsafeExternalSorter spilled its sorted data to disk to relieve memory pressure, but this still did not keep the executor within its memory limit
- WindowExec ran out of memory while shuffling and sorting
From these log entries the immediate cause can be judged preliminarily: the executor ran out of memory. Next, the actual business code and the source code behind some of the operations are examined together to analyze why the error occurred.
Faulty code
import sys

from pyspark.sql import Window
from pyspark.sql.functions import last, desc, when


def fill_null(df):
    """
    Forward-fill first, then fill the remaining nulls with the standard Y-class (economy) price.
    :param df: input DataFrame
    :return: DataFrame with the filled columns
    """
    ## Forward-fill the nulls in bulkmask, mileage and price_Y
    # Define the window
    window_before = Window.partitionBy(["adept", "adest", "ptdLabel", "schdate"]) \
        .orderBy('generated_date') \
        .rowsBetween(-sys.maxsize, 0)
    # Define the forward-filled columns
    full_bulkmask = last(df['bulkmask'], ignorenulls=True).over(window_before)
    filled_mileage = last(df['mileage'], ignorenulls=True).over(window_before)
    filled_price_Y = last(df['price_Y'], ignorenulls=True).over(window_before)
    # Fill
    df = df.withColumn('full_bulkmask', full_bulkmask) \
        .withColumn('filled_mileage', filled_mileage) \
        .withColumn('filled_price_Y', filled_price_Y)
    ## Backward-fill the values that are still missing
    # Define the window
    window_after = Window.partitionBy(["adept", "adest", "ptdLabel", "schdate"]) \
        .orderBy(desc('generated_date')) \
        .rowsBetween(-sys.maxsize, 0)
    # Define the backward-filled columns
    full_mileage = last(df['filled_mileage'], ignorenulls=True).over(window_after)
    full_price_Y = last(df['filled_price_Y'], ignorenulls=True).over(window_after)
    # Fill
    df = df.withColumn('full_price_Y', full_price_Y) \
        .withColumn('full_mileage', full_mileage)
    # Rows whose full_bulkmask is still null (early generated_date) are filled with full_price_Y
    df = df.withColumn("full_bulkmask",
                       when(df["full_bulkmask"].isNull(), df["full_price_Y"]).otherwise(df["full_bulkmask"]))
    # Drop the original and intermediate columns
    _columns1 = [c for c in df.columns if
                 c not in {"bulkmask", "mileage", "price_Y", "filled_price_Y", "filled_mileage"}]
    df = df.select(_columns1)
    return df
Locating the faulty code: the OOM error is thrown when this custom fill function is called and the window functions are evaluated.
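For reference, a minimal sketch of how the function is invoked; joined_df stands for the wide table produced by joining the small tables with the large table, and the application name and file paths are assumptions rather than the original driver code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fill_null_repro").getOrCreate()  # assumed app name

# Assumed input: the wide table produced by the joins described above.
joined_df = spark.read.parquet("/path/to/joined_table")   # hypothetical path
filled_df = fill_null(joined_df)                          # the OOM is raised while this is computed
filled_df.write.mode("overwrite").parquet("/path/to/filled_table")  # hypothetical path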
Failure analysis
Judging from the symptoms, the error is triggered directly by the MemoryConsumer in the executor failing to acquire memory. Below is the source in MemoryConsumer.java that throws the OOM:
private void throwOom(final MemoryBlock page, final long required) {
long got = 0;
if (page != null) {
got = page.size();
taskMemoryManager.freePage(page, this);
}
taskMemoryManager.showMemoryUsage();
// checkstyle.off: RegexpSinglelineJava
throw new SparkOutOfMemoryError("Unable to acquire " + required + " bytes of memory, got " +
got);
// checkstyle.on: RegexpSinglelineJava
}
The smallest unit in which ShuffleExternalSorter stores data is a MemoryBlock, and each MemoryBlock is exactly the 32 KB that TaskMemoryManager reports as leaked; the leaked MemoryBlock is, in effect, feedback that no more memory can be squeezed out of the executor it manages. Under the unified memory management model (for Spark memory management, see my earlier blog post), when execution or storage memory in the executor grows too large, the sorted data is spilled to disk (see UnsafeExternalSorter.java for details) and that memory is freed (see the freeMemory function). This is how a limited amount of memory can process a much larger volume of data. However, the allocation and release of on-heap objects is managed by the JVM, and Spark only estimates the memory already in use by sampling, so with a large data volume and inaccurate sampling the spill may not happen in time, and OOM follows. Spilling too late, or spilling too little, is therefore one suspected cause of exceeding executor memory.
Working from the direction of giving the executor more memory, raising the spark.memory.fraction parameter (the default of 0.6 is usually best left alone; increasing it raises garbage-collection pressure and can lead to GC errors) can, to some extent, enlarge the memory the executor can effectively use, but it made no difference to the outcome in this case. The final inference is that it is the sort performed by the window function inside WindowExec that leaves the executor far short of memory, so the window function source code is analyzed next.
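For completeness, a minimal sketch of how such a change would be applied when building the session; the 0.7 value and the application name are assumptions, and spark.memory.fraction must be set before the SparkContext starts (again, in this case the change did not help).

from pyspark.sql import SparkSession

# Assumed values for illustration only; raising the fraction did not resolve this OOM.
spark = SparkSession.builder \
    .appName("fill_null_job") \
    .config("spark.memory.fraction", "0.7") \
    .getOrCreate()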
UnsafeExternalSorter.java
/**
* Sort and spill the current records in response to memory pressure.
*/
@Override
public long spill(long size, MemoryConsumer trigger) throws IOException {
if (trigger != this) {
if (readingIterator != null) {
return readingIterator.spill();
}
return 0L; // this should throw exception
}
if (inMemSorter == null || inMemSorter.numRecords() <= 0) {
return 0L;
}
logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
Thread.currentThread().getId(),
Utils.bytesToString(getMemoryUsage()),
spillWriters.size(),
spillWriters.size() > 1 ? " times" : " time");
ShuffleWriteMetrics writeMetrics = new ShuffleWriteMetrics();
final UnsafeSorterSpillWriter spillWriter =
new UnsafeSorterSpillWriter(blockManager, fileBufferSizeBytes, writeMetrics,
inMemSorter.numRecords());
spillWriters.add(spillWriter);
spillIterator(inMemSorter.getSortedIterator(), spillWriter);
final long spillSize = freeMemory();
// Note that this is more-or-less going to be a multiple of the page size, so wasted space in
// pages will currently be counted as memory spilled even though that space isn't actually
// written to disk. This also counts the space needed to store the sorter's pointer array.
inMemSorter.reset();
// Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
// records. Otherwise, if the task is over allocated memory, then without freeing the memory
// pages, we might not be able to get memory for the pointer array.
taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
taskContext.taskMetrics().incDiskBytesSpilled(writeMetrics.bytesWritten());
totalSpillBytes += spillSize;
return spillSize;
}
Freeing the memory of the corresponding blocks:
/**
* Free this sorter's data pages.
*
* @return the number of bytes freed.
*/
private long freeMemory() {
updatePeakMemoryUsed();
long memoryFreed = 0;
for (MemoryBlock block : allocatedPages) {
memoryFreed += block.size();
freePage(block);
}
allocatedPages.clear();
currentPage = null;
pageCursor = 0;
return memoryFreed;
}
The Window class in Spark
Before locating and correcting the fault, let's first dissect the Window class; Window.scala defines how a Window is constructed.
Window.scala
object Window {
/**
* Creates a [[WindowSpec]] with the partitioning defined.
* @since 1.4.0
*/
@scala.annotation.varargs
def partitionBy(cols: Column*): WindowSpec = {
spec.partitionBy(cols : _*)
}
/**
* Creates a [[WindowSpec]] with the ordering defined.
* @since 1.4.0
*/
@scala.annotation.varargs
def orderBy(cols: Column*): WindowSpec = {
spec.orderBy(cols : _*)
}
/* @param start boundary start, inclusive. The frame is unbounded if this is
* the minimum long value (`Window.unboundedPreceding`).
* @param end boundary end, inclusive. The frame is unbounded if this is the
* maximum long value (`Window.unboundedFollowing`).
* @since 2.1.0
*/
// Note: when updating the doc for this method, also update WindowSpec.rowsBetween.
def rowsBetween(start: Long, end: Long): WindowSpec = {
spec.rowsBetween(start, end)
}
/* @param start boundary start, inclusive. The frame is unbounded if this is
* the minimum long value (`Window.unboundedPreceding`).
* @param end boundary end, inclusive. The frame is unbounded if this is the
* maximum long value (`Window.unboundedFollowing`).
* @since 2.1.0
*/
// Note: when updating the doc for this method, also update WindowSpec.rangeBetween.
def rangeBetween(start: Long, end: Long): WindowSpec = {
spec.rangeBetween(start, end)
}
As the source shows, the class implemented the partitionBy and orderBy methods early on, and version 2.1 added the rowsBetween and rangeBetween methods. Although these constructors are simple and easy to follow, they are what let you define a great variety of practical window functions over DataFrames.
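As a quick illustration, a sketch of chaining these methods into a WindowSpec from PySpark, mirroring the partition keys used in fill_null; since 2.1 the Window.unboundedPreceding and Window.currentRow constants can be used in place of -sys.maxsize and 0.

from pyspark.sql import Window

# Same shape as the windows built inside fill_null, written with the 2.1+ boundary constants.
w = Window.partitionBy("adept", "adest", "ptdLabel", "schdate") \
    .orderBy("generated_date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)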
/**
* A window function calculates the results of a number of window functions for a window frame.
* Before use a frame must be prepared by passing it all the rows in the current partition. After
* preparation the update method can be called to fill the output rows.
*/
abstract class WindowFunctionFrame {
/**
* Prepare the frame for calculating the results for a partition.
*
* @param rows to calculate the frame results for.
*/
def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit
/**
* Write the current results to the target row.
*/
def write(index: Int, current: InternalRow): Unit
}
A window function computes over a window frame as its basic unit: once all the rows of the current partition have been fed to the frame, the results are appended as a new column to the right of the original rows. Be warned that window is a fairly expensive operation, because every row of a group must be placed into the same sorted partition; different frames, however, can be processed in parallel. See doExecute below for details.
/**
* This class calculates and outputs (windowed) aggregates over the rows in a single (sorted)
* partition. The aggregates are calculated for each row in the group. Special processing
* instructions, frames, are used to calculate these aggregates. Frames are processed in the order
* specified in the window specification (the ORDER BY ... clause).
* This is quite an expensive operator because every row for a single group must be in the same
* partition and partitions must be sorted according to the grouping and sort order. The operator
* requires the planner to take care of the partitioning and sorting.
*/
protected override def doExecute(): RDD[InternalRow] = {
// Unwrap the expressions and factories from the map.
val expressions = windowFrameExpressionFactoryPairs.flatMap(_._1)
val factories = windowFrameExpressionFactoryPairs.map(_._2).toArray
val inMemoryThreshold = sqlContext.conf.windowExecBufferInMemoryThreshold
val spillThreshold = sqlContext.conf.windowExecBufferSpillThreshold
// Start processing.
child.execute().mapPartitions { stream =>
new Iterator[InternalRow] {
// Get all relevant projections.
val result = createResultProjection(expressions)
val grouping = UnsafeProjection.create(partitionSpec, child.output)
...
...
fetchNextRow()
// Manage the current partition.
val buffer: ExternalAppendOnlyUnsafeRowArray =
new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
var bufferIterator: Iterator[UnsafeRow] = _
val windowFunctionResult = new SpecificInternalRow(expressions.map(_.dataType))
val frames = factories.map(_(windowFunctionResult))
val numFrames = frames.length
private[this] def fetchNextPartition() {
// Collect all the rows in the current partition.
// Before we start to fetch new input rows, make a copy of nextGroup.
val currentGroup = nextGroup.copy()
// clear last partition
buffer.clear()
while (nextRowAvailable && nextGroup == currentGroup) {
buffer.add(nextRow)
fetchNextRow()
}
...
...
// 'Merge' the input row with the window function result
join(current, windowFunctionResult)
rowIndex += 1
// Return the projection.
result(join)
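The shuffle and sort that doExecute relies on can also be observed from the PySpark side by printing the physical plan. A sketch, reusing joined_df from the earlier snippet; in a typical plan you should see Exchange hashpartitioning and Sort operators feeding the Window operator.

from pyspark.sql import Window
from pyspark.sql.functions import last

w = Window.partitionBy("adept", "adest", "ptdLabel", "schdate") \
    .orderBy("generated_date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Printing the physical plan: the Exchange (shuffle) and Sort feeding the Window operator
# are the expensive steps discussed in this post.
joined_df.withColumn("full_bulkmask",
                     last("bulkmask", ignorenulls=True).over(w)).explain()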
Based on this analysis of the window function, a preliminary conclusion can be drawn: to sort the data of each group's frame, rows belonging to the same group, whether on the same node or on different nodes, must be repartitioned together. This operation inevitably involves a shuffle, and shuffle is memory-hungry (in my experience nine out of ten executor OOM errors are caused by shuffle, and shuffle is also the main performance bottleneck in Spark computation). Under the static memory management model, raising spark.shuffle.memoryFraction increases the upper bound of the memory execution can actually use, but the Spark 2.0+ versions used here default to unified (dynamic) memory management, in which execution memory can in theory grow to the full executor memory allocated by the JVM, so tuning that parameter is pointless. Based on the above analysis, the following solutions are available.
Solutions
Solution 1: The simplest approach, provided the cluster has resources to spare, is to raise the executor memory appropriately; enough memory headroom prevents the executor from running out. In addition, increasing the number of partitions helps prevent memory errors caused by data skew. A sketch of both follows.
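The configuration below is a minimal sketch; the memory size and partition counts are assumptions that need to be tuned to the actual cluster.

from pyspark.sql import SparkSession

# Assumed values for illustration; tune to the cluster's real capacity.
spark = SparkSession.builder \
    .appName("fill_null_job") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.shuffle.partitions", "800") \
    .getOrCreate()

# Raising the partition count before the window computation can also mitigate
# skew-driven memory errors (the count here is an assumption).
joined_df = joined_df.repartition(800)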
Solution 2: In this case, the point of implementing the fill with window functions was to keep the original column information while filling, and doExecute does guarantee that. However, the data entering fill_null already has as many as 23 columns, while only 9 of them actually take part in the sort; every one of these columns goes through the default frame and its sort, and the extra columns inevitably add to the computational load. From the standpoint of simply getting the job done, split the data into a "columns involved in filling" part and an "extra columns" part before calling fill_null, and merge the two parts back together once the fill is complete, as sketched below. In practice, under the same configuration, this column-reducing approach eliminated the executor OOM.
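A sketch of this approach; the column list is inferred from fill_null above, and a surrogate row id (monotonically_increasing_id) is assumed so the two halves can be joined back reliably even when the partition keys do not uniquely identify a row.

from pyspark.sql.functions import monotonically_increasing_id

# Columns that actually take part in the fill: partition keys, the order key and the fill targets.
fill_cols = ["adept", "adest", "ptdLabel", "schdate", "generated_date",
             "bulkmask", "mileage", "price_Y"]

# Tag every row so the two halves can be reunited after filling.
df = joined_df.withColumn("row_id", monotonically_increasing_id())

# Only the columns needed by the window functions go through fill_null ...
to_fill = df.select(["row_id"] + fill_cols)
filled = fill_null(to_fill)

# ... while the extra columns sit out of the expensive shuffle/sort entirely.
extras = df.select([c for c in df.columns if c not in fill_cols])  # keeps row_id

# Join the filled part back onto the untouched columns.
result = extras.join(filled, on="row_id", how="inner").drop("row_id")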
Solution 3: From version 2.2 onward, the spark.maxRemoteBlockSizeFetchToMem parameter can be tuned; lowering this threshold makes remote shuffle blocks larger than it get fetched to disk rather than into memory, which reduces memory pressure.
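A sketch of setting it when building the session; the 200m threshold is an assumption, not a recommended value.

from pyspark.sql import SparkSession

# Remote shuffle blocks larger than this threshold are fetched to disk instead of memory
# (the value here is an assumption).
spark = SparkSession.builder \
    .appName("fill_null_job") \
    .config("spark.maxRemoteBlockSizeFetchToMem", "200m") \
    .getOrCreate()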
Conclusion
When a Spark job works on a fairly large volume of data, window functions demand careful handling of how the data is partitioned and how it is sorted; as long as the task can still be accomplished, consider alternative operators that could replace them.