spark 调整storage memory spark.memory.uselegacymode

转载

mob6454cc6441b6 2024-01-08 15:00:07

文章标签 spark Memory 内存管理 文章分类 Spark 大数据

spark堆内内存模型分为两种：静态内存管理StaticMemoryManager和统一内存管理UnifiedMemoryManager。

从1.6.0版本开始，Spark内存管理模型发生了变化。旧的内存管理模型由StaticMemoryManager类实现，现在称为“legacy（遗留）”。默认情况下，“Legacy”模式被禁用，这意味着在Spark 1.5.x和1.6.0上运行相同的代码会导致不同的行为。为了兼容，您可以使用spark.memory.useLegacyMode参数启用“旧”内存模型。

注 : 本文源码代码块版本为2.3.1。

源码体现在：SparkEvn.scala

val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
val memoryManager: MemoryManager =
  if (useLegacyMemoryManager) {
    new StaticMemoryManager(conf, numUsableCores)
  } else {
    UnifiedMemoryManager(conf, numUsableCores)
  }

SparkConf.scala

// Warn against deprecated memory fractions (unless legacy memory management mode is enabled)
val legacyMemoryManagementKey = "spark.memory.useLegacyMode"
val legacyMemoryManagement = getBoolean(legacyMemoryManagementKey, false)
if (!legacyMemoryManagement) {
  val keyset = deprecatedMemoryKeys.toSet
  val detected = settings.keys().asScala.filter(keyset.contains)
  if (detected.nonEmpty) {
    logWarning("Detected deprecated memory fraction settings: " +
      detected.mkString("[", ", ", "]") + ". As of Spark 1.6, execution and storage " +
      "memory management are unified. All memory fractions used in the old model are " +
      "now deprecated and no longer read. If you wish to use the old memory management, " +
      s"you may explicitly enable `$legacyMemoryManagementKey` (not recommended).")
  }
}

1.6.0及以后版本,使用的统一内存管理器，由UnifiedMemoryManager实现。

spark 调整storage memory spark.memory.uselegacymode_Memory

1、Reserved Memory

系统保留内存。从Spark 1.6.0起，它的值默认为300MB（被硬编码到spark代码中,代码private val RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024）），这意味着这个300MB的RAM不参与Spark内存区域大小的计算。它的值可以通过参数spark.testing.reservedMemory来调整，但不推荐使用，从参数名字可以看到这个参数仅供测试使用，不建议在生产环境中使用。虽然是保留内存，但并不意味这这部分内存没有使用，实际上，它会存储大量的Spark内部对象。而且，executor在执行时，必须保证至少有1.5 *保留内存= 450MB大小的堆(systemMemory)，否则程序将失败。

2、User Memory

这是在Spark Memory分配后仍然保留的内存，使用方式完全取决于你。你可以存储RDD转换中用到的数据结构。例如，您可以在mapPartitions算子时，维护一个HashMap(这将消耗所谓的用户内存)。在Spark 1.6.0之后，此内存池的大小可以计算为（systemMemory - reservedMemory）*（1.0 - spark.memory.fraction），默认值等于（“Java Heap” - 300MB ）* 0.4。例如，使用4GB堆，您将拥有1518.4MB的用户内存。在这部分内存中，将哪些东西存储在内存中以及怎么存，完全取决你。Spark也完全不会关心你在那里做什么，以及是否保证容量不超限。如果超限，可能会导致程序OOM错误。

3、Spark Memory

这是由Spark管理的内存池。它的大小可以计算为（systemMemory - reservedMemory）* spark.memory.fraction，并使用Spark 1.6.0之后默认值给我们（“executor-memory” - 300MB）* 0.6。例如，使用4GB executor-memory，这个池的大小将是2277.6MB。这个整个池分为2个区域 -Storage Memory(存储内存) 和Execution Memory(执行内存) ，它们之间的边界由spark.memory.storageFraction参数设置，默认为0.5。UnifiedMemoryManager内存管理方案的优点是，该边界不是静态的。在某部分内存有压力的情况下，边界将被移动，即一个区域将通过从另一个借用空间而增长。

3.1 Storage Memory

这一块内存用作Spark缓存数据(cache,persist)和序列化数据”unroll”临时空间。另外所有的”broadcast”广播变量都以缓存块存在这里。其实并不需要内存中有足够的空间来存unrolled块- 如果没有足够的内存放下所有的unrolled分区，如果设置的持久化level允许的话，Spark将会把它直接放进磁盘。所有的broadcast变量默认用MEMORY_AND_DISK持久化level缓存。

3.2 Execution Memory

这一块内存是用来存储Spark task执行需要的对象。比如用来存储Map阶段的shuffle临时buffer，此外还用来存储hash合并用到的hash table。当没有足够可用内存时，这块内存同样可以溢写到磁盘，但是这块内存不能被其他线程（tasks）强行剥夺（该内存相对于Storage Memory更重要。Storage Memory内存如果被剥夺，可以通过重算来获得数据。而 Execution Memory一旦剥夺，可能会引起程序失败）。

源码体现：UnifiedMemoryManager.scala

object UnifiedMemoryManager {

  // Set aside a fixed amount of memory for non-storage, non-execution purposes.
  // This serves a function similar to `spark.memory.fraction`, but guarantees that we reserve
  // sufficient memory for the system even for small heaps. E.g. if we have a 1GB JVM, then
  // the memory used for execution and storage will be (1024 - 300) * 0.6 = 434MB by default.
  private val RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024

  def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
    val maxMemory = getMaxMemory(conf)
    new UnifiedMemoryManager(
      conf,
      maxHeapMemory = maxMemory,
      onHeapStorageRegionSize =
        (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
      numCores = numCores)
  }

  /**
   * Return the total amount of memory shared between execution and storage, in bytes.
   */
  private def getMaxMemory(conf: SparkConf): Long = {
    val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
    val reservedMemory = conf.getLong("spark.testing.reservedMemory",
      if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
    val minSystemMemory = (reservedMemory * 1.5).ceil.toLong
    if (systemMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"System memory $systemMemory must " +
        s"be at least $minSystemMemory. Please increase heap size using the --driver-memory " +
        s"option or spark.driver.memory in Spark configuration.")
    }
    // SPARK-12759 Check executor memory to fail fast if memory is insufficient
    if (conf.contains("spark.executor.memory")) {
      val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
      if (executorMemory < minSystemMemory) {
        throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +
          s"$minSystemMemory. Please increase executor memory using the " +
          s"--executor-memory option or spark.executor.memory in Spark configuration.")
      }
    }
    val usableMemory = systemMemory - reservedMemory
    val memoryFraction = conf.getDouble("spark.memory.fraction", 0.6)
    (usableMemory * memoryFraction).toLong
  }
}

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。