The Spark Scheduling Process: How the Master Schedules Resources

Reposted

mob6454cc6658d1 2023-08-29 16:39:23


When a Driver registers an Application with the Master, the Master calls schedule() once the registration is complete in order to schedule resources. The source of schedule() is analyzed below:

private def schedule(): Unit = {
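    // If the master is not in the ALIVE state, return immediately; a standby master never schedules resources
    if (state != RecoveryState.ALIVE) { return }
    // Drivers take strict precedence over executors
    // Random.shuffle walks the whole ArrayBuffer from back to front, swapping each position
    // with a randomly chosen one, so the elements of the collection come out in random order
    val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers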

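    // Iterate over all previously registered workers, keeping only those in the ALIVE state
    for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
      // Schedule drivers first.
      // Why schedule drivers at all, and when is a driver registered with the master?
      // Only when the application is submitted in standalone cluster mode (--deploy-mode cluster).
      // In client mode the driver starts inside the submitting process, and on YARN this
      // Master is not involved at all, so no driver is registered or scheduled here.
      // Iterate over the drivers waiting to be scheduled
      for (driver <- waitingDrivers) {
        // If the worker's free memory is at least the memory the driver needs,
        // and the worker's free CPU cores are at least the cores the driver needs.
        // The driver is launched on a worker, so that worker must have enough memory and CPU to run it
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          // Launch the driver
          launchDriver(worker, driver)
          // Remove it from the waiting queue
          waitingDrivers -= driver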
        }
      }
    }
    // Then schedule the workers: launch executors on them
    startExecutorsOnWorkers()
  }

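As the source above shows, the master first randomly shuffles all registered workers, then iterates over them, skipping any worker that is not ALIVE, and schedules drivers before anything else. Why are drivers scheduled first?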

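Drivers take strict precedence over executors, and the Master only has drivers to schedule in standalone cluster mode. In that mode the driver itself runs on a worker, so the Master has to find a registered worker with enough free memory and CPU cores to launch it. In client mode the driver starts locally in the submitting process, so there is nothing to schedule. Applications running on YARN (yarn-client or yarn-cluster) never go through this Master at all; YARN's ResourceManager places the driver there.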
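The driver placement loop can be sketched in isolation. This is a minimal simulation, not Spark's code: `Worker`, `Driver`, and `placeDrivers` are hypothetical stand-ins for `WorkerInfo`, `DriverInfo`, and the loop inside schedule(). The sketch also deducts resources itself, on the assumption that Spark does the equivalent bookkeeping inside launchDriver(), which the excerpt does not show.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object DriverPlacementSketch {
  // Hypothetical, simplified stand-ins for Spark's WorkerInfo and DriverInfo
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int, alive: Boolean = true)
  final case class Driver(id: String, mem: Int, cores: Int)

  /** Returns driverId -> workerId for every driver that could be placed. */
  def placeDrivers(
      workers: Seq[Worker],
      waiting: ArrayBuffer[Driver],
      rng: Random = new Random()): Map[String, String] = {
    var placed = Map.empty[String, String]
    // Shuffle first so that drivers do not always land on the same workers
    for (worker <- rng.shuffle(workers) if worker.alive) {
      // Iterate over a snapshot so that removing from `waiting` is safe
      for (driver <- waiting.toList) {
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          // Assumption: the real launchDriver() records the used resources on the worker
          worker.memoryFree -= driver.mem
          worker.coresFree -= driver.cores
          waiting -= driver
          placed += (driver.id -> worker.id)
        }
      }
    }
    placed
  }
}
```

Drivers that fit nowhere simply stay in the waiting buffer, which is why schedule() can be called again later to retry them.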

Once the drivers are scheduled, executor scheduling begins with a call to startExecutorsOnWorkers(), whose source is:

private def startExecutorsOnWorkers(): Unit = {
    // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
    // in the queue, then the second app, etc.
    // Application scheduling; the spreadOutApps strategy is used by default

    // Iterate over the apps in waitingApps, keeping only those that still need cores,
    // i.e. handle the CPU cores each app is still waiting for
    for (app <- waitingApps if app.coresLeft > 0) {
      // This is the --executor-cores option passed to spark-submit
      val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
      // Filter out workers that don't have enough resources to launch an executor
      // Keep only the workers that are ALIVE and have enough free memory and cores
      // for one executor of this Application,
      // then sort them by free CPU cores in descending order
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor.getOrElse(1))
        .sortBy(_.coresFree).reverse

      // Decide how many CPU cores (subject to memory) to take from each worker
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

      // Now that we've decided how many cores to allocate on each worker, let's allocate them
      // Once the cores each worker gives to the application have been decided,
      // iterate over the worker nodes
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        // Launch the executors
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
      }
    }
  }

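Applications are scheduled here in FIFO order by default. That algorithm has been covered before and is not repeated here; the source is analyzed below.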

First, an Application is taken from the queue of waiting apps, keeping only apps that still need cores (apps that are not yet fully scheduled);

Next, the app's coresPerExecutor parameter is read. It is the number of CPU cores each executor gets (settable with the spark-submit option --executor-cores);

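The next statement is the most involved one. It keeps only the workers in the ALIVE state, then keeps only those whose free memory and CPU cores are enough to launch one executor for the Application, and sorts the result by free CPU cores in descending order. scheduleExecutorsOnWorkers() then decides how many cores to take from each worker, and finally allocateWorkerResourceToExecutors() launches the executors.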
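The filter-and-sort step can be tried on toy data. The sketch below assumes a hypothetical `Worker` case class in place of Spark's `WorkerInfo` (with a plain `alive` flag instead of `WorkerState`), and `usableWorkers` is an illustrative helper name, not a Spark method.

```scala
object UsableWorkersSketch {
  // Hypothetical stand-in for WorkerInfo: only the fields the filter looks at
  final case class Worker(id: String, memoryFree: Int, coresFree: Int, alive: Boolean = true)

  /** Alive workers that can host at least one executor, most free cores first. */
  def usableWorkers(
      workers: Seq[Worker],
      memoryPerExecutorMB: Int,
      coresPerExecutor: Option[Int]): Array[Worker] =
    workers.toArray
      .filter(_.alive)
      .filter(w => w.memoryFree >= memoryPerExecutorMB &&
        w.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
}
```

With executors of 2048 MB and 2 cores, a worker holding 8 free cores but only 1024 MB is dropped, as is a worker with enough memory but a single free core. When --executor-cores is not set, the core threshold falls back to 1, so the single-core worker becomes usable.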

Let us now look at how executors are assigned to worker nodes. The source of scheduleExecutorsOnWorkers() is:

private def scheduleExecutorsOnWorkers(
      app: ApplicationInfo,
      usableWorkers: Array[WorkerInfo],
      spreadOutApps: Boolean): Array[Int] = {
    // --executor-cores: the number of cores each executor of this app gets
    val coresPerExecutor = app.desc.coresPerExecutor
    // Minimum number of cores per executor; 1 when coresPerExecutor is not set
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
    // If --executor-cores is not set, launch only one executor per worker
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    // --executor-memory: the amount of memory each executor gets
    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    // Number of usable workers
    val numUsable = usableWorkers.length
    // Array (initially all zeros) holding the number of cores to give to each worker
    val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
    // Number of new executors to launch on each worker
    val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
    // Cores to assign: the smaller of the cores the app still needs and the total free cores on the usable workers.
    // If the workers do not have enough cores in total, only that many can be assigned for now.
    // app.coresLeft = requestedCores - coresGranted,
    // where requestedCores is the number of cores the app asked for and coresGranted is the number already granted to it
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

    /** Return whether the specified worker can launch an executor for this app. */
    // Check whether this worker still has enough free cores and memory for the app
    def canLaunchExecutor(pos: Int): Boolean = {
      // True only while the cores still to be assigned cover at least one more allocation
      val keepScheduling = coresToAssign >= minCoresPerExecutor
      // Whether the worker's remaining cores are enough for one more allocation
      val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

      // If we allow multiple executors per worker, then we can always launch new executors.
      // Otherwise, if there is already an executor on this worker, just give it more cores.
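      // In other words: when --executor-cores is set, a worker with enough resources can host
      // several executors; otherwise the single executor already placed on it just gets more cores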
      val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      if (launchingNewExecutor) {
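        // Memory already assigned on this worker during this scheduling pass
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        // Whether the remaining memory is enough for one more executor
        val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor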
        val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
        keepScheduling && enoughCores && enoughMemory && underLimit
      } else {
        // Keep adding cores to the executor already placed on this worker
        keepScheduling && enoughCores
      }
    }
    
    // Keep only the workers that can still launch an executor
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunchExecutor(pos)) {
          // This worker has enough memory and cores for one more allocation
          // Reduce the number of cores still to be assigned
          coresToAssign -= minCoresPerExecutor
          // Record how many cores have been assigned on the worker at pos
          assignedCores(pos) += minCoresPerExecutor

          // If we are launching one executor per worker, then every iteration assigns 1 core
          // to the executor. Otherwise, every iteration assigns cores to a new executor.
          if (oneExecutorPerWorker) {
            // One executor per worker
            assignedExecutors(pos) = 1
          } else {
            // With enough resources, one worker can host several executors
            assignedExecutors(pos) += 1
          }

          // Spreading out an application means spreading out its executors across as
          // many workers as possible. If we are not spreading out, then we should keep
          // scheduling executors on this worker until we use all of its resources.
          // Otherwise, just move on to the next worker.
          // With spreadOutApps, move on to the next worker after each allocation,
          // cycling round-robin so that every usable worker gets a share.
          // Otherwise, keep allocating on the current worker until its
          // resources are used up.
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      // Drop the workers that no longer have enough resources
      freeWorkers = freeWorkers.filter(canLaunchExecutor)
    }
    assignedCores
  }

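As the code above shows, the master has two resource scheduling strategies: SpreadOut and non-SpreadOut. The difference is visible in the source. SpreadOut places executors on as many workers as possible, so that every node does some work and resources are not left idle. Non-SpreadOut places executors on as few workers as possible and moves to the next worker only once the current one runs out of resources. Note that this algorithm improves on older versions: in 1.3 (the version I studied previously) cores were assigned one at a time, whereas here they are assigned in units of the --executor-cores value configured for spark-submit.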

Here is an example.

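Suppose an Application needs 4 executors with 2 CPU cores each (8 cores in total), and there are 4 workers with 4 CPU cores each. With SpreadOut, all 4 workers launch one executor with 2 CPU cores. With non-SpreadOut, only 2 workers are used: the first worker is filled with 4 cores before the second one is touched, so each of the two runs two 2-core executors. (If --executor-cores were not set, each of those two workers would instead run a single executor holding all 4 cores.) The other two workers do nothing for this Application.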
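The example can be checked with a stripped-down simulation of the allocation loop. This is a sketch under stated assumptions: `Worker` and `assignCores` are hypothetical names, the app.executorLimit check from the original method is left out (as if the limit were unbounded), and each worker is assumed to have 8192 MB free with 1024 MB per executor so that memory never becomes the bottleneck.

```scala
object SpreadOutSketch {
  // Hypothetical stand-in for WorkerInfo: only the free resources matter here
  final case class Worker(coresFree: Int, memoryFree: Int)

  /** Cores to take from each worker, following the loop in scheduleExecutorsOnWorkers(). */
  def assignCores(
      workers: Array[Worker],
      coresLeft: Int,
      coresPerExecutor: Option[Int],
      memoryPerExecutor: Int,
      spreadOut: Boolean): Array[Int] = {
    val minCores = coresPerExecutor.getOrElse(1)
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    val assignedCores = new Array[Int](workers.length)
    val assignedExecutors = new Array[Int](workers.length)
    var coresToAssign = math.min(coresLeft, workers.map(_.coresFree).sum)

    def canLaunch(pos: Int): Boolean = {
      val keepScheduling = coresToAssign >= minCores
      val enoughCores = workers(pos).coresFree - assignedCores(pos) >= minCores
      if (!oneExecutorPerWorker || assignedExecutors(pos) == 0) {
        // A new executor would be launched, so its memory must fit as well
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        val enoughMemory = workers(pos).memoryFree - assignedMemory >= memoryPerExecutor
        keepScheduling && enoughCores && enoughMemory
      } else {
        // The existing executor on this worker only receives more cores
        keepScheduling && enoughCores
      }
    }

    var free = workers.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keep = true
        while (keep && canLaunch(pos)) {
          coresToAssign -= minCores
          assignedCores(pos) += minCores
          if (oneExecutorPerWorker) {
            assignedExecutors(pos) = 1
          } else {
            assignedExecutors(pos) += 1
          }
          // Spreading out: one allocation per worker per round, then move on
          if (spreadOut) keep = false
        }
      }
      free = free.filter(canLaunch)
    }
    assignedCores
  }
}
```

For four workers built with `Array.fill(4)(SpreadOutSketch.Worker(4, 8192))`, calling `assignCores(ws, 8, Some(2), 1024, spreadOut = true)` yields 2 cores on every worker, while `spreadOut = false` yields 4, 4, 0, 0: the first two workers are filled and the last two are never reached.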

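In summary, the master has two resource scheduling strategies, SpreadOut and non-SpreadOut. The former spreads the load evenly across the workers. The latter uses up each worker's resources as fully as possible before moving on, which can leave other workers idle.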
