Testing Spark Programs Locally: Debugging Spark in Local Mode

Reposted

mob6454cc6bcf40 2023-09-29 23:54:40


Local run mode

Overview
Execution flow diagram

Detailed execution flow

Implementation principles
Environment setup and examples

Overview

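Spark's Local run mode is also called local mode and is sometimes loosely described as a pseudo-distributed mode. It is called local mode because all of Spark's components (driver and executor) run inside a single JVM on one local machine, with no resource manager required. It uses multiple threads on a single machine to simulate Spark's distributed computation and is generally used for testing.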

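The standard form of the master setting in local mode is local[N], where N is the number of threads used to simulate Spark's distributed computation. If N is omitted (plain local), a single thread is used, which corresponds to one core. local[*] runs Spark locally with as many worker threads as the machine has logical cores.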
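The rules above can be sketched in plain Scala without a Spark dependency. This is only an illustration of how a local master string maps to a thread count; the object LocalMasterThreads and the helper threadsFor are hypothetical names and not part of Spark's API.

```scala
object LocalMasterThreads {
  // Matches "local[N]" and "local[*]"; the capture group is N or "*".
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  /** Hypothetical helper: worker-thread count implied by a local master string. */
  def threadsFor(master: String): Int = master match {
    case "local" => 1 // no N given: a single worker thread
    case LocalN("*") => Runtime.getRuntime.availableProcessors() // one per logical core
    case LocalN(n) =>
      val count = n.toInt
      require(count > 0, s"Asked to run locally with $count threads")
      count
    case other =>
      throw new IllegalArgumentException(s"Not a local master string: $other")
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[4]", "local[*]").foreach { m =>
      println(s"$m -> ${threadsFor(m)} thread(s)")
    }
  }
}
```

The value printed for local[*] depends on the machine, since it comes from Runtime.getRuntime.availableProcessors().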

Execution flow diagram

The execution flow of local mode is shown in the figure below.

[Figure: execution flow of Spark local mode]

Detailed execution flow

1. Start the application

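Starting the application means creating the SparkContext object. This stage mainly initializes the schedulers (DAGScheduler, TaskSchedulerImpl) and the local endpoint components (LocalBackend, which appears as LocalSchedulerBackend in the code below, and LocalEndpoint).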

private def createTaskScheduler(
      sc: SparkContext,
      master: String,
      deployMode: String): (SchedulerBackend, TaskScheduler) = {
       ... 
      // With no thread count given, run single-threaded: one thread handles the tasks
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        // Start a single thread to process tasks
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_REGEX(threads) =>
        // Number of CPU cores available on this node; for local[*], start that many threads
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        if (threadCount <= 0) {
          throw new SparkException(s"Asked to run locally with $threadCount threads")
        }
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*, M] means the number of cores on the computer with M failures
        // local[N, M] means exactly N threads with M failures
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)
      ... 
      }
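The second constructor argument of TaskSchedulerImpl in the excerpt is the allowed number of failures per task. The sketch below is a simplified model of that limit, not Spark code: runWithMaxFailures is a hypothetical helper, and it assumes that the limit counts total failed attempts, so a limit of 1 means no retry (this matches MAX_LOCAL_TASK_FAILURES being 1 in the Spark source, as far as I know).

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object LocalTaskFailures {
  /**
   * Hypothetical helper, not Spark code: run `body` until it succeeds or has
   * failed `maxFailures` times. With maxFailures = 1 there is no retry.
   * `body` receives the 1-based attempt number.
   */
  def runWithMaxFailures[A](maxFailures: Int)(body: Int => A): Try[A] = {
    require(maxFailures >= 1, "maxFailures must be at least 1")

    @tailrec
    def attempt(n: Int): Try[A] = Try(body(n)) match {
      case ok @ Success(_) => ok
      case Failure(_) if n < maxFailures => attempt(n + 1) // below the limit: retry
      case failed => failed // limit reached: give up
    }

    attempt(1)
  }

  def main(args: Array[String]): Unit = {
    // A task that fails on its first two attempts and succeeds on the third.
    val flaky: Int => String = attemptNo =>
      if (attemptNo < 3) throw new RuntimeException(s"attempt $attemptNo failed")
      else "done"

    // Limit of 1, as assumed for plain "local": the first failure is final.
    println(runWithMaxFailures(1)(flaky))
    // Limit of 3, as a master such as "local[2, 3]" would request.
    println(runWithMaxFailures(3)(flaky))
  }
}
```

In real Spark the retry is driven by the task set manager rather than by a loop around the task body, so this only illustrates the counting rule.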

2. Run the job: create the Executor and run tasks

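Executing a job begins by splitting it into scheduling stages, each of which produces a task set. The task sets are then sent, in the order in which the stages were split, to the local endpoint LocalEndpoint, which starts an Executor locally and runs the received tasks directly on that Executor.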

private[spark] class LocalEndpoint(
    override val rpcEnv: RpcEnv,
    userClassPath: Seq[URL],
    scheduler: TaskSchedulerImpl,
    executorBackend: LocalSchedulerBackend,
    private val totalCores: Int)
  extends ThreadSafeRpcEndpoint with Logging {
  ...
  // Start the executor; isLocal = true marks it as running locally
  private val executor = new Executor(
    localExecutorId, localExecutorHostname, SparkEnv.get, userClassPath, isLocal = true)
	
  ...
  def reviveOffers() {
    val offers = IndexedSeq(new WorkerOffer(localExecutorId, localExecutorHostname, freeCores,
      Some(rpcEnv.address.hostPort)))
    // Launch tasks on the executor's threads, limited by the free cores offered
    for (task <- scheduler.resourceOffers(offers).flatten) {
      freeCores -= scheduler.CPUS_PER_TASK
      executor.launchTask(executorBackend, task)
    }
  }
}

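If multiple threads are configured, the single local Executor created above runs tasks in parallel on multiple threads. No additional Executors are started.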
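To make the "one Executor, many threads" point concrete, here is a toy model built on a plain Java thread pool. It is not Spark's Executor class; SingleExecutorThreads and runTasks are hypothetical names, and the pool size stands in for the thread count taken from the master string.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

object SingleExecutorThreads {
  /**
   * Toy model, not Spark's Executor: run `taskCount` tasks on one pool of
   * `threads` worker threads and return (taskId, workerThreadName) pairs.
   */
  def runTasks(threads: Int, taskCount: Int): Seq[(Int, String)] = {
    val pool = Executors.newFixedThreadPool(threads) // one "executor", many threads
    try {
      // Submit every task first; the pool runs at most `threads` of them at once.
      val futures = (0 until taskCount).map { id =>
        pool.submit(new Callable[(Int, String)] {
          override def call(): (Int, String) = (id, Thread.currentThread().getName)
        })
      }
      futures.map(_.get()) // block until each task has finished
    } finally {
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }

  def main(args: Array[String]): Unit = {
    val results = runTasks(threads = 2, taskCount = 6)
    results.foreach { case (id, worker) => println(s"task $id ran on $worker") }
    println(s"distinct worker threads: ${results.map(_._2).distinct.size}")
  }
}
```

Which worker thread picks up which task varies from run to run; the only guarantee is that no more than the configured number of threads is ever used.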

3. Report task execution status

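The Executor runs the tasks, and the local endpoint LocalEndpoint reports each task's execution status to the scheduler above it. The scheduler updates the task's state from the message it receives and, based on that feedback, adjusts the state of the whole task set in real time. When a task finishes, LocalEndpoint returns its cores to the free pool and calls reviveOffers() so that waiting tasks can be launched.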

private[spark] class LocalEndpoint(
    override val rpcEnv: RpcEnv,
    userClassPath: Seq[URL],
    scheduler: TaskSchedulerImpl,
    executorBackend: LocalSchedulerBackend,
    private val totalCores: Int)
  extends ThreadSafeRpcEndpoint with Logging {
  ...
  // Task status update
    case StatusUpdate(taskId, state, serializedData) =>
      scheduler.statusUpdate(taskId, state, serializedData)
      if (TaskState.isFinished(state)) {
        freeCores += scheduler.CPUS_PER_TASK
        reviveOffers()
      }
    ...
  }
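The interplay between reviveOffers() and StatusUpdate comes down to bookkeeping of free cores. The following is a minimal, single-threaded sketch of that bookkeeping under simplifying assumptions: ToyLocalEndpoint is a hypothetical class, tasks are plain Long ids, and "launching" a task only records it as running instead of executing anything.

```scala
import scala.collection.mutable

/** Toy model of LocalEndpoint's core bookkeeping; it only records which tasks are running. */
class ToyLocalEndpoint(totalCores: Int, cpusPerTask: Int = 1) {
  private var freeCores = totalCores
  private val pending = mutable.Queue.empty[Long]
  private val running = mutable.Set.empty[Long]

  /** Queue new tasks and try to launch them right away. */
  def submit(taskIds: Seq[Long]): Unit = {
    pending ++= taskIds
    reviveOffers()
  }

  /** Launch pending tasks while enough free cores remain. */
  def reviveOffers(): Unit =
    while (pending.nonEmpty && freeCores >= cpusPerTask) {
      freeCores -= cpusPerTask
      running += pending.dequeue()
    }

  /** Counterpart of the StatusUpdate case: a finished task frees its cores. */
  def statusUpdate(taskId: Long, finished: Boolean): Unit =
    if (finished && running.remove(taskId)) {
      freeCores += cpusPerTask
      reviveOffers()
    }

  def runningTasks: Set[Long] = running.toSet
  def free: Int = freeCores
}

object ToyLocalEndpointDemo {
  def main(args: Array[String]): Unit = {
    val endpoint = new ToyLocalEndpoint(totalCores = 2)
    endpoint.submit(Seq(1L, 2L, 3L))
    println(endpoint.runningTasks) // tasks 1 and 2 hold both cores, task 3 waits
    endpoint.statusUpdate(1L, finished = true)
    println(endpoint.runningTasks) // task 3 has taken the core freed by task 1
  }
}
```

The real LocalEndpoint gets the same effect without locks because it extends ThreadSafeRpcEndpoint, which processes its messages one at a time.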