Table of Contents
- 1. The Submit Command
- 2. Source Code Analysis
- 3. Glossary
1. The Submit Command
In real production environments, Spark jobs are almost always submitted in yarn-cluster mode, for example:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.11-2.3.2.3.1.0.0-78.jar \
10
2. Source Code Analysis
After the submit command runs, the $SPARK_HOME/bin/spark-submit script is invoked first. Inside the spark-submit executable you will find:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
Looking at the bin/spark-class executable, you will find that it ultimately launches the submission command: /bin/java org.apache.spark.deploy.SparkSubmit --master ... --class ...
So the main method of SparkSubmit must be the program entry point. Open the Spark source project in IDEA (Control+Shift+N, or double-tap Shift), search for "org.apache.spark.deploy.SparkSubmit", and go straight to the companion object in the Scala sources.
override def main(args: Array[String]): Unit = {
// 1. First, a SparkSubmit instance is created here
val submit = new SparkSubmit() {
self =>
// Overrides the argument-parsing method of class SparkSubmit
override protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
new SparkSubmitArguments(args) {
override protected def logInfo(msg: => String): Unit = self.logInfo(msg)
override protected def logWarning(msg: => String): Unit = self.logWarning(msg)
override protected def logError(msg: => String): Unit = self.logError(msg)
}
}
// Info logging method
override protected def logInfo(msg: => String): Unit = printMessage(msg)
// Warning logging method
override protected def logWarning(msg: => String): Unit = printMessage(s"Warning: $msg")
// Error logging method
override protected def logError(msg: => String): Unit = printMessage(s"Error: $msg")
// 3. Override the submit method and catch exceptions
override def doSubmit(args: Array[String]): Unit = {
try {
// 4. This enters doSubmit() of class SparkSubmit
super.doSubmit(args)
} catch {
case e: SparkUserAppException =>
exitFn(e.exitCode)
}
}
}
// 2. Call doSubmit() on the SparkSubmit instance created above
submit.doSubmit(args)
}
As you can see, main runs with the arguments passed in as args: Array[String]; it calls doSubmit(args) on the SparkSubmit instance, which in turn calls doSubmit(args) of the parent class SparkSubmit with the same arguments.
// 5. Execute the doSubmit method
def doSubmit(args: Array[String]): Unit = {
// Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
// be reset before the application starts.
val uninitLog: Boolean = initializeLogIfNecessary(isInterpreter = true, silent = true)
// 6. Call parseArguments() to parse the submitted args and the Spark configuration files
val appArgs: SparkSubmitArguments = parseArguments(args)
// If verbose is enabled, log the parsed arguments
if (appArgs.verbose) {
logInfo(appArgs.toString)
}
// Match the requested action: submit, kill, request status, or print version.
// The action was wrapped into SparkSubmitAction during parsing and is matched here.
// If no action was given, SparkSubmitArguments defaults it to SparkSubmitAction.SUBMIT,
// so a submission enters submit()
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
case SparkSubmitAction.KILL => kill(appArgs)
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
case SparkSubmitAction.PRINT_VERSION => printVersion()
}
}
When doSubmit runs, it calls parseArguments(args) to parse the arguments; the method looks like this:
/**
 * Argument-parsing method.
 * Execution first enters the parseArguments() overridden in object SparkSubmit,
 * which simply creates a SparkSubmitArguments(args) instance.
 * SparkSubmitArguments extends the abstract class SparkSubmitArgumentsParser,
 * which in turn extends SparkSubmitOptionParser -- the same parent class that
 * OptionParser in launcher.Main uses to parse arguments.
 * SparkSubmitArguments defines the parameters needed by every run mode;
 * here all the parameters required for submit, plus the Spark default configuration, are parsed.
 */
protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
new SparkSubmitArguments(args)
}
Clicking into SparkSubmitArguments, you will find the call that parses the command line:
// Set parameters from command line arguments
parse(args.asJava)
Stepping into parse, you will see that each command-line option is split with a regular expression and then dispatched to handle; the handle method that actually processes the options is the one overridden in SparkSubmitArguments, as sketched below.
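The mechanics are easy to reproduce in isolation. The following is a minimal, self-contained sketch of the same idea (an illustration of the technique only, not Spark's actual code; only three option names are handled for brevity): an option such as --master=yarn or --master yarn is split into an (option, value) pair and dispatched through a pattern match, mirroring how parse() hands each pair to handle().
// Minimal illustration: split "--key=value" with a regex (or pair "--key" with the
// next token) and dispatch the (option, value) pair through a pattern match.
object MiniOptionParser {
  private val EqSeparatedOpt = """(--[^=]+)=(.*)""".r

  def main(cliArgs: Array[String]): Unit = {
    var master: String = null
    var mainClass: String = null
    var deployMode: String = null

    // Plays the role of SparkSubmitArguments.handle(): store each value in a field.
    def handle(opt: String, value: String): Unit = opt match {
      case "--master"      => master = value
      case "--class"       => mainClass = value
      case "--deploy-mode" => deployMode = value
      case other           => throw new IllegalArgumentException(s"Unexpected argument '$other'.")
    }

    var i = 0
    while (i < cliArgs.length) {
      cliArgs(i) match {
        case EqSeparatedOpt(opt, value)    => handle(opt, value); i += 1
        case opt if i + 1 < cliArgs.length => handle(opt, cliArgs(i + 1)); i += 2
        case opt                           => throw new IllegalArgumentException(s"Missing value for '$opt'.")
      }
    }
    println(s"master=$master, class=$mainClass, deploy-mode=$deployMode")
  }
}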
SparkSubmitArguments contains a field named action:
// Declare and initialize the action field
var action: SparkSubmitAction = null
// If action has no value, default it to SUBMIT
action = Option(action).getOrElse(SUBMIT)
// Back in the earlier code:
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
......
}
So the default action is submit.
@tailrec
private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
// Mainly calls runMain() to launch the main() of the prepared environment.
// Defined here, but only invoked below once the checks that follow have run
// #2
def doRunMain(): Unit = {
// --proxy-user may be specified at submit time; if not, the current user is used
if (args.proxyUser != null) {
val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
UserGroupInformation.getCurrentUser())
try {
proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
// The real execution happens here: runMain()
override def run(): Unit = {
runMain(args, uninitLog)
}
})
} catch {
case e: Exception =>
// Hadoop's AuthorizationException suppresses the exception's stack trace, which
// makes the message printed to the output by the JVM not very helpful. Instead,
// detect exceptions with empty stack traces here, and treat them differently.
if (e.getStackTrace().length == 0) {
error(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
} else {
throw e
}
}
} else {
// #3 No proxy user specified: call runMain() directly
runMain(args, uninitLog)
}
}
// In standalone cluster mode, there are two submission gateways:
// (1) The traditional RPC gateway using o.a.s.deploy.Client as a wrapper
// (2) The new REST-based gateway introduced in Spark 1.3
// The latter is the default behavior as of Spark 1.3, but Spark submit will fail over
// to use the legacy gateway if the master endpoint turns out to be not a REST server.
// Standalone cluster mode has two submission gateways:
// the legacy RPC gateway wrapped by o.a.s.deploy.Client, and the REST-based gateway (default since Spark 1.3).
// If the master endpoint turns out not to be a REST server, Spark fails over to the RPC gateway.
// Here we check for standalone cluster mode with REST enabled
if (args.isStandaloneCluster && args.useRest) {
// Try the REST gateway: log a message and enter doRunMain()
try {
logInfo("Running Spark using the REST application submission protocol.")
doRunMain()
} catch {
// Fail over to use the legacy submission gateway
// On connection failure, log a warning, disable REST, and resubmit through the legacy gateway
case e: SubmitRestConnectionException =>
logWarning(s"Master endpoint ${args.master} was not a REST server. " +
"Falling back to legacy submission gateway instead.")
args.useRest = false
submit(args, uninitLog = false)
}
// In all other modes, just run the main class as prepared
// (doRunMain() above runs the prepared main class; the SparkContext and SparkSession are created later, inside the user application)
} else {
// #1
doRunMain()
}
}
private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
// #1 Critically important: prepare the submit environment first, passing in the parsed arguments
val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
// Let the main class re-initialize the logging system once it starts.
if (uninitLog) {
Logging.uninitialize()
}
if (args.verbose) {
logInfo(s"Main class:\n$childMainClass")
logInfo(s"Arguments:\n${childArgs.mkString("\n")}")
// sysProps may contain sensitive information, so redact before printing
logInfo(s"Spark config:\n${Utils.redact(sparkConf.getAll.toMap).mkString("\n")}")
logInfo(s"Classpath elements:\n${childClasspath.mkString("\n")}")
logInfo("\n")
}
// #2 Get the class loader used for the submission
val loader = getSubmitClassLoader(sparkConf)
for (jar <- childClasspath) {
addJarToClasspath(jar, loader)
}
var mainClass: Class[_] = null
try {
// #3 Load the class from its name (a string) via reflection
mainClass = Utils.classForName(childMainClass)
} catch {
case e: ClassNotFoundException =>
logError(s"Failed to load class $childMainClass.")
if (childMainClass.contains("thriftserver")) {
logInfo(s"Failed to load main class $childMainClass.")
logInfo("You need to build Spark with -Phive and -Phive-thriftserver.")
}
throw new SparkUserAppException(CLASS_NOT_FOUND_EXIT_STATUS)
case e: NoClassDefFoundError =>
logError(s"Failed to load $childMainClass: ${e.getMessage()}")
if (e.getMessage.contains("org/apache/hadoop/hive")) {
logInfo(s"Failed to load hive class.")
logInfo("You need to build Spark with -Phive and -Phive-thriftserver.")
}
throw new SparkUserAppException(CLASS_NOT_FOUND_EXIT_STATUS)
}
// #4 Check whether mainClass implements SparkApplication; if so take the if branch, otherwise the else branch
val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
// #5 Instantiate mainClass through its constructor and cast it to SparkApplication
mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
} else {
// #6 Otherwise wrap mainClass in a JavaMainApplication
new JavaMainApplication(mainClass)
}
@tailrec
def findCause(t: Throwable): Throwable = t match {
case e: UndeclaredThrowableException =>
if (e.getCause() != null) findCause(e.getCause()) else e
case e: InvocationTargetException =>
if (e.getCause() != null) findCause(e.getCause()) else e
case e: Throwable =>
e
}
try {
// #7
app.start(childArgs.toArray, sparkConf)
} catch {
case t: Throwable =>
throw findCause(t)
}
}
From the code above, the crucial value is childMainClass: mainClass is resolved from it, and everything that follows builds on that instance. childMainClass itself is one of the values returned by prepareSubmitEnvironment:
private[deploy] def prepareSubmitEnvironment(
args: SparkSubmitArguments,
conf: Option[HadoopConfiguration] = None)
: (Seq[String], Seq[String], SparkConf, String) = {
......
// #1 Declare and initialize childMainClass
var childMainClass = ""
......
// #2 Check the cluster mode; here it is yarn cluster mode
if (isYarnCluster) {
// #3 Reassign it
childMainClass = YARN_CLUSTER_SUBMIT_CLASS
if (args.isPython) {
childArgs += ("--primary-py-file", args.primaryResource)
childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
} else if (args.isR) {
val mainFile = new Path(args.primaryResource).getName
childArgs += ("--primary-r-file", mainFile)
childArgs += ("--class", "org.apache.spark.deploy.RRunner")
} else {
if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
childArgs += ("--jar", args.primaryResource)
}
childArgs += ("--class", args.mainClass)
}
if (args.childArgs != null) {
args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
}
}
......
// #4 Return the result
(childArgs.toSeq, childClasspath.toSeq, sparkConf, childMainClass)
}
Tracing the code gives childMainClass:
private[deploy] val YARN_CLUSTER_SUBMIT_CLASS = "org.apache.spark.deploy.yarn.YarnClusterApplication"
// i.e. the variable ends up assigned as:
childMainClass = "org.apache.spark.deploy.yarn.YarnClusterApplication"
If you cannot find the source of org.apache.spark.deploy.yarn.YarnClusterApplication, the fix is to open the pom.xml of the spark-core module and add the dependency:
<!-- Import the spark-yarn dependency so its sources can be analyzed -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-yarn_${scala.binary.version}</artifactId>
<version>3.1.0</version>
</dependency>
// YarnClusterApplication extends SparkApplication,
// which corresponds to the mainClass check in the runMain method
private[spark] class YarnClusterApplication extends SparkApplication {
override def start(args: Array[String], conf: SparkConf): Unit = {
// SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
// so remove them from sparkConf here for yarn mode.
conf.remove(JARS)
conf.remove(FILES)
// #1 The key piece is Client
new Client(new ClientArguments(args), conf, null).run()
}
}
// Corresponds to the mainClass check in the runMain method:
// #4 Check whether mainClass implements SparkApplication; if so take the if branch, otherwise the else branch
val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
// #5 Instantiate mainClass through its constructor and cast it to SparkApplication
mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
} else {
// #6 Otherwise wrap mainClass in a JavaMainApplication
new JavaMainApplication(mainClass)
}
......
try {
// #7 start
app.start(childArgs.toArray, sparkConf)
} catch {
case t: Throwable =>
throw findCause(t)
}
Inside YarnClusterApplication, the arguments are parsed by ClientArguments.
The most important piece is Client.
// #1 Client.scala
private val yarnClient = YarnClient.createYarnClient
// #2 YarnClient.java
@Public
public static YarnClient createYarnClient() {
YarnClient client = new YarnClientImpl();
return client;
}
// #3 YarnClientImpl.java
protected ApplicationClientProtocol rmClient;
The run() call in YarnClusterApplication:
new Client(new ClientArguments(args), conf, null).run()
// run() then calls submitApplication(), which returns the appId
def run(): Unit = {
this.appId = submitApplication()
......
}
// Tracing the source: submitApplication() => createContainerLaunchContext(), where amClass is assigned according to the deploy mode:
amClass = "org.apache.spark.deploy.yarn.ApplicationMaster"   (cluster mode)
amClass = "org.apache.spark.deploy.yarn.ExecutorLauncher"    (client mode)
Double-tap Shift and search for org.apache.spark.deploy.yarn.ApplicationMaster. In its main method, ApplicationMasterArguments is used to parse the arguments:
class ApplicationMasterArguments(val args: Array[String]) {
...
parseArgs(args.toList)
...
private def parseArgs(inputArgs: List[String]): Unit = {
...
// Parse the arguments via pattern matching
case ("--jar") :: value :: tail =>
userJar = value
args = tail
case ("--class") :: value :: tail =>
userClass = value
args = tail
...
}
}
Still in the main method of ApplicationMaster, an ApplicationMaster instance is then created:
val yarnConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
master = new ApplicationMaster(amArgs, sparkConf, yarnConf)
ApplicationMaster creates a YARN ResourceManager client object:
private val client = new YarnRMClient()
Inside YarnRMClient, the AMRMClient is the client that connects the ApplicationMaster to the ResourceManager:
private[spark] class YarnRMClient extends Logging {
private var amClient: AMRMClient[ContainerRequest] = _
...
}
At the end of ApplicationMaster's main method, run is called:
ugi.doAs(new PrivilegedExceptionAction[Unit]() {
override def run(): Unit = System.exit(master.run())
})
In run, cluster mode leads to runDriver:
final def run(): Int = {
...
if (isClusterMode) {
runDriver()
} else {
runExecutorLauncher()
}
...
}
runDriver calls startUserApplication to start the user application.
private def runDriver(): Unit = {
addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
// Start the user application
userClassThread = startUserApplication()
...
val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
Duration(totalWaitTime, TimeUnit.MILLISECONDS))
...
}
Inside startUserApplication, a class loader loads args.userClass, obtains that class's main method, and then creates and starts a thread named Driver.
private def startUserApplication(): Thread = {
...
val mainMethod = userClassLoader.loadClass(args.userClass)
.getMethod("main", classOf[Array[String]])
val userThread = new Thread {
override def run(): Unit = {
......
}
}
userThread.setContextClassLoader(userClassLoader)
userThread.setName("Driver")
userThread.start()
userThread
}
// Tracing the code, args.userClass is the class captured by the --class pattern match in ApplicationMasterArguments, i.e. the class specified with --class at submit time.
case ("--class") :: value :: tail =>
userClass = value
args = tail
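What startUserApplication does can be mimicked with a few lines of plain JVM reflection. The sketch below is only an illustration of that pattern (HelloApp and MiniDriverLauncher are made-up names, not Spark classes): load a class by name, look up its static main(Array[String]) method, and invoke it on a thread named Driver.
// Minimal, self-contained illustration of the startUserApplication pattern.
object HelloApp {
  def main(args: Array[String]): Unit =
    println(s"Hello from thread '${Thread.currentThread().getName}' with args ${args.mkString(",")}")
}

object MiniDriverLauncher {
  def startUserApplication(userClassName: String, userArgs: Array[String]): Thread = {
    // Reflectively resolve main(Array[String]) on the named class.
    val mainMethod = Class.forName(userClassName).getMethod("main", classOf[Array[String]])
    val userThread = new Thread {
      override def run(): Unit = mainMethod.invoke(null, userArgs) // static method, so the receiver is null
    }
    userThread.setName("Driver")
    userThread.start()
    userThread
  }

  def main(args: Array[String]): Unit = {
    // "HelloApp" stands in for whatever was passed with --class at submit time.
    startUserApplication("HelloApp", Array("hdfs:///input")).join()
  }
}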
Starting the Driver:
private def runDriver(): Unit = {
addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
// Start the user application
userClassThread = startUserApplication()
...
val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
Duration(totalWaitTime, TimeUnit.MILLISECONDS))
...
// RPC communication environment
val rpcEnv = sc.env.rpcEnv
...
// Register the AM and request resources
registerAM(host, port, userConf, sc.ui.map(_.webUrl), appAttemptId)
...
// Create the allocator
createAllocator(driverRef, userConf, rpcEnv, appAttemptId, distCacheConf)
}
Within runDriver, createAllocator creates the resource allocator:
private def createAllocator(
driverRef: RpcEndpointRef,
_sparkConf: SparkConf,
rpcEnv: RpcEnv,
appAttemptId: ApplicationAttemptId,
distCacheConf: SparkConf): Unit = {
...
// client here is the YarnRMClient
allocator = client.createAllocator(
yarnConf,
_sparkConf,
appAttemptId,
driverUrl,
driverRef,
securityMgr,
localResources)
...
// Ask YARN, through the allocator, for allocatable resources
allocator.allocateResources()
}
def allocateResources(): Unit = synchronized {
updateResourceRequests()
val progressIndicator = 0.1f
// Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
// requests.
val allocateResponse = amClient.allocate(progressIndicator)
// Containers allocated by YARN
val allocatedContainers = allocateResponse.getAllocatedContainers()
allocatorBlacklistTracker.setNumClusterNodes(allocateResponse.getNumClusterNodes)
// If more than 0 containers were allocated, resources are available and can be assigned
if (allocatedContainers.size > 0) {
logDebug(("Allocated containers: %d. Current executor count: %d. " +
"Launching executor count: %d. Cluster resources: %s.")
.format(
allocatedContainers.size,
getNumExecutorsRunning,
getNumExecutorsStarting,
allocateResponse.getAvailableResources))
// Process the containers available for allocation
handleAllocatedContainers(allocatedContainers.asScala.toSeq)
}
...
}
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
// Group the allocated containers, e.g. by host and by rack;
// using preferred locations, tasks can be sent to the best-placed containers
......
// Run the allocated containers
runAllocatedContainers(containersToUse)
}
runAllocatedContainers runs the allocated containers: a thread pool launches an ExecutorRunnable for each container and calls its run method.
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = synchronized {
for (container <- containersToUse) {
if (rpRunningExecs < getOrUpdateTargetNumExecutorsForRPId(rpId)) {
getOrUpdateNumExecutorsStartingForRPId(rpId).incrementAndGet()
if (launchContainers) {
// Thread pool
launcherPool.execute(() => {
try {
// Launch the Executor
new ExecutorRunnable(
Some(container),
conf,
sparkConf,
driverUrl,
executorId,
executorHostname,
containerMem,
containerCores,
appAttemptId.getApplicationId.toString,
securityMgr,
localResources,
rp.id
).run() // run
updateInternalState()
} catch {
...
}
}
}
}
}
}
The run method:
def run(): Unit = {
logDebug("Starting Executor Container")
// Create the client that connects to the NodeManager
nmClient = NMClient.createNMClient()
nmClient.init(conf)
nmClient.start()
// Start the Container
startContainer()
}
startContainer starts the container:
def startContainer(): java.util.Map[String, ByteBuffer] = {
...
// Build the container launch context; prepareCommand assembles the startup command
val commands = prepareCommand()
ctx.setCommands(commands.asJava)
...
try {
// Ask one of the NodeManagers to start the Container
nmClient.startContainer(container.get, ctx)
} catch {
...
}
}
prepareCommand assembles the command: /bin/java org.apache.spark.executor.YarnCoarseGrainedExecutorBackend
private def prepareCommand(): List[String] = {
...
val commands = prefixEnv ++
Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
javaOpts ++
Seq("org.apache.spark.executor.YarnCoarseGrainedExecutorBackend",
"--driver-url", masterAddress,
"--executor-id", executorId,
"--hostname", hostname,
"--cores", executorCores.toString,
"--app-id", appId,
"--resourceProfileId", resourceProfileId.toString) ++
userClassPath ++
Seq(
s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")
...
}
Double-tap Shift or press Control+Shift+N, search for org.apache.spark.executor.YarnCoarseGrainedExecutorBackend, and find its main method, which calls the run method of CoarseGrainedExecutorBackend:
private[spark] object YarnCoarseGrainedExecutorBackend extends Logging {
def main(args: Array[String]): Unit = {
...
CoarseGrainedExecutorBackend.run(backendArgs, createFn)
System.exit(0)
}
}
The run method creates the Executor's runtime environment and sets up an RPC Endpoint:
def run(
arguments: Arguments,
backendCreateFn: (RpcEnv, Arguments, SparkEnv, ResourceProfile) =>
CoarseGrainedExecutorBackend): Unit = {
...
driverConf.set(EXECUTOR_ID, arguments.executorId)
// Executor runtime environment
val env = SparkEnv.createExecutorEnv(driverConf, arguments.executorId, arguments.bindAddress,
arguments.hostname, arguments.cores, cfg.ioEncryptionKey,
isLocal = false)
// RPC Endpoint:
// the object created by backendCreateFn is registered as the endpoint named "Executor"
env.rpcEnv.setupEndpoint("Executor",
backendCreateFn(env.rpcEnv, arguments, env, cfg.resourceProfile))
arguments.workerUrl.foreach { url =>
env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
}
env.rpcEnv.awaitTermination()
}
Looking at setupEndpoint, it is declared as an abstract method:
def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
Use Control+Alt+B to jump to the implementation of the abstract method:
override def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef = {
// Register the RPC communication endpoint
dispatcher.registerRpcEndpoint(name, endpoint)
}
def registerRpcEndpoint(name: String, endpoint: RpcEndpoint): NettyRpcEndpointRef = {
// Endpoint address
val addr = RpcEndpointAddress(nettyEnv.address, name)
// Endpoint reference
val endpointRef = new NettyRpcEndpointRef(nettyEnv.conf, addr, nettyEnv)
synchronized {
if (stopped) {
throw new IllegalStateException("RpcEnv has been stopped")
}
if (endpoints.containsKey(name)) {
throw new IllegalArgumentException(s"There is already an RpcEndpoint called $name")
}
endpointRefs.put(endpoint, endpointRef)
// Message loop
var messageLoop: MessageLoop = null
try {
messageLoop = endpoint match {
case e: IsolatedRpcEndpoint =>
new DedicatedMessageLoop(name, e, this)
case _ =>
sharedLoop.register(name, endpoint)
sharedLoop
}
endpoints.put(name, messageLoop)
} catch {
case NonFatal(e) =>
endpointRefs.remove(endpoint)
throw e
}
}
endpointRef
}
DedicatedMessageLoop holds an inbox and a thread pool:
private class DedicatedMessageLoop(
name: String,
endpoint: IsolatedRpcEndpoint,
dispatcher: Dispatcher)
extends MessageLoop(dispatcher) {
private val inbox = new Inbox(name, endpoint)
override protected val threadpool = if (endpoint.threadCount() > 1) {
ThreadUtils.newDaemonCachedThreadPool(s"dispatcher-$name", endpoint.threadCount())
} else {
ThreadUtils.newDaemonSingleThreadExecutor(s"dispatcher-$name")
}
}
Tracing the Inbox code:
private[netty] class Inbox(val endpointName: String, val endpoint: RpcEndpoint) extends Logging {
inbox =>
@GuardedBy("this")
protected val messages = new java.util.LinkedList[InboxMessage]()
...
// An OnStart message is added to its own inbox (a message sent to itself)
inbox.synchronized {
messages.add(OnStart)
}
...
}
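To make the Inbox mechanics concrete, here is a minimal, self-contained sketch of the same pattern (an illustration only, not Spark's RpcEnv): one endpoint, one message queue, one dedicated dispatcher thread, and an OnStart message enqueued to the endpoint itself before anything else, so its onStart logic always runs first.
import java.util.concurrent.LinkedBlockingQueue

// Minimal illustration of the Inbox pattern: a dedicated daemon thread drains the
// endpoint's message queue, and OnStart is put into that queue at construction time,
// so it is always the first message processed.
object MiniInboxDemo {
  sealed trait Message
  case object OnStart extends Message
  final case class Envelope(body: String) extends Message

  class Endpoint(name: String) {
    private val inbox = new LinkedBlockingQueue[Message]()
    inbox.put(OnStart) // mirrors Inbox adding OnStart when it is created

    private val loop = new Thread(new Runnable {
      override def run(): Unit = while (true) handle(inbox.take())
    }, s"dispatcher-$name")
    loop.setDaemon(true)
    loop.start()

    def send(message: Message): Unit = inbox.put(message)

    private def handle(message: Message): Unit = message match {
      case OnStart        => println(s"[$name] onStart: would register with the driver here")
      case Envelope(body) => println(s"[$name] received: $body")
    }
  }

  def main(args: Array[String]): Unit = {
    val executor = new Endpoint("Executor")
    executor.send(Envelope("RegisteredExecutor"))
    Thread.sleep(200) // give the daemon dispatcher time to drain the inbox before the JVM exits
  }
}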
That OnStart message triggers the onStart method of CoarseGrainedExecutorBackend:
override def onStart(): Unit = {
...
rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
// Obtain a reference to the Driver
driver = Some(ref)
// Send a RegisterExecutor message to the Driver to complete registration
ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls,
extractAttributes, _resources, resourceProfile.id))
}(ThreadUtils.sameThread).onComplete {
case Success(_) =>
self.send(RegisteredExecutor)
case Failure(e) =>
exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
}(ThreadUtils.sameThread)
}
The Driver side must have a receiving endpoint for this request. Double-tap Shift and search for SparkContext:
// The scheduler backend used for communication
private var _schedulerBackend: SchedulerBackend = _
Tracing SchedulerBackend:
private[spark] trait SchedulerBackend {
...
}
Control+Alt+B shows its implementation:
class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
extends ExecutorAllocationClient with SchedulerBackend with Logging {
...
class DriverEndpoint extends IsolatedRpcEndpoint with Logging {
override def onStart(): Unit = {
// onStart method
...
}
...
override def receive: PartialFunction[Any, Unit] = {
// Method that receives messages
...
}
...
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
// Method that receives and replies to messages; RegisterExecutor is matched here
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls,
attributes, resources, resourceProfileId) =>
context.reply(true) // after a series of checks, reply that the registration succeeded
...
}
}
...
}
In the onStart method of CoarseGrainedExecutorBackend, when a successful reply comes back, the executor sends a message to itself:
override def onStart(): Unit = {
...
rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
// Obtain a reference to the Driver
driver = Some(ref)
// Send a RegisterExecutor message to the Driver to complete registration
ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls,
extractAttributes, _resources, resourceProfile.id))
}(ThreadUtils.sameThread).onComplete {
case Success(_) =>
// On success, send itself a message indicating registration succeeded
self.send(RegisteredExecutor)
case Failure(e) =>
exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
}(ThreadUtils.sameThread)
}
After successful registration, that message is handled by the receive method:
override def receive: PartialFunction[Any, Unit] = {
// Pattern match on the registration-success message
case RegisteredExecutor =>
logInfo("Successfully registered with driver")
try {
// Instantiate the Executor
executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false,
resources = _resources)
driver.get.send(LaunchedExecutor(executorId))
} catch {
case NonFatal(e) =>
exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
}
...
}
At this point the resource and environment setup driven by SparkSubmit is complete, and the Driver continues executing.
private def runDriver(): Unit = {
addAmIpFilter(None, System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV))
// Start the user application
userClassThread = startUserApplication()
...
val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
try {
val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
Duration(totalWaitTime, TimeUnit.MILLISECONDS))
if (sc != null) {
val rpcEnv = sc.env.rpcEnv
val userConf = sc.getConf
val host = userConf.get(DRIVER_HOST_ADDRESS)
val port = userConf.get(DRIVER_PORT)
// Register the AM and request resources
registerAM(host, port, userConf, sc.ui.map(_.webUrl), appAttemptId)
val driverRef = rpcEnv.setupEndpointRef(
RpcAddress(host, port),
YarnSchedulerBackend.ENDPOINT_NAME)
// Create the allocator
createAllocator(driverRef, userConf, rpcEnv, appAttemptId, distCacheConf)
} else {
throw new IllegalStateException("User did not initialize spark context!")
}
// Let the Driver continue
resumeDriver()
userClassThread.join()
} catch {
...
} finally {
resumeDriver()
}
}
In SparkContext:
// Signals that the preparation work is done
_taskScheduler.postStartHook()
// Tracing the code:
def postStartHook(): Unit = { }
// Control + Alt + B
override def postStartHook(): Unit = {
waitBackendReady()
}
// Tracing waitBackendReady: it loops and waits until the backend is ready, letting the Driver program continue
private def waitBackendReady(): Unit = {
if (backend.isReady) {
return
}
while (!backend.isReady) {
// Might take a while for backend to be ready if it is waiting on resources.
if (sc.stopped.get) {
// For example: the master removes the application for some reason
throw new IllegalStateException("Spark context stopped while waiting for backend")
}
synchronized {
this.wait(100)
}
}
}
Once the Driver is allowed to continue, the program goes on to execute the remaining user logic, for example a WordCount, as sketched below.
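For reference, a minimal user application of the kind this whole chain ends up running might look like the following (an illustrative sketch, not part of the Spark sources; the input path is assumed to arrive as the first program argument):
import org.apache.spark.sql.SparkSession

// A minimal WordCount: the kind of class passed with --class to spark-submit,
// whose main() is what the "Driver" thread started by startUserApplication executes.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext          // creating the SparkContext is what unblocks runDriver()

    sc.textFile(args(0))                 // e.g. an HDFS path passed as the first argument
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    spark.stop()
  }
}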
3. Glossary
RpcEnv: the RPC communication environment
Backend: the scheduler backend
Endpoint: a communication endpoint registered with the RpcEnv