ExecutionGraph生成过程
StreamGraph和JobGraph都是在client生成的,这篇文章将描述如何生成ExecutionGraph以及物理执行图。同时会讲解一个作业提交后如何被调度和执行。
client生成JobGraph之后,就通过submitJob提交至JobMaster。
在其构造函数中,会生成ExecutionGraph:
this.executionGraph = ExecutionGraphBuilder.buildGraph(...)
看下这个方法,比较长,略过了一些次要的代码片断:
// 流式作业中,schedule mode固定是EAGER的
executionGraph.setScheduleMode(jobGraph.getScheduleMode());
executionGraph.setQueuedSchedulingAllowed(jobGraph.getAllowQueuedScheduling());
<span class="hljs-comment">// 设置json plan</span>
<span class="hljs-comment">// ...</span>
<span class="hljs-comment">// 检查executableClass(即operator类),设置最大并发</span>
<span class="hljs-comment">// ...</span>
<span class="hljs-comment">// 按拓扑顺序,获取所有的JobVertex列表</span>
List<JobVertex> sortedTopology = jobGraph.getVerticesSortedTopologicallyFromSources();
<span class="hljs-comment">// 根据JobVertex列表,生成execution graph</span>
executionGraph.attachJobGraph(sortedTopology);
<span class="hljs-comment">// checkpoint检查</span></code></pre>
可以看到,生成execution graph的代码,主要是在最后一行,即ExecutionGraph.attachJobGraph方法:
public void attachJobGraph(List<JobVertex> topologiallySorted) throws JobException, IOException {
// 遍历job vertex
for (JobVertex jobVertex : topologiallySorted) {
// 根据每一个job vertex,创建对应的ExecutionVertex
ExecutionJobVertex ejv = new ExecutionJobVertex(this, jobVertex, 1, rpcCallTimeout, createTimestamp);
// 将创建的ExecutionJobVertex与前置的IntermediateResult连接起来
ejv.connectToPredecessors(this.intermediateResults);
ExecutionJobVertex previousTask = <span class="hljs-keyword">this</span>.tasks.putIfAbsent(jobVertex.getID(), ejv);
<span class="hljs-comment">// sanity check</span>
<span class="hljs-comment">// ...</span>
<span class="hljs-keyword">this</span>.verticesInCreationOrder.add(ejv);
}
}</code></pre>
可以看到,创建ExecutionJobVertex的重点就在它的构造函数中:
// 上面是并行度相关的设置
<span class="hljs-comment">// 序列化后的TaskInformation,这个信息很重要</span>
<span class="hljs-comment">// 后面deploy的时候会将TaskInformation分发到具体的Task中。</span>
<span class="hljs-keyword">this</span>.serializedTaskInformation = <span class="hljs-keyword">new</span> SerializedValue<>(<span class="hljs-keyword">new</span> TaskInformation(
jobVertex.getID(),
jobVertex.getName(),
parallelism,
maxParallelism,
<span class="hljs-comment">// 这个就是Task将要执行的Operator的类名</span>
jobVertex.getInvokableClassName(),
jobVertex.getConfiguration()));
<span class="hljs-comment">// ExecutionVertex列表,按照JobVertex并行度设置 </span>
<span class="hljs-keyword">this</span>.taskVertices = <span class="hljs-keyword">new</span> ExecutionVertex[numTaskVertices];
<span class="hljs-keyword">this</span>.inputs = <span class="hljs-keyword">new</span> ArrayList<>(jobVertex.getInputs().size());
<span class="hljs-comment">// slot sharing和coLocation相关代码</span>
<span class="hljs-comment">// ...</span>
<span class="hljs-comment">// 创建intermediate results,这是由当前operator的出度确定的,如果当前operator只向下游一个operator输出,则为1</span>
<span class="hljs-comment">// 注意一个IntermediateResult包含多个IntermediateResultPartition</span>
<span class="hljs-keyword">this</span>.producedDataSets = <span class="hljs-keyword">new</span> IntermediateResult[jobVertex.getNumberOfProducedIntermediateDataSets()];
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i < jobVertex.getProducedDataSets().size(); i++) {
<span class="hljs-keyword">final</span> IntermediateDataSet result = jobVertex.getProducedDataSets().get(i);
<span class="hljs-keyword">this</span>.producedDataSets[i] = <span class="hljs-keyword">new</span> IntermediateResult(
result.getId(),
<span class="hljs-keyword">this</span>,
numTaskVertices,
result.getResultType());
}
<span class="hljs-comment">// 根据job vertex的并行度,创建对应的ExecutionVertex列表。</span>
<span class="hljs-comment">// 即,一个JobVertex/ExecutionJobVertex代表的是一个operator,而</span>
<span class="hljs-comment">// 具体的ExecutionVertex则代表了每一个Task</span>
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i < numTaskVertices; i++) {
ExecutionVertex vertex = <span class="hljs-keyword">new</span> ExecutionVertex(
<span class="hljs-keyword">this</span>, i, <span class="hljs-keyword">this</span>.producedDataSets, timeout, createTimestamp, maxPriorAttemptsHistoryLength);
<span class="hljs-keyword">this</span>.taskVertices[i] = vertex;
}
<span class="hljs-comment">// sanity check</span>
<span class="hljs-comment">// ...</span>
<span class="hljs-comment">// set up the input splits, if the vertex has any</span>
<span class="hljs-comment">// 这是batch相关的代码</span>
<span class="hljs-comment">// ...</span>
finishedSubtasks = <span class="hljs-keyword">new</span> <span class="hljs-keyword">boolean</span>[parallelism];</code></pre>
ExecutionJobVertex和ExecutionVertex是创建完了,但是ExecutionEdge还没有创建呢,接下来看一下attachJobGraph
方法中这一行代码:
ejv.connectToPredecessors(this.intermediateResults);
这个方法代码如下:
// 获取输入的JobEdge列表
List<JobEdge> inputs = jobVertex.getInputs();
<span class="hljs-comment">// 遍历每条JobEdge </span>
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> num = <span class="hljs-number">0</span>; num < inputs.size(); num++) {
JobEdge edge = inputs.get(num);
<span class="hljs-comment">// 获取当前JobEdge的输入所对应的IntermediateResult</span>
IntermediateResult ires = intermediateDataSets.get(edge.getSourceId());
<span class="hljs-keyword">if</span> (ires == <span class="hljs-keyword">null</span>) {
<span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> JobException(<span class="hljs-string">"Cannot connect this job graph to the previous graph. No previous intermediate result found for ID "</span>
+ edge.getSourceId());
}
<span class="hljs-comment">// 将IntermediateResult加入到当前ExecutionJobVertex的输入中。</span>
<span class="hljs-keyword">this</span>.inputs.add(ires);
<span class="hljs-comment">// 为IntermediateResult注册consumer</span>
<span class="hljs-comment">// consumerIndex跟IntermediateResult的出度相关</span>
<span class="hljs-keyword">int</span> consumerIndex = ires.registerConsumer();
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i < parallelism; i++) {
ExecutionVertex ev = taskVertices[i];
<span class="hljs-comment">// 将ExecutionVertex与IntermediateResult关联起来</span>
ev.connectSource(num, ires, edge, consumerIndex);
}
}</code></pre>
看下ExecutionVertex.connectSource
方法代码:
public void connectSource(int inputNumber, IntermediateResult source, JobEdge edge, int consumerNumber) {
<span class="hljs-comment">// 只有forward的方式的情况下,pattern才是POINTWISE的,否则均为ALL_TO_ALL</span>
<span class="hljs-keyword">final</span> DistributionPattern pattern = edge.getDistributionPattern();
<span class="hljs-keyword">final</span> IntermediateResultPartition[] sourcePartitions = source.getPartitions();
ExecutionEdge[] edges;
<span class="hljs-keyword">switch</span> (pattern) {
<span class="hljs-keyword">case</span> POINTWISE:
edges = connectPointwise(sourcePartitions, inputNumber);
<span class="hljs-keyword">break</span>;
<span class="hljs-keyword">case</span> ALL_TO_ALL:
edges = connectAllToAll(sourcePartitions, inputNumber);
<span class="hljs-keyword">break</span>;
<span class="hljs-keyword">default</span>:
<span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> RuntimeException(<span class="hljs-string">"Unrecognized distribution pattern."</span>);
}
<span class="hljs-keyword">this</span>.inputEdges[inputNumber] = edges;
<span class="hljs-comment">// 之前已经为IntermediateResult添加了consumer,这里为IntermediateResultPartition添加consumer,即关联到ExecutionEdge上</span>
<span class="hljs-keyword">for</span> (ExecutionEdge ee : edges) {
ee.getSource().addConsumer(ee, consumerNumber);
}
}
connectAllToAll
方法:
ExecutionEdge[] edges = new ExecutionEdge[sourcePartitions.length];
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i < sourcePartitions.length; i++) {
IntermediateResultPartition irp = sourcePartitions[i];
edges[i] = <span class="hljs-keyword">new</span> ExecutionEdge(irp, <span class="hljs-keyword">this</span>, inputNumber);
}
<span class="hljs-keyword">return</span> edges;</code></pre>
看这个方法之前,需要知道,ExecutionVertex的inputEdges变量,是一个二维数据。它表示了这个ExecutionVertex上每一个input所包含的ExecutionEdge列表。
即,如果ExecutionVertex有两个不同的输入:输入A和B。其中输入A的partition=1, 输入B的partition=8,那么这个二维数组inputEdges如下(为简短,以irp代替IntermediateResultPartition)
[ ExecutionEdge[ A.irp[0]] ]
[ ExecutionEdge[ B.irp[0], B.irp[1], ..., B.irp[7] ]
所以上面的代码就很容易理解了。
到这里为止,ExecutionJobGraph就创建完成了。接下来看下这个ExecutionGraph是如何转化成Task并开始执行的。
Task调度和执行
接下来我们以最简单的mini cluster为例讲解一下Task如何被调度和执行。
简单略过client端job的提交和StreamGraph到JobGraph的翻译,以及上面ExecutionGraph的翻译。
提交后的job的流通过程大致如下:
env.execute('<job name>')
--> MiniCluster.runJobBlocking(jobGraph)
--> MiniClusterDispatcher.runJobBlocking(jobGraph)
--> MiniClusterDispatcher.startJobRunners
--> JobManagerRunner.start
--> JobMaster.<init> (build ExecutionGraph)
创建完JobMaster之后,JobMaster就会进行leader election,得到leader之后,会回调grantLeadership
方法,从而调用jobManager.start(leaderSessionID);
开始运行job。
JobMaster.start
--> JobMaster.startJobExecution(这里还没开始执行呢..)
--> resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
重点是在下面这行,获取到resource manage之后,就会回调ResourceManagerLeaderListener.notifyLeaderAddress
,整个调用流如下:
ResourceManagerLeaderListener.notifyLeaderAddress
--> JobMaster.notifyOfNewResourceManagerLeader
--> ResourceManagerConnection.start
--> ResourceManagerConnection.onRegistrationSuccess(callback,由flink rpc框架发送并回调)
--> JobMaster.onResourceManagerRegistrationSuccess
然后终于来到了最核心的调度代码,在JobMaster.onResourceManagerRegistrationSuccess
方法中:
executionContext.execute(new Runnable() {
@Override
public void run() {
try {
executionGraph.restoreExternalCheckpointedStore();
executionGraph.setQueuedSchedulingAllowed(true);
executionGraph.scheduleForExecution(slotPool.getSlotProvider());
}
catch (Throwable t) {
executionGraph.fail(t);
}
}
});
ExecutionGraph.scheduleForExecution --> ExecutionGraph.scheduleEager
这个方法会计算所有的ExecutionVertex总数,并为每个ExecutionVertex分配一个SimpleSlot(暂时不考虑slot sharing的情况),然后封装成ExecutionAndSlot,顾名思义,即ExecutionVertex + Slot(更为贴切地说,应该是ExecutionAttempt + Slot)。
然后调用execAndSlot.executionAttempt.deployToSlot(slot);
进行deploy,即Execution.deployToSlot
。
这个方法先会进行一系列状态迁移和检查,然后进行deploy,比较核心的代码如下:
final TaskDeploymentDescriptor deployment = vertex.createDeploymentDescriptor(
attemptId,
slot,
taskState,
attemptNumber);
<span class="hljs-comment">// register this execution at the execution graph, to receive call backs</span>
vertex.getExecutionGraph().registerExecution(<span class="hljs-keyword">this</span>);
<span class="hljs-keyword">final</span> TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway(); <span class="hljs-keyword">final</span> Future<Acknowledge> submitResultFuture = taskManagerGateway.submitTask(deployment, timeout);
ExecutionVertex.createDeploymentDescriptor方法中,包含了从Execution Graph到真正物理执行图的转换。如将IntermediateResultPartition转化成ResultPartition,ExecutionEdge转成InputChannelDeploymentDescriptor(最终会在执行时转化成InputGate)。
最后通过RPC方法提交task,实际会调用到TaskExecutor.submitTask
方法中。
这个方法会创建真正的Task,然后调用task.startTaskThread();
开始task的执行。
在Task构造函数中,会根据输入的参数,创建InputGate, ResultPartition, ResultPartitionWriter等。
而startTaskThread
方法,则会执行executingThread.start
,从而调用Task.run
方法。
它的最核心的代码如下:
// ...
<span class="hljs-comment">// now load the task's invokable code</span>
invokable = loadAndInstantiateInvokable(userCodeClassLoader, nameOfInvokableClass);
<span class="hljs-comment">// ...</span>
invokable.setEnvironment(env);
<span class="hljs-comment">// ...</span>
<span class="hljs-keyword">this</span>.invokable = invokable;
invokable.invoke();
<span class="hljs-comment">// task finishes or fails, do cleanup</span>
<span class="hljs-comment">// ...</span></code></pre>
这里的invokable即为operator对象实例,通过反射创建。具体地,即为OneInputStreamTask,或者SourceStreamTask等。这个nameOfInvokableClass是哪里生成的呢?其实早在生成StreamGraph的时候,这就已经确定了,见StreamGraph.addOperator
方法:
if (operatorObject instanceof StoppableStreamSource) {
addNode(vertexID, slotSharingGroup, StoppableSourceStreamTask.class, operatorObject, operatorName);
} else if (operatorObject instanceof StreamSource) {
addNode(vertexID, slotSharingGroup, SourceStreamTask.class, operatorObject, operatorName);
} else {
addNode(vertexID, slotSharingGroup, OneInputStreamTask.class, operatorObject, operatorName);
}
这里的OneInputStreamTask.class
即为生成的StreamNode的vertexClass。这个值会一直传递,当StreamGraph被转化成JobGraph的时候,这个值会被传递到JobVertex的invokableClass。然后当JobGraph被转成ExecutionGraph的时候,这个值被传入到ExecutionJobVertex.TaskInformation.invokableClassName中,一直传到Task中。
那么用户真正写的逻辑代码在哪里呢?比如word count中的Tokenizer,去了哪里呢?
OneInputStreamTask的基类StreamTask,包含了headOperator和operatorChain。当我们调用dataStream.flatMap(new Tokenizer())
的时候,会生成一个StreamFlatMap的operator,这个operator是一个AbstractUdfStreamOperator,而用户的代码new Tokenizer
,即为它的userFunction。
所以再串回来,以OneInputStreamTask为例,Task的核心执行代码即为OneInputStreamTask.invoke
方法,它会调用StreamTask.run
方法,这是个抽象方法,最终会调用其派生类的run方法,即OneInputStreamTask, SourceStreamTask等。
OneInputStreamTask的run方法代码如下:
final OneInputStreamOperator<IN, OUT> operator = this.headOperator;
final StreamInputProcessor<IN> inputProcessor = this.inputProcessor;
final Object lock = getCheckpointLock();
<span class="hljs-keyword">while</span> (running && inputProcessor.processInput(operator, lock)) {
<span class="hljs-comment">// all the work happens in the "processInput" method</span>
}</code></pre>
就是一直不停地循环调用inputProcessor.processInput(operator, lock)
方法,即StreamInputProcessor.processInput
方法:
public boolean processInput(OneInputStreamOperator<IN, ?> streamOperator, final Object lock) throws Exception {
// ...
<span class="hljs-keyword">while</span> (<span class="hljs-keyword">true</span>) {
<span class="hljs-keyword">if</span> (currentRecordDeserializer != <span class="hljs-keyword">null</span>) {
<span class="hljs-comment">// ...</span>
<span class="hljs-keyword">if</span> (result.isFullRecord()) {
StreamElement recordOrMark = deserializationDelegate.getInstance();
<span class="hljs-comment">// 处理watermark,则框架处理</span>
<span class="hljs-keyword">if</span> (recordOrMark.isWatermark()) {
<span class="hljs-comment">// watermark处理逻辑</span>
<span class="hljs-comment">// ...</span>
<span class="hljs-keyword">continue</span>;
} <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span>(recordOrMark.isLatencyMarker()) {
<span class="hljs-comment">// 处理latency mark,也是由框架处理</span>
<span class="hljs-keyword">synchronized</span> (lock) {
streamOperator.processLatencyMarker(recordOrMark.asLatencyMarker());
}
<span class="hljs-keyword">continue</span>;
} <span class="hljs-keyword">else</span> {
<span class="hljs-comment">// ***** 这里是真正的用户逻辑代码 *****</span>
StreamRecord<IN> record = recordOrMark.asRecord();
<span class="hljs-keyword">synchronized</span> (lock) {
numRecordsIn.inc();
streamOperator.setKeyContextElement1(record);
streamOperator.processElement(record);
}
<span class="hljs-keyword">return</span> <span class="hljs-keyword">true</span>;
}
}
}
<span class="hljs-comment">// 其他处理逻辑</span>
<span class="hljs-comment">// ...</span>
}
}</code></pre>
上面的代码中,streamOperator.processElement(record);
才是真正处理用户逻辑的代码,以StreamFlatMap为例,即为它的processElement方法:
public void processElement(StreamRecord<IN> element) throws Exception {
collector.setTimestamp(element);
userFunction.flatMap(element.getValue(), collector);
}
这样,整个调度和执行逻辑就全部串起来啦。