[Source Code] Flink StreamGraph Generation Process

mb5fd868b989ae9 2021-02-07 14:53:43

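© Copyright belongs to the author: this is an original work by 51CTO blog author mb5fd868b989ae9. Contact the author for permission to repost; unauthorized reposting may be subject to legal action.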

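The core of a Flink StreamGraph is streamNodes, which holds the StreamNode (also called a vertex) generated for every operator. Each StreamNode in turn holds the edges (Edge) that connect it to other operators. The remaining virtual nodes are kept in three maps, virtualSelectNodes, virtualSideOutputNodes and virtualPartitionNodes, which record how the upstream and downstream physical nodes are connected. The sources and sinks sets record the StreamNode ids of the stream's sources and sinks.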

```java
private Map<Integer, StreamNode> streamNodes;
private Set<Integer> sources;
private Set<Integer> sinks;
private Map<Integer, Tuple2<Integer, List<String>>> virtualSelectNodes;
private Map<Integer, Tuple2<Integer, OutputTag>> virtualSideOutputNodes;
private Map<Integer, Tuple3<Integer, StreamPartitioner<?>, ShuffleMode>> virtualPartitionNodes;
```

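StreamGraph generation starts when user code calls env.execute(). execute(jobName) calls getStreamGraph(jobName) and passes the result to another execute overload. Everything below that overload is JobGraph generation, so the focus this time is the part carried out by getStreamGraph.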

StreamExecutionEnvironment.java

```java
public JobExecutionResult execute(String jobName) throws Exception {
  Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");

  // Call getStreamGraph with jobName to generate the StreamGraph,
  // then pass the StreamGraph to execute, which generates the JobGraph later on
  return execute(getStreamGraph(jobName));
}
```

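The StreamGraph is produced by StreamGraphGenerator. Creating the StreamGraphGenerator object is straightforward: all of the execution configuration is put into it, except for the deployment properties in env.configuration (DeploymentOptions).

transformations is the list of Transformation objects that all operators were converted into, and config is the ExecutionConfig. The rest is self-explanatory, so it is not described one by one.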

StreamExecutionEnvironment.java

```java
@Internal
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
  // First create the StreamGraphGenerator, then call generate to build the StreamGraph
  StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
  if (clearTransformations) {
    this.transformations.clear();
  }
  return streamGraph;
}

private StreamGraphGenerator getStreamGraphGenerator() {
  if (transformations.size() <= 0) {
    throw new IllegalStateException("No operators defined in streaming topology. Cannot execute.");
  }
  // Create the StreamGraphGenerator and put transformations/config/checkpointCfg/stateBackend
  // and the other settings into it
  return new StreamGraphGenerator(transformations, config, checkpointCfg)
    .setStateBackend(defaultStateBackend)
    .setChaining(isChainingEnabled)
    .setUserArtifacts(cacheFile)
    // time characteristic
    .setTimeCharacteristic(timeCharacteristic)
    .setDefaultBufferTimeout(bufferTimeout);
}
```
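To make the two behaviours of getStreamGraph easier to see (it refuses to build a graph when no operator was defined, and it optionally clears the transformation list afterwards), here is a minimal sketch in plain Java. ToyEnvironment and its string-based "graph" are hypothetical stand-ins invented for this illustration; they are not Flink classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for StreamExecutionEnvironment; not the Flink class.
class ToyEnvironment {
    // Stand-in for the list of Transformation objects collected from user code.
    final List<String> transformations = new ArrayList<>();

    // Mirrors the shape of getStreamGraph(jobName, clearTransformations).
    String getStreamGraph(String jobName, boolean clearTransformations) {
        if (transformations.isEmpty()) {
            throw new IllegalStateException(
                "No operators defined in streaming topology. Cannot execute.");
        }
        // The "graph" here is only a descriptive string built from the list.
        String graph = jobName + transformations;
        if (clearTransformations) {
            transformations.clear();
        }
        return graph;
    }

    public static void main(String[] args) {
        ToyEnvironment env = new ToyEnvironment();
        env.transformations.add("source");
        env.transformations.add("map");
        System.out.println(env.getStreamGraph("demo", true));
        // The list was cleared, so a second call would throw IllegalStateException.
        System.out.println(env.transformations.isEmpty());
    }
}
```

In this sketch a second call after clearing fails with the same "No operators defined" message, which is why the clearTransformations flag matters when the same environment is reused.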

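The generate step is also quite direct: it iterates over the transformations list and runs transform on each entry, turning each Transformation into a StreamNode (or, for a virtual node, into connection information only).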

StreamGraphGenerator.java

```java
// generate
public StreamGraph generate() {
  // First create the StreamGraph and put all configuration from the env into it
  streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
  streamGraph.setStateBackend(stateBackend);
  streamGraph.setChaining(chaining);
  streamGraph.setScheduleMode(scheduleMode);
  streamGraph.setUserArtifacts(userArtifacts);
  streamGraph.setTimeCharacteristic(timeCharacteristic);
  streamGraph.setJobName(jobName);
  streamGraph.setGlobalDataExchangeMode(globalDataExchangeMode);

  alreadyTransformed = new HashMap<>();
  // Iterate over the transformation list and convert every operator
  for (Transformation<?> transformation: transformations) {
    // Core logic of StreamGraph generation
    transform(transformation);
  }
  // Keep the built graph in a final local variable, because the generator's fields are reset below
  final StreamGraph builtStreamGraph = streamGraph;

  alreadyTransformed.clear();
  alreadyTransformed = null;
  streamGraph = null;
  // Return the generated StreamGraph
  return builtStreamGraph;
}
```
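The loop in generate relies on transform being safe to call repeatedly: inputs are converted recursively and alreadyTransformed prevents a shared upstream from being converted twice. The sketch below isolates just that traversal. ToyTransformation and ToyGenerator are hypothetical names made up for this illustration, and the "conversion" is reduced to recording an id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal transformation: an id plus its upstream inputs.
class ToyTransformation {
    final int id;
    final List<ToyTransformation> inputs;

    ToyTransformation(int id, ToyTransformation... inputs) {
        this.id = id;
        this.inputs = Arrays.asList(inputs);
    }
}

// Hypothetical generator that keeps only the traversal logic.
class ToyGenerator {
    // Plays the role of alreadyTransformed: caches the ids produced per transformation.
    private final Map<ToyTransformation, List<Integer>> alreadyTransformed = new HashMap<>();
    // Records the order in which transformations were actually converted.
    final List<Integer> conversionOrder = new ArrayList<>();

    List<Integer> transform(ToyTransformation t) {
        if (alreadyTransformed.containsKey(t)) {
            return alreadyTransformed.get(t); // already converted, reuse the result
        }
        for (ToyTransformation input : t.inputs) {
            transform(input); // inputs are converted before the node itself
        }
        conversionOrder.add(t.id);
        List<Integer> ids = Collections.singletonList(t.id);
        alreadyTransformed.put(t, ids);
        return ids;
    }

    void generate(List<ToyTransformation> transformations) {
        for (ToyTransformation t : transformations) {
            transform(t);
        }
    }

    public static void main(String[] args) {
        ToyTransformation source = new ToyTransformation(1);
        ToyTransformation map = new ToyTransformation(2, source);
        ToyTransformation filter = new ToyTransformation(3, source);
        ToyTransformation join = new ToyTransformation(4, map, filter);

        ToyGenerator generator = new ToyGenerator();
        // The source is deliberately left out of the list; it is reached through recursion.
        generator.generate(Arrays.asList(map, filter, join));
        System.out.println(generator.conversionOrder);
    }
}
```

Even though the source feeds both map and filter, and both of those feed join, each id shows up exactly once in conversionOrder, with the source first.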

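The transform method is the important part: it covers the conversion of every operator, and each type of Transformation is dispatched to its own method.

Physical nodes and virtual nodes (partition, side output, select) are handled differently. For a physical node, the generator creates a StreamNode (vertex), sets the serializers, specifies the input and output types, sets the keySelector, parallelism and max parallelism, and adds the input edges. A virtual node has no StreamNode; it only records the connection between its upstream and downstream physical nodes.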

```java
private Collection<Integer> transform(Transformation<?> transform) {

    // transform: type + id + name + outputType + partition
    // If this transformation was already converted, return the node ids recorded for it
    if (alreadyTransformed.containsKey(transform)) {
      return alreadyTransformed.get(transform);
    }

    LOG.debug("Transforming " + transform);
    // Set the max parallelism
    if (transform.getMaxParallelism() <= 0) {

      // if the max parallelism hasn't been set, then first use the job wide max parallelism
      // from the ExecutionConfig.
      int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
      // Only apply the job-wide max parallelism when it is greater than 0
      // (otherwise the default value is used)
      if (globalMaxParallelismFromConfig > 0) {
        transform.setMaxParallelism(globalMaxParallelismFromConfig);
      }
    }
    // Validate the output type; this fails if the type info is missing (MissingTypeInfo)
    // call at least once to trigger exceptions about MissingTypeInfo
    transform.getOutputType();

    // Handle the different types of transform
    Collection<Integer> transformedIds;
    if (transform instanceof OneInputTransformation<?, ?>) {
      // a single input
      transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
    } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
      // two inputs
      transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
    } else if (transform instanceof AbstractMultipleInputTransformation<?>) {
      // multiple inputs
      transformedIds = transformMultipleInputTransform((AbstractMultipleInputTransformation<?>) transform);
    } else if (transform instanceof SourceTransformation) {
      // source
      transformedIds = transformSource((SourceTransformation<?>) transform);
    } else if (transform instanceof LegacySourceTransformation<?>) {
      // legacy source
      transformedIds = transformLegacySource((LegacySourceTransformation<?>) transform);
    } else if (transform instanceof SinkTransformation<?>) {
      // sink
      transformedIds = transformSink((SinkTransformation<?>) transform);
    } else if (transform instanceof UnionTransformation<?>) {
      // union
      transformedIds = transformUnion((UnionTransformation<?>) transform);
    } else if (transform instanceof SplitTransformation<?>) {
      // split
      transformedIds = transformSplit((SplitTransformation<?>) transform);
    } else if (transform instanceof SelectTransformation<?>) {
      // select
      transformedIds = transformSelect((SelectTransformation<?>) transform);
    } else if (transform instanceof FeedbackTransformation<?>) {
      // feedback
      transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
    } else if (transform instanceof CoFeedbackTransformation<?>) {
      // co feedback
      transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
    } else if (transform instanceof PartitionTransformation<?>) {
      // partition
      transformedIds = transformPartition((PartitionTransformation<?>) transform);
    } else if (transform instanceof SideOutputTransformation<?>) {
      // side output
      transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
    } else {
      // anything else
      throw new IllegalStateException("Unknown transformation: " + transform);
    }
    // Add the transform to the map of already transformed transformations
    // need this check because the iterate transformation adds itself before
    // transforming the feedback edges
    if (!alreadyTransformed.containsKey(transform)) {
      alreadyTransformed.put(transform, transformedIds);
    }
    // Set the buffer timeout
    if (transform.getBufferTimeout() >= 0) {
      streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
    } else {
      streamGraph.setBufferTimeout(transform.getId(), defaultBufferTimeout);
    }
    // Set the UID of the transform
    if (transform.getUid() != null) {
      streamGraph.setTransformationUID(transform.getId(), transform.getUid());
    }
    // Set the node hash
    if (transform.getUserProvidedNodeHash() != null) {
      streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
    }
    // If auto-generated UIDs are disabled, every physical transformation
    // must carry either a UID or a user-provided hash
    if (!streamGraph.getExecutionConfig().hasAutoGeneratedUIDsEnabled()) {
      if (transform instanceof PhysicalTransformation &&
          transform.getUserProvidedNodeHash() == null &&
          transform.getUid() == null) {
        throw new IllegalStateException("Auto generated UIDs have been disabled " +
          "but no UID or hash has been assigned to operator " + transform.getName());
      }
    }
    // Set the StreamNode resources: min resources and preferred resources, which include
    // cpuCores/taskHeapMemory/taskOffHeapMemory/managedMemory/extendedResources
    if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
      streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
    }
    // Set the managedMemory weight
    streamGraph.setManagedMemoryWeight(transform.getId(), transform.getManagedMemoryWeight());

    return transformedIds;
}
```
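Two of the steps in transform are simple fallback rules: the max parallelism falls back to the job-wide value, and the buffer timeout falls back to the generator's default. Restated as standalone functions they look roughly like this. ToyDefaults and both method names are hypothetical helpers written for this illustration; Flink applies the same conditions inline rather than through such methods.

```java
// Hypothetical helper: restates the two fallback rules from transform as pure functions.
class ToyDefaults {
    // An operator-level max parallelism (> 0) wins. If it is unset (<= 0) and the
    // job-wide value is positive, the job-wide value is used. Otherwise it stays unset.
    static int resolveMaxParallelism(int operatorMax, int jobWideMax) {
        if (operatorMax <= 0 && jobWideMax > 0) {
            return jobWideMax;
        }
        return operatorMax;
    }

    // A non-negative operator buffer timeout wins; a negative one means "use the default".
    static long resolveBufferTimeout(long operatorTimeout, long defaultTimeout) {
        return operatorTimeout >= 0 ? operatorTimeout : defaultTimeout;
    }

    public static void main(String[] args) {
        System.out.println(resolveMaxParallelism(-1, 128));
        System.out.println(resolveMaxParallelism(64, 128));
        System.out.println(resolveBufferTimeout(-1L, 100L));
        System.out.println(resolveBufferTimeout(0L, 100L));
    }
}
```

Note the asymmetry between the two checks: 0 counts as "unset" for the max parallelism, while 0 is a valid explicit value for the buffer timeout.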
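### Source operator transform process
The source operator's transform process is also straightforward: it calls streamGraph.addSource directly, passing the source id, slotSharingGroup, output type and so on as arguments, to create the StreamNode for the source.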

```java
else if (transform instanceof SourceTransformation) {
  // source
  transformedIds = transformSource((SourceTransformation<?>) transform);
}
```

```java
private <T> Collection<Integer> transformSource(SourceTransformation<T> source) {
  // Determine the slotSharingGroup
  String slotSharingGroup = determineSlotSharingGroup(source.getSlotSharingGroup(), Collections.emptyList());

  // Add the source
  streamGraph.addSource(source.getId(),
      slotSharingGroup,
      source.getCoLocationGroupKey(),
      source.getOperatorFactory(),
      null,
      source.getOutputType(),
      "Source: " + source.getName());
  int parallelism = source.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT ?
      source.getParallelism() : executionConfig.getParallelism();
  // Set the parallelism
  streamGraph.setParallelism(source.getId(), parallelism);
  // Set the max parallelism
  streamGraph.setMaxParallelism(source.getId(), source.getMaxParallelism());
  // Return the source vertexID
  return Collections.singleton(source.getId());
}
```

addSource adds a StreamNode for the source and also puts the generated StreamNode id (vertexId) into the sources set.

```java
public <IN, OUT> void addSource(
    Integer vertexID,
    @Nullable String slotSharingGroup,
    @Nullable String coLocationGroup,
    SourceOperatorFactory<OUT> operatorFactory,
    TypeInformation<IN> inTypeInfo,
    TypeInformation<OUT> outTypeInfo,
    String operatorName) {
  // Add an operator
  addOperator(
      vertexID,
      slotSharingGroup,
      coLocationGroup,
      operatorFactory,
      inTypeInfo,
      outTypeInfo,
      operatorName,
      SourceOperatorStreamTask.class);
  // Add the id to the sources set
  sources.add(vertexID);
}
```
The addOperator method works in much the same way when it is called for other operators.

```java
public <IN, OUT> void addOperator(
    Integer vertexID,
    @Nullable String slotSharingGroup,
    @Nullable String coLocationGroup,
    StreamOperatorFactory<OUT> operatorFactory,
    TypeInformation<IN> inTypeInfo,
    TypeInformation<OUT> outTypeInfo,
    String operatorName) {
  // Choose the task class for the operator:
  // SourceStreamTask for a stream source, OneInputStreamTask otherwise
  Class<? extends AbstractInvokable> invokableClass =
      operatorFactory.isStreamSource() ? SourceStreamTask.class : OneInputStreamTask.class;
  // Add the operator
  addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorFactory, inTypeInfo,
      outTypeInfo, operatorName, invokableClass);
}

private <IN, OUT> void addOperator(
    Integer vertexID,
    @Nullable String slotSharingGroup,
    @Nullable String coLocationGroup,
    StreamOperatorFactory<OUT> operatorFactory,
    TypeInformation<IN> inTypeInfo,
    TypeInformation<OUT> outTypeInfo,
    String operatorName,
    Class<? extends AbstractInvokable> invokableClass) {
  // Create a StreamNode for the operator and add it to streamNodes, the core of the StreamGraph
  addNode(vertexID, slotSharingGroup, coLocationGroup, invokableClass, operatorFactory, operatorName);
  // Set the serializers for the StreamNode's input and output types
  setSerializers(vertexID, createSerializer(inTypeInfo), null, createSerializer(outTypeInfo));
  // If the StreamOperator factory accepts an output type configuration
  if (operatorFactory.isOutputTypeConfigurable() && outTypeInfo != null) {
    // sets the output type which must be know at StreamGraph creation time
    operatorFactory.setOutputType(outTypeInfo, executionConfig);
  }

  // If the StreamOperator factory accepts an input type configuration
  if (operatorFactory.isInputTypeConfigurable()) {
    operatorFactory.setInputType(inTypeInfo, executionConfig);
  }

  if (LOG.isDebugEnabled()) {
    LOG.debug("Vertex: {}", vertexID);
  }
}
```

addNode creates the StreamNode and adds it to the streamNodes map.

```java
// Create a StreamNode for the operator and add it to the StreamGraph's streamNodes
protected StreamNode addNode(
    Integer vertexID,
    @Nullable String slotSharingGroup,
    @Nullable String coLocationGroup,
    Class<? extends AbstractInvokable> vertexClass,
    StreamOperatorFactory<?> operatorFactory,
    String operatorName) {

  // If the id already exists, this node was processed before, which is an error:
  // throw a RuntimeException
  if (streamNodes.containsKey(vertexID)) {
    throw new RuntimeException("Duplicate vertexID " + vertexID);
  }

  // new StreamNode
  StreamNode vertex = new StreamNode(
      vertexID,
      slotSharingGroup,
      coLocationGroup,
      operatorFactory,
      operatorName,
      new ArrayList<OutputSelector<?>>(),
      vertexClass);
  // Add it to streamNodes
  streamNodes.put(vertexID, vertex);

  return vertex;
}
```
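The call chain addSource -> addOperator -> addNode can be boiled down to a few lines once serializers, slot sharing and task classes are left out. The sketch below is a simplified model under that assumption: ToyStreamGraph is a hypothetical class, and a node is represented only by its operator name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, stripped-down graph: a node is just its operator name.
class ToyStreamGraph {
    final Map<Integer, String> streamNodes = new HashMap<>();
    final Set<Integer> sources = new HashSet<>();

    // Registers a node and rejects an id that was already used.
    void addNode(int vertexId, String operatorName) {
        if (streamNodes.containsKey(vertexId)) {
            throw new RuntimeException("Duplicate vertexID " + vertexId);
        }
        streamNodes.put(vertexId, operatorName);
    }

    // In this toy, adding an operator is nothing more than adding its node.
    void addOperator(int vertexId, String operatorName) {
        addNode(vertexId, operatorName);
    }

    // A source is an ordinary operator whose id is additionally remembered in sources.
    void addSource(int vertexId, String operatorName) {
        addOperator(vertexId, operatorName);
        sources.add(vertexId);
    }

    public static void main(String[] args) {
        ToyStreamGraph graph = new ToyStreamGraph();
        graph.addSource(1, "Source: numbers");
        graph.addOperator(2, "Map");
        System.out.println(graph.streamNodes.keySet());
        System.out.println(graph.sources);
    }
}
```

Both ids end up in streamNodes, but only id 1 is in sources. Registering id 2 a second time would raise the "Duplicate vertexID" exception.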
	
### Physical node: OneInputTransformation transform process
Create the StreamNode and add it to streamNodes, add the input edges, and return the vertexId.

```java
if (transform instanceof OneInputTransformation<?, ?>) {
  // a single input
  transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
}
```

```java
/**
 * Transforms a {@code OneInputTransformation}.
 * (An operator with a single input.)
 *
 * <p>This recursively transforms the inputs, creates a new {@code StreamNode} in the graph and
 * wired the inputs to this new node.
 */
private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {

  // Run transform on the input transformation first, in case an upstream
  // branch has not been processed yet when there are several upstream operators
  Collection<Integer> inputIds = transform(transform.getInput());

  // Check whether this transformation was already added
  // the recursive call might have already transformed this
  if (alreadyTransformed.containsKey(transform)) {
    return alreadyTransformed.get(transform);
  }

  // Get the slotSharingGroup (taken from the inputs, or the default)
  String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);

  // Add the operator to streamNodes
  streamGraph.addOperator(transform.getId(),
      slotSharingGroup,
      transform.getCoLocationGroupKey(),
      transform.getOperatorFactory(),
      transform.getInputType(),
      transform.getOutputType(),
      transform.getName());
  // Check whether a KeySelector from keyBy is present
  if (transform.getStateKeySelector() != null) {
    // key serializer
    TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(executionConfig);
    // Set the state key serializer and KeySelector for the single input
    streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
  }
  // Get the parallelism
  int parallelism = transform.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT ?
      transform.getParallelism() : executionConfig.getParallelism();
  // Set the parallelism
  streamGraph.setParallelism(transform.getId(), parallelism);
  // Set the max parallelism
  streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());
  // Add an edge to the StreamNode for each input
  for (Integer inputId: inputIds) {
    streamGraph.addEdge(inputId, transform.getId(), 0);
  }
  // Return the id of this transform, which is also the vertexID
  return Collections.singleton(transform.getId());
}
```

This logic is clear enough that it needs no further comment; addOperator works much as it does for the source.
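For readers who prefer to see the wiring step on its own, the sketch below keeps only three things from transformOneInputTransform: the parallelism fallback, one edge per upstream id, and the singleton return value. ToyWiring is a hypothetical class, and the value -1 for PARALLELISM_DEFAULT is an assumption of this sketch that merely stands for "not set on the operator".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the wiring done for a one-input operator.
class ToyWiring {
    // Assumed sentinel meaning "parallelism not set on the operator".
    static final int PARALLELISM_DEFAULT = -1;

    final Map<Integer, Integer> parallelismByNode = new HashMap<>();
    // Each entry is {upstreamId, downstreamId}.
    final List<int[]> edges = new ArrayList<>();
    final int envParallelism;

    ToyWiring(int envParallelism) {
        this.envParallelism = envParallelism;
    }

    Collection<Integer> wireOneInput(int nodeId, int nodeParallelism, Collection<Integer> inputIds) {
        // The operator's own parallelism wins; otherwise fall back to the environment's.
        int parallelism = nodeParallelism != PARALLELISM_DEFAULT ? nodeParallelism : envParallelism;
        parallelismByNode.put(nodeId, parallelism);
        // One edge per upstream id returned by the recursive transform of the input.
        for (int inputId : inputIds) {
            edges.add(new int[] {inputId, nodeId});
        }
        // Downstream operators connect to this node through the returned id.
        return Collections.singleton(nodeId);
    }

    public static void main(String[] args) {
        ToyWiring wiring = new ToyWiring(4);
        Collection<Integer> ids = wiring.wireOneInput(3, PARALLELISM_DEFAULT, Arrays.asList(1, 2));
        System.out.println(ids + " parallelism=" + wiring.parallelismByNode.get(3)
            + " edges=" + wiring.edges.size());
    }
}
```

The input collection can hold more than one id even for a one-input operator, for example when the input is a union, which is why the edge step is a loop.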

### Virtual partition node: PartitionTransformation transform process

```java
else if (transform instanceof PartitionTransformation<?>) {
  // partition
  transformedIds = transformPartition((PartitionTransformation<?>) transform);
}
```

```java
private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
  // Get the input
  Transformation<T> input = partition.getInput();
  List<Integer> resultIds = new ArrayList<>();
  // Transform the input
  Collection<Integer> transformedIds = transform(input);
  // Add one virtual partition node for each input
  for (Integer transformedId: transformedIds) {
    int virtualId = Transformation.getNewNodeId();
    // Add the virtual partition node
    streamGraph.addVirtualPartitionNode(
        transformedId, virtualId, partition.getPartitioner(), partition.getShuffleMode());
    // Add it to the list of result ids that is returned
    resultIds.add(virtualId);
  }

  return resultIds;
}

public void addVirtualPartitionNode(
    Integer originalId,
    Integer virtualId,
    StreamPartitioner<?> partitioner,
    ShuffleMode shuffleMode) {

  // Check whether it was already added
  if (virtualPartitionNodes.containsKey(virtualId)) {
    throw new IllegalStateException("Already has virtual partition node with id " + virtualId);
  }
  // Add it
  virtualPartitionNodes.put(virtualId, new Tuple3<>(originalId, partitioner, shuffleMode));
}
```
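The article does not show what happens to a virtual id later, so the following is only a sketch of one way a virtual partition entry can be consumed when an edge is added: the virtual id is replaced by the physical upstream id it points to, and the stored partitioner is carried onto the edge. ToyVirtualGraph, the string partitioner names and the "default" label are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: virtual partition nodes are map entries, never real nodes.
class ToyVirtualGraph {
    // Hypothetical value type: the upstream id plus a partitioner name.
    static final class VirtualPartition {
        final int originalId;
        final String partitioner;

        VirtualPartition(int originalId, String partitioner) {
            this.originalId = originalId;
            this.partitioner = partitioner;
        }
    }

    final Map<Integer, VirtualPartition> virtualPartitionNodes = new HashMap<>();
    final List<String> edges = new ArrayList<>();

    void addVirtualPartitionNode(int originalId, int virtualId, String partitioner) {
        if (virtualPartitionNodes.containsKey(virtualId)) {
            throw new IllegalStateException("Already has virtual partition node with id " + virtualId);
        }
        virtualPartitionNodes.put(virtualId, new VirtualPartition(originalId, partitioner));
    }

    void addEdge(int upstreamId, int downstreamId) {
        addEdge(upstreamId, downstreamId, null);
    }

    private void addEdge(int upstreamId, int downstreamId, String partitioner) {
        VirtualPartition virtual = virtualPartitionNodes.get(upstreamId);
        if (virtual != null) {
            // Step through the virtual node to its upstream, remembering the partitioner.
            String chosen = partitioner != null ? partitioner : virtual.partitioner;
            addEdge(virtual.originalId, downstreamId, chosen);
            return;
        }
        // upstreamId is a physical node here, so the edge can be recorded.
        String label = partitioner != null ? partitioner : "default";
        edges.add(upstreamId + "->" + downstreamId + " [" + label + "]");
    }

    public static void main(String[] args) {
        ToyVirtualGraph graph = new ToyVirtualGraph();
        graph.addVirtualPartitionNode(1, 100, "hash");
        graph.addEdge(100, 2); // downstream connects to the virtual id
        graph.addEdge(1, 3);   // downstream connects to the physical id directly
        System.out.println(graph.edges);
    }
}
```

The recorded edges only ever mention physical ids, which matches the point above that a virtual node holds connection information and never becomes a StreamNode.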


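### union operator transform process
It only transforms all of the inputs and does nothing else. The union operator does not create a node: each stream in the union is handled separately and connected directly to the downstream node, rather than being merged first and then linked to the downstream node (the edges shown in the web UI confirm this).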

```java
else if (transform instanceof UnionTransformation<?>) {
  // union
  transformedIds = transformUnion((UnionTransformation<?>) transform);
}
```

```java
/**
 * Transforms a {@code UnionTransformation}.
 *
 * <p>This is easy, we only have to transform the inputs and return all the IDs in a list so
 * that downstream operations can connect to all upstream nodes.
 */
private <T> Collection<Integer> transformUnion(UnionTransformation<T> union) {
  List<Transformation<T>> inputs = union.getInputs();
  List<Integer> resultIds = new ArrayList<>();

  for (Transformation<T> input: inputs) {
    resultIds.addAll(transform(input));
  }

  return resultIds;
}
```
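A small sketch may help show why a union never appears as a node of its own: it only flattens the ids of its inputs, and whichever operator comes next draws one edge per id. ToyUnion and its two static methods are hypothetical; the ids stand in for results of already transformed inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a union: no node, only a flattened list of upstream ids.
class ToyUnion {
    // Collects the ids of all transformed inputs without creating anything new.
    static List<Integer> transformUnion(List<List<Integer>> transformedInputs) {
        List<Integer> resultIds = new ArrayList<>();
        for (List<Integer> ids : transformedInputs) {
            resultIds.addAll(ids);
        }
        return resultIds;
    }

    // The downstream operator connects to every id the union returned.
    static List<String> connect(List<Integer> inputIds, int downstreamId) {
        List<String> edges = new ArrayList<>();
        for (int inputId : inputIds) {
            edges.add(inputId + "->" + downstreamId);
        }
        return edges;
    }

    public static void main(String[] args) {
        List<Integer> unionIds = transformUnion(Arrays.asList(
            Arrays.asList(1), Arrays.asList(2), Arrays.asList(3)));
        System.out.println(connect(unionIds, 4));
    }
}
```

With three unioned streams and a downstream node 4, the result is three separate edges into node 4 and no intermediate node.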


### sink operator transform process

```java
else if (transform instanceof SinkTransformation<?>) {
  // sink
  transformedIds = transformSink((SinkTransformation<?>) transform);
}
```

```java
private <T> Collection<Integer> transformSink(SinkTransformation<T> sink) {
  // Transform the input operator of the sink
  Collection<Integer> inputIds = transform(sink.getInput());
  // Determine the slotSharingGroup
  String slotSharingGroup = determineSlotSharingGroup(sink.getSlotSharingGroup(), inputIds);
  // Add the sink
  streamGraph.addSink(sink.getId(),
      slotSharingGroup,
      sink.getCoLocationGroupKey(),
      sink.getOperatorFactory(),
      sink.getInput().getOutputType(),
      null,
      "Sink: " + sink.getName());
  // If the sink's StreamOperatorFactory wraps an OutputFormat, register that output format
  StreamOperatorFactory operatorFactory = sink.getOperatorFactory();
  if (operatorFactory instanceof OutputFormatOperatorFactory) {
    streamGraph.setOutputFormat(sink.getId(), ((OutputFormatOperatorFactory) operatorFactory).getOutputFormat());
  }
  // Set the parallelism and the max parallelism
  int parallelism = sink.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT ?
      sink.getParallelism() : executionConfig.getParallelism();
  streamGraph.setParallelism(sink.getId(), parallelism);
  streamGraph.setMaxParallelism(sink.getId(), sink.getMaxParallelism());
  // Add the input edges of the sink operator
  for (Integer inputId: inputIds) {
    streamGraph.addEdge(inputId, sink.getId(), 0);
  }
  // Set the keySelector
  if (sink.getStateKeySelector() != null) {
    TypeSerializer<?> keySerializer = sink.getStateKeyType().createSerializer(executionConfig);
    streamGraph.setOneInputStateKey(sink.getId(), sink.getStateKeySelector(), keySerializer);
  }
  // Return an empty list: this branch ends here
  return Collections.emptyList();
}

public <IN, OUT> void addSink(
    Integer vertexID,
    @Nullable String slotSharingGroup,
    @Nullable String coLocationGroup,
    StreamOperatorFactory<OUT> operatorFactory,
    TypeInformation<IN> inTypeInfo,
    TypeInformation<OUT> outTypeInfo,
    String operatorName) {
  // Add the sink StreamNode
  addOperator(vertexID, slotSharingGroup, coLocationGroup, operatorFactory, inTypeInfo, outTypeInfo, operatorName);
  // Add the sink StreamNode id to the sinks set
  sinks.add(vertexID);
}
```

At this point the transform process from Source to Sink is complete. A brief summary:

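1. The source StreamNode has no input; it is added to streamNodes and sources
2. The sink StreamNode returns no id, i.e. it has no downstream; it is added to streamNodes and sinks
3. A physical node creates a StreamNode, which is added to streamNodes
4. A virtual node does not create a StreamNode
5. The union operator has no node; it only returns the ids of the union's input operators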