当完成DataStream的配置后,调用其中环境上下文StreamExecutionEnvironment的getStreamGrahph()方法即可生成关于该DataStream的streamGraph。
@Internal
public StreamGraph getStreamGraph() {
if (transformations.size() <= 0) {
throw new IllegalStateException("No operators defined in streaming topology. Cannot execute.");
}
return StreamGraphGenerator.generate(this, transformations);
}
public static StreamGraph generate(StreamExecutionEnvironment env, List<StreamTransformation<?>> transformations) {
return new StreamGraphGenerator(env).generateInternal(transformations);
}
生成streamGraph实则是在StreamGraphGenerator的generate()方法中,参数除了调用的env本身之外还需要env当中表示被转换的DataStream中所有操作streamTransformation的数组。
之后会在generate()方法中会生成一个StreamGraphGenerator并调用generateInternal()方法根据DataStream中的所有transformation开始生成streamGraph。
private StreamGraph generateInternal(List<StreamTransformation<?>> transformations) {
for (StreamTransformation<?> transformation: transformations) {
transform(transformation);
}
return streamGraph;
}
此处会遍历所有的streamTransformation,依次调用transform()方法来转换至streamGraph。
在transform()方法中,会根据所有类型的transform分别处理,具体的代码如下。
if (transform instanceof OneInputTransformation<?, ?>) {
transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
} else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
} else if (transform instanceof SourceTransformation<?>) {
transformedIds = transformSource((SourceTransformation<?>) transform);
} else if (transform instanceof SinkTransformation<?>) {
transformedIds = transformSink((SinkTransformation<?>) transform);
} else if (transform instanceof UnionTransformation<?>) {
transformedIds = transformUnion((UnionTransformation<?>) transform);
} else if (transform instanceof SplitTransformation<?>) {
transformedIds = transformSplit((SplitTransformation<?>) transform);
} else if (transform instanceof SelectTransformation<?>) {
transformedIds = transformSelect((SelectTransformation<?>) transform);
} else if (transform instanceof FeedbackTransformation<?>) {
transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
} else if (transform instanceof CoFeedbackTransformation<?>) {
transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
} else if (transform instanceof PartitionTransformation<?>) {
transformedIds = transformPartition((PartitionTransformation<?>) transform);
} else if (transform instanceof SideOutputTransformation<?>) {
transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
} else {
throw new IllegalStateException("Unknown transformation: " + transform);
}
以一个map操作节点为例子,如果是map类型的transformation,那么将会在此处通过transformOneInputTransform()方法构造成streamGraph中的streamNode。
private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {
Collection<Integer> inputIds = transform(transform.getInput());
// the recursive call might have already transformed this
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);
streamGraph.addOperator(transform.getId(),
slotSharingGroup,
transform.getCoLocationGroupKey(),
transform.getOperator(),
transform.getInputType(),
transform.getOutputType(),
transform.getName());
if (transform.getStateKeySelector() != null) {
TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
}
streamGraph.setParallelism(transform.getId(), transform.getParallelism());
streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());
for (Integer inputId: inputIds) {
streamGraph.addEdge(inputId, transform.getId(), 0);
}
return Collections.singleton(transform.getId());
}
其中,在一开始,会得到该transformation的输入,尝试先于当前节点之前先转换其输入节点,由此,会一路往上回溯到source节点开始转换。
当其上游节点全部转换为streamGraph中的节点之后,就会开始当前节点的转换,而后,会根据determineSlotSharingGroup()方法确定该节点的slot共享名。
private String determineSlotSharingGroup(String specifiedGroup, Collection<Integer> inputIds) {
if (specifiedGroup != null) {
return specifiedGroup;
} else {
String inputGroup = null;
for (int id: inputIds) {
String inputGroupCandidate = streamGraph.getSlotSharingGroup(id);
if (inputGroup == null) {
inputGroup = inputGroupCandidate;
} else if (!inputGroup.equals(inputGroupCandidate)) {
return "default";
}
}
return inputGroup == null ? "default" : inputGroup;
}
}
在这里,如果没有专门制定相应的共享名,那么则会根据其上游输入节点进行确定,如果上游不一致,则会采用默认的default来指定。
在确定完共享名之后,streamGraph就会通过addOperaor()方法来将该节点加入到streamGraph中。在该方法中首先根据节点类型加入到stramGraph的streanNodes数组中,之后根据输入输出的类型得到相应的类型序列化对象设置到相应的节点中。
之后,会遍历当前节点的所有输入节点,调用addEdge()方法依次生成节点的边streamEdge。
public void addEdge(Integer upStreamVertexID, Integer downStreamVertexID, int typeNumber) {
addEdgeInternal(upStreamVertexID,
downStreamVertexID,
typeNumber,
null,
new ArrayList<String>(),
null);
}
private void addEdgeInternal(Integer upStreamVertexID,
Integer downStreamVertexID,
int typeNumber,
StreamPartitioner<?> partitioner,
List<String> outputNames,
OutputTag outputTag) {
if (virtualSideOutputNodes.containsKey(upStreamVertexID)) {
int virtualId = upStreamVertexID;
upStreamVertexID = virtualSideOutputNodes.get(virtualId).f0;
if (outputTag == null) {
outputTag = virtualSideOutputNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, null, outputTag);
} else if (virtualSelectNodes.containsKey(upStreamVertexID)) {
int virtualId = upStreamVertexID;
upStreamVertexID = virtualSelectNodes.get(virtualId).f0;
if (outputNames.isEmpty()) {
// selections that happen downstream override earlier selections
outputNames = virtualSelectNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
} else if (virtualPartitionNodes.containsKey(upStreamVertexID)) {
int virtualId = upStreamVertexID;
upStreamVertexID = virtualPartitionNodes.get(virtualId).f0;
if (partitioner == null) {
partitioner = virtualPartitionNodes.get(virtualId).f1;
}
addEdgeInternal(upStreamVertexID, downStreamVertexID, typeNumber, partitioner, outputNames, outputTag);
} else {
StreamNode upstreamNode = getStreamNode(upStreamVertexID);
StreamNode downstreamNode = getStreamNode(downStreamVertexID);
// If no partitioner was specified and the parallelism of upstream and downstream
// operator matches use forward partitioning, use rebalance otherwise.
if (partitioner == null && upstreamNode.getParallelism() == downstreamNode.getParallelism()) {
partitioner = new ForwardPartitioner<Object>();
} else if (partitioner == null) {
partitioner = new RebalancePartitioner<Object>();
}
if (partitioner instanceof ForwardPartitioner) {
if (upstreamNode.getParallelism() != downstreamNode.getParallelism()) {
throw new UnsupportedOperationException("Forward partitioning does not allow " +
"change of parallelism. Upstream operation: " + upstreamNode + " parallelism: " + upstreamNode.getParallelism() +
", downstream operation: " + downstreamNode + " parallelism: " + downstreamNode.getParallelism() +
" You must use another partitioning strategy, such as broadcast, rebalance, shuffle or global.");
}
}
StreamEdge edge = new StreamEdge(upstreamNode, downstreamNode, typeNumber, outputNames, partitioner, outputTag);
getStreamNode(edge.getSourceId()).addOutEdge(edge);
getStreamNode(edge.getTargetId()).addInEdge(edge);
}
}
此处,可以看到一条StreamEdge的组成主要有以下几个要素,起始节点,终点节点,输出分区类型,output选择器选择的字段名。
在生成了相应的streamEdge会分别加在起始节点的 out边和终点节点的input边上。
至此,一个节点被加入到streamGraph的流程结束。
之后以spilt,select类型的transform为例子。
private <T> Collection<Integer> transformSplit(SplitTransformation<T> split) {
StreamTransformation<T> input = split.getInput();
Collection<Integer> resultIds = transform(input);
// the recursive transform call might have transformed this already
if (alreadyTransformed.containsKey(split)) {
return alreadyTransformed.get(split);
}
for (int inputId : resultIds) {
streamGraph.addOutputSelector(inputId, split.getOutputSelector());
}
return resultIds;
}
spilt类型的transformation不会加入到streamGraph中,而是会将split中的outputSelector加入到上游节点中。
private <T> Collection<Integer> transformSelect(SelectTransformation<T> select) {
StreamTransformation<T> input = select.getInput();
Collection<Integer> resultIds = transform(input);
// the recursive transform might have already transformed this
if (alreadyTransformed.containsKey(select)) {
return alreadyTransformed.get(select);
}
List<Integer> virtualResultIds = new ArrayList<>();
for (int inputId : resultIds) {
int virtualId = StreamTransformation.getNewNodeId();
streamGraph.addVirtualSelectNode(inputId, virtualId, select.getSelectedNames());
virtualResultIds.add(virtualId);
}
return virtualResultIds;
}
而select类型的节点则会加入到streamGraph中的虚拟节点中,在之前的edge构造中,就将根据实际节点的id得到虚拟select节点,将相应的outputName加入到edge之中。
分区策略也将类似select的形式生成虚拟节点。
private <T> Collection<Integer> transformPartition(PartitionTransformation<T> partition) {
StreamTransformation<T> input = partition.getInput();
List<Integer> resultIds = new ArrayList<>();
Collection<Integer> transformedIds = transform(input);
for (Integer transformedId: transformedIds) {
int virtualId = StreamTransformation.getNewNodeId();
streamGraph.addVirtualPartitionNode(transformedId, virtualId, partition.getPartitioner());
resultIds.add(virtualId);
}
return resultIds;
}
类似的union也不会引起新的节点生成,只是会统计所有输入节点id,交由统一在下一个实际节点中一次获取。
最后生成的streamGraph可以转化为下方类似的json。
{"nodes":[{"id":1,"type":"Source: Collection Source","pact":"Data Source","contents":"Source: Collection Source","parallelism":1},{"id":3,"type":"Map","pact":"Operator","contents":"Map","parallelism":8,"predecessors":[{"id":1,"ship_strategy":"REBALANCE","side":"second"}]},{"id":7,"type":"Map","pact":"Operator","contents":"Map","parallelism":8,"predecessors":[{"id":3,"ship_strategy":"BROADCAST","side":"second"}]},{"id":9,"type":"Map","pact":"Operator","contents":"Map","parallelism":8,"predecessors":[{"id":3,"ship_strategy":"FORWARD","side":"second"}]},{"id":13,"type":"Map","pact":"Operator","contents":"Map","parallelism":8,"predecessors":[{"id":3,"ship_strategy":"FORWARD","side":"second"}]},{"id":17,"type":"Map","pact":"Operator","contents":"Map","parallelism":8,"predecessors":[{"id":3,"ship_strategy":"FORWARD","side":"second"}]},{"id":24,"type":"Map","pact":"Operator","contents":"Map","parallelism":8,"predecessors":[{"id":9,"ship_strategy":"BROADCAST","side":"second"},{"id":13,"ship_strategy":"GLOBAL","side":"second"},{"id":17,"ship_strategy":"SHUFFLE","side":"second"}]},{"id":8,"type":"Sink: Unnamed","pact":"Data Sink","contents":"Sink: Unnamed","parallelism":8,"predecessors":[{"id":7,"ship_strategy":"FORWARD","side":"second"}]},{"id":25,"type":"Sink: Unnamed","pact":"Data Sink","contents":"Sink: Unnamed","parallelism":8,"predecessors":[{"id":24,"ship_strategy":"FORWARD","side":"second"}]}]}
在flink给出的https://flink.apache.org/visualizer/ 上可以转化为可视化图。