
  • 1. Preface
  • 2. Code Walkthrough
  • 2.1. Fields
  • 2.2. Constructor
  • 2.3. Field accessors
  • 3. Sources
  • 3.1. StreamExecutionEnvironment#addSource
  • 3.2. DataStreamSource
  • 3.3. SingleOutputStreamOperator
  • 4. Operators
  • 4.1. union
  • 4.2. connect
  • 4.3. keyBy
  • 4.4. partitionCustom
  • 4.5. broadcast
  • 4.6. shuffle
  • 4.7. forward
  • 4.8. rebalance
  • 4.9. rescale
  • 4.10. global
  • 4.11. iterate
  • 4.12. map
  • 4.13. flatMap
  • 4.14. process
  • 4.15. filter
  • 4.16. project
  • 4.17. coGroup
  • 4.18. join
  • 4.19. countWindowAll
  • 4.20. windowAll
  • 4.21. assignTimestampsAndWatermarks
  • 5. Sinks
  • 5.1. print
  • 5.2. writeToSocket
  • 5.3. addSink
  • 6. transform


1. Preface

DataStream programs in Flink are regular programs that apply transformations to data streams (for example filtering, updating state, defining windows, aggregating). The streams are initially created from various sources (for example message queues, socket streams, files), and results are returned via sinks, which may for example write the data to files or to standard output. Flink programs run in a variety of contexts, standalone or embedded in other programs, and execution can happen in a local JVM or on a cluster of many machines.

Flink's Java and Scala DataStream APIs can turn any serializable object into a stream. Flink's built-in serializers cover

Basic types: String, Long, Integer, Boolean, Array
Composite types: Tuples, POJOs, and Scala case classes

Flink hands all other types to Kryo. Other serializers can also be used with Flink; Avro, in particular, is well supported.

Every Flink application needs an execution environment, env in the examples here. Streaming applications use a StreamExecutionEnvironment.

The DataStream API builds your application into a job graph attached to the StreamExecutionEnvironment. When env.execute() is called, the graph is packaged and sent to the JobManager, which parallelizes the job and distributes its subtasks to TaskManagers for execution. Each parallel subtask of the job runs in a task slot.

Note that if execute() is never called, the application will not run.
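
A minimal, self-contained sketch of such a program (the host and port are placeholders):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalJob {
    public static void main(String[] args) throws Exception {
        // the execution environment; "env" in the text above
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // source: text lines from a socket
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // transformation: drop empty lines
        DataStream<String> nonEmpty = lines.filter(line -> !line.isEmpty());

        // sink: print on the TaskManager's stdout
        nonEmpty.print();

        // nothing runs until the job graph is shipped to the JobManager
        env.execute("Minimal DataStream job");
    }
}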


2. Code Walkthrough

All of these transformations ultimately live on the DataStream class, so this section walks through DataStream itself.


2.1. Fields

DataStream has only two fields: a StreamExecutionEnvironment environment and a Transformation<T> transformation.

protected final StreamExecutionEnvironment environment;

protected final Transformation<T> transformation;

2.2. Constructor

/**
     * Create a new {@link DataStream} in the given execution environment with partitioning set to
     * forward by default.
     *
     * @param environment The StreamExecutionEnvironment
     */
    public DataStream(StreamExecutionEnvironment environment, Transformation<T> transformation) {
        this.environment =
                Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
        this.transformation =
                Preconditions.checkNotNull(
                        transformation, "Stream Transformation must not be null.");
    }

2.3. Field accessors

A handful of getters expose these fields and the settings derived from them:

  • getId : returns the id of this DataStream (taken from its Transformation)
  • getParallelism : returns the parallelism of this operator
  • getMinResources : returns the minimum resources of this operator
  • getPreferredResources : returns the preferred resources of this operator
  • getType : returns the output type of this DataStream
  • clean : closure-cleans a user function so that it can be serialized
  • getExecutionEnvironment : returns the StreamExecutionEnvironment
  • getExecutionConfig : returns the ExecutionConfig held by the StreamExecutionEnvironment
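
A quick sketch of these accessors in action, assuming env is a StreamExecutionEnvironment as in the preface:

DataStream<String> stream = env.socketTextStream("localhost", 9999);

int id = stream.getId();                          // id of the underlying Transformation
int parallelism = stream.getParallelism();        // parallelism of this operator
TypeInformation<String> type = stream.getType();  // output type, String here
ExecutionConfig config = stream.getExecutionConfig();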

3. Sources

3.1. StreamExecutionEnvironment#addSource

The StreamExecutionEnvironment#addSource method attaches a data source to the job and returns a DataStreamSource.

/**
     * Adds a data source with a custom type information thus opening a {@link DataStream}. Only in
     * very special cases does the user need to support type information. Otherwise use {@link
     * #addSource(org.apache.flink.streaming.api.functions.source.SourceFunction)}
     *
     * @param function the user defined function
     * @param sourceName Name of the data source
     * @param <OUT> type of the returned stream
     * @param typeInfo the user defined type information for the stream
     * @return the data stream constructed
     */
    public <OUT> DataStreamSource<OUT> addSource(
            SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {
        return addSource(function, sourceName, typeInfo, Boundedness.CONTINUOUS_UNBOUNDED);
    }

    // addSource attaches a data source to the job.
    // By default a source is non-parallel;
    // for a parallel source the user implements ParallelSourceFunction
    // or extends RichParallelSourceFunction.
    //
    // addSource wraps the SourceFunction in a StreamSource operator.
    // When the source starts, SourceFunction#run(SourceContext<T> ctx) is invoked
    // and continuously emits records through the SourceContext.
    private <OUT> DataStreamSource<OUT> addSource(
            // e.g. SocketTextStreamFunction
            final SourceFunction<OUT> function,
            // e.g. "Socket Stream"
            final String sourceName,
            // null
            @Nullable final TypeInformation<OUT> typeInfo,
            // unbounded: Boundedness.CONTINUOUS_UNBOUNDED
            final Boundedness boundedness) {
        // Boundedness is an enum: a source is either bounded [BOUNDED] or unbounded [CONTINUOUS_UNBOUNDED]
        checkNotNull(function);
        checkNotNull(sourceName);
        checkNotNull(boundedness);

        // resolve the output type of the source (String in the socket example)
        TypeInformation<OUT> resolvedTypeInfo =
                getTypeInfo(function, sourceName, SourceFunction.class, typeInfo);

        // is this a parallel SourceFunction? (false here)
        boolean isParallel = function instanceof ParallelSourceFunction;

        // closure-clean the function
        clean(function);

        // build the source operator (fairly involved; covered separately)
        final StreamSource<OUT, ?> sourceOperator = new StreamSource<>(function);

        // build the DataStreamSource
        return new DataStreamSource<>(
                this, resolvedTypeInfo, sourceOperator, isParallel, sourceName, boundedness);
    }
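
For contrast, here is the user side of the same path: a hand-written parallel source passed to the public two-argument addSource overload (a sketch; the source and its name are made up):

DataStreamSource<Long> numbers = env.addSource(
        new ParallelSourceFunction<Long>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceFunction.SourceContext<Long> ctx) throws Exception {
                long n = 0;
                while (running) {
                    ctx.collect(n++); // emitted through the SourceContext, as noted above
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        },
        "Number Source");

// legal only because the function implements ParallelSourceFunction
numbers.setParallelism(4);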

So addSource ultimately constructs a DataStreamSource. What is that exactly? Let's take a quick look.

3.2. DataStreamSource



StreamExecutionEnvironment#addSource ends up newing up a DataStreamSource object.

DataStreamSource extends SingleOutputStreamOperator, and SingleOutputStreamOperator in turn extends DataStream.

DataStreamSource holds a single isParallel field; everything else is constructors.

public DataStreamSource(
            StreamExecutionEnvironment environment,
            TypeInformation<T> outTypeInfo,
            StreamSource<T, ?> operator,
            boolean isParallel,
            String sourceName) {
        this(
                environment,
                outTypeInfo,
                operator,
                isParallel,
                sourceName,
                Boundedness.CONTINUOUS_UNBOUNDED);
    }

    /** The constructor used to create legacy sources. */
    public DataStreamSource(
            StreamExecutionEnvironment environment,
            TypeInformation<T> outTypeInfo,
            StreamSource<T, ?> operator,
            boolean isParallel,
            String sourceName,
            Boundedness boundedness) {
        super(
                environment,
                new LegacySourceTransformation<>(
                        sourceName,
                        operator,
                        outTypeInfo,
                        environment.getParallelism(),
                        boundedness));

        this.isParallel = isParallel;
        if (!isParallel) {
            setParallelism(1);
        }
    }

    /**
     * Constructor for "deep" sources that manually set up (one or more) custom configured complex
     * operators.
     */
    public DataStreamSource(SingleOutputStreamOperator<T> operator) {
        super(operator.environment, operator.getTransformation());
        this.isParallel = true;
    }

    /** Constructor for new Sources (FLIP-27). */
    public DataStreamSource(
            StreamExecutionEnvironment environment,
            Source<T, ?, ?> source,
            WatermarkStrategy<T> watermarkStrategy,
            TypeInformation<T> outTypeInfo,
            String sourceName) {
        super(
                environment,
                new SourceTransformation<>(
                        sourceName,
                        source,
                        watermarkStrategy,
                        outTypeInfo,
                        environment.getParallelism()));
        this.isParallel = true;
    }

3.3. SingleOutputStreamOperator

SingleOutputStreamOperator extends DataStream, so it carries the protected final Transformation<T> transformation; field.

Most of its methods operate on that transformation: setting the name, resources, parallelism, uid, buffer timeout, and so on.

/**
 * {@code SingleOutputStreamOperator} represents a user defined transformation applied on a {@link
 * DataStream} with one predefined output type.
 *
 * @param <T> The type of the elements in this stream.
 */

SingleOutputStreamOperator adds one field of its own, private Map<OutputTag<?>, TypeInformation<?>> requestedSideOutputs = new HashMap<>();, which tracks the side outputs that have already been requested:

/**
     * We keep track of the side outputs that were already requested and their types.
     *
     * <p>With this, we can catch the case when a side output with a matching id is requested for a
     * different type because this would lead to problems at runtime.
     */
    private Map<OutputTag<?>, TypeInformation<?>> requestedSideOutputs = new HashMap<>();

    protected SingleOutputStreamOperator(
            StreamExecutionEnvironment environment, Transformation<T> transformation) {
        super(environment, transformation);
    }
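
In practice these setters are simply chained on the result of a transformation; a small sketch (assuming lines is a DataStream<String>, names are illustrative):

SingleOutputStreamOperator<String> upper = lines
        .map(String::toUpperCase)   // map returns a SingleOutputStreamOperator
        .name("to-upper")           // display name in the web UI and logs
        .uid("to-upper-uid")        // stable id, important for savepoint compatibility
        .setParallelism(2);         // all of these mutate the wrapped Transformation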

4. Operators

DataStream wraps a large number of operators, and nearly all of them call transform to perform the actual conversion (see section 6).


4.1. union

union creates a new DataStream by merging the outputs of several DataStreams of the same type.
Streams merged with this operator are transformed simultaneously.

/**
     * Creates a new {@link DataStream} by merging {@link DataStream} outputs of the same type with
     * each other.
     *
     * The DataStreams merged using this operator will be transformed simultaneously.
     *
     * @param streams The DataStreams to union output with.
     * @return The {@link DataStream}.
     */
    @SafeVarargs
    public final DataStream<T> union(DataStream<T>... streams) {
        List<Transformation<T>> unionedTransforms = new ArrayList<>();
        unionedTransforms.add(this.transformation);

        // collect the transformations of all input streams, enforcing equal types
        for (DataStream<T> newStream : streams) {
            if (!getType().equals(newStream.getType())) {
                throw new IllegalArgumentException(
                        "Cannot union streams of different types: "
                                + getType()
                                + " and "
                                + newStream.getType());
            }

            unionedTransforms.add(newStream.getTransformation());
        }
        // build a new DataStream over a UnionTransformation
        return new DataStream<>(this.environment, new UnionTransformation<>(unionedTransforms));
    }
  • UnionTransformation
    This transformation represents the union of several input {@link Transformation Transformations}.
/**
     * Creates a new {@code UnionTransformation} from the given input {@code Transformations}.
     *
     * <p>The input {@code Transformations} must all have the same type.
     *
     * @param inputs The list of input {@code Transformations}
     */
    public UnionTransformation(List<Transformation<T>> inputs) {
        super("Union", inputs.get(0).getOutputType(), inputs.get(0).getParallelism());

        // validate that every input Transformation has the same output type
        for (Transformation<T> input : inputs) {
            if (!input.getOutputType().equals(getOutputType())) {
                throw new UnsupportedOperationException("Type mismatch in input " + input);
            }
        }

        this.inputs = Lists.newArrayList(inputs);
    }
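
Usage is straightforward; a sketch with three small in-memory streams:

DataStream<String> first = env.fromElements("a1", "a2");
DataStream<String> second = env.fromElements("b1", "b2");
DataStream<String> third = env.fromElements("c1", "c2");

// all inputs must have the same type, otherwise union throws IllegalArgumentException
DataStream<String> all = first.union(second, third);
all.print();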

4.2. connect

connect creates a new ConnectedStreams by connecting the outputs of two DataStreams of (possibly) different types.
DataStreams connected this way can have joint transformations applied through CoFunctions.

/**
     * Creates a new {@link ConnectedStreams} by connecting {@link DataStream} outputs of (possible)
     * different types with each other.
     *
     * The DataStreams connected using this operator can be used
     * with CoFunctions to apply joint transformations.
     *
     * @param dataStream The DataStream with which this stream will be connected.
     * @return The {@link ConnectedStreams}.
     */
    public <R> ConnectedStreams<T, R> connect(DataStream<R> dataStream) {
        return new ConnectedStreams<>(environment, this, dataStream);
    }
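
A sketch of connecting two differently typed streams and applying a CoMapFunction:

DataStream<String> control = env.fromElements("pause", "resume");
DataStream<Long> data = env.fromElements(1L, 2L, 3L);

ConnectedStreams<String, Long> connected = control.connect(data);

// a CoMapFunction handles both inputs, each keeping its own type
DataStream<String> merged = connected.map(new CoMapFunction<String, Long, String>() {
    @Override
    public String map1(String value) {
        return "control: " + value;
    }

    @Override
    public String map2(Long value) {
        return "data: " + value;
    }
});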

4.3. keyBy

keyBy creates a new KeyedStream that uses the provided key to partition the operator's state.

/**
     * It creates a new {@link KeyedStream} that uses the provided key for partitioning its operator states.
     *
     * @param key The KeySelector to be used for extracting the key for partitioning
     * @return The {@link DataStream} with partitioned state (i.e. KeyedStream)
     */
    public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
        Preconditions.checkNotNull(key);
        return new KeyedStream<>(this, clean(key));
    }

    /**
     * It creates a new {@link KeyedStream} that uses the provided key with explicit type
     * information for partitioning its operator states.
     *
     * @param key The KeySelector to be used for extracting the key for partitioning.
     * @param keyType The type information describing the key type.
     * @return The {@link DataStream} with partitioned state (i.e. KeyedStream)
     */
    public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key, TypeInformation<K> keyType) {
        Preconditions.checkNotNull(key);
        Preconditions.checkNotNull(keyType);
        return new KeyedStream<>(this, clean(key), keyType);
    }

    // ... the remaining keyBy overloads are omitted ...
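
A sketch of the common KeySelector form:

DataStream<Tuple2<String, Integer>> counts = env.fromElements(
        Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

// key by the first tuple field; state and windows are then scoped per key
KeyedStream<Tuple2<String, Integer>, String> keyed = counts.keyBy(t -> t.f0);

keyed.sum(1).print(); // running sum per key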

4.4. partitionCustom

/**
     * Partitions a DataStream on the key returned by the selector, using a custom partitioner.
     *
     * This method takes the key selector to get the key to partition on, and a partitioner that accepts
     * the key type.
     *
     * <p>Note: This method works only on single field keys, i.e. the selector cannot return tuples
     * of fields.
     *
     * @param partitioner The partitioner to assign partitions to keys.
     * @param keySelector The KeySelector with which the DataStream is partitioned.
     * @return The partitioned DataStream.
     * @see KeySelector
     */
    public <K> DataStream<T> partitionCustom(
            Partitioner<K> partitioner, KeySelector<T, K> keySelector) {
        return setConnectionType(
                new CustomPartitionerWrapper<>(clean(partitioner), clean(keySelector)));
    }

    //	private helper method for custom partitioning
    private <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, Keys<T> keys) {
        KeySelector<T, K> keySelector =
                KeySelectorUtil.getSelectorForOneKey(
                        keys, partitioner, getType(), getExecutionConfig());

        return setConnectionType(
                new CustomPartitionerWrapper<>(clean(partitioner), clean(keySelector)));
    }
  • setConnectionType
    Note that setConnectionType returns a new DataStream wrapping a PartitionTransformation.
/**
     * Internal function for setting the partitioner for the DataStream.
     *
     * @param partitioner Partitioner to set.
     * @return The modified DataStream.
     */
    protected DataStream<T> setConnectionType(StreamPartitioner<T> partitioner) {
        return new DataStream<>(
                this.getExecutionEnvironment(),
                new PartitionTransformation<>(this.getTransformation(), partitioner));
    }
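
A sketch with a hypothetical hash-based Partitioner and a single-field key selector:

DataStream<Tuple2<String, Integer>> events = env.fromElements(
        Tuple2.of("a", 1), Tuple2.of("b", 2));

DataStream<Tuple2<String, Integer>> partitioned = events.partitionCustom(
        new Partitioner<String>() {
            @Override
            public int partition(String key, int numPartitions) {
                // illustrative scheme: partition derived from the key's hash
                return Math.abs(key.hashCode()) % numPartitions;
            }
        },
        t -> t.f0); // single-field key, as the Javadoc requires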

4.5. broadcast

/**
     * Sets the partitioning of the {@link DataStream} so that the output elements are broadcasted
     * to every parallel instance of the next operation.
     *
     * @return The DataStream with broadcast partitioning set.
     */
    public DataStream<T> broadcast() {
        return setConnectionType(new BroadcastPartitioner<T>());
    }

/**
     * Sets the partitioning of the {@link DataStream} so that the output elements are broadcasted
     * to every parallel instance of the next operation.
     *
     * <p>In addition, it implicitly creates as many {@link
     * org.apache.flink.api.common.state.BroadcastState broadcast states} as the specified
     * descriptors, which can be used to store the elements of the stream.
     *
     * @param broadcastStateDescriptors the descriptors of the broadcast states to create.
     * @return A {@link BroadcastStream} which can be used in the {@link #connect(BroadcastStream)}
     *     to create a {@link BroadcastConnectedStream} for further processing of the elements.
     */
    @PublicEvolving
    public BroadcastStream<T> broadcast(
            final MapStateDescriptor<?, ?>... broadcastStateDescriptors) {
        Preconditions.checkNotNull(broadcastStateDescriptors);
        final DataStream<T> broadcastStream = setConnectionType(new BroadcastPartitioner<>());
        return new BroadcastStream<>(environment, broadcastStream, broadcastStateDescriptors);
    }
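
A sketch of the broadcast-state variant, assuming events and ruleStream are both DataStream<String>:

// descriptor for the broadcast state every parallel downstream task will hold
MapStateDescriptor<String, String> rulesDesc =
        new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

BroadcastStream<String> rules = ruleStream.broadcast(rulesDesc);

// connect the ordinary stream with the broadcast stream
events.connect(rules)
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String event, ReadOnlyContext ctx, Collector<String> out)
                    throws Exception {
                // the event side gets a read-only view of the broadcast state
                String rule = ctx.getBroadcastState(rulesDesc).get("active");
                out.collect(rule + ": " + event);
            }

            @Override
            public void processBroadcastElement(String rule, Context ctx, Collector<String> out)
                    throws Exception {
                // the broadcast side may update the state
                ctx.getBroadcastState(rulesDesc).put("active", rule);
            }
        });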

4.6. shuffle

shuffle sets the partitioning of the DataStream so that output elements are shuffled uniformly at random to the next operation.

/**
     * Sets the partitioning of the {@link DataStream} so that the output elements are shuffled
     * uniformly randomly to the next operation.
     *
     * @return The DataStream with shuffle partitioning set.
     */
    @PublicEvolving
    public DataStream<T> shuffle() {
        return setConnectionType(new ShufflePartitioner<T>());
    }

4.7. forward

forward uses a ForwardPartitioner, which forwards elements only to the locally running downstream operation.

/**
     * Sets the partitioning of the {@link DataStream} so that the output elements are forwarded to
     * the local subtask of the next operation.
     *
     * @return The DataStream with forward partitioning set.
     */
    public DataStream<T> forward() {
        return setConnectionType(new ForwardPartitioner<T>());
    }

4.8. rebalance

rebalance uses a RebalancePartitioner, which distributes elements evenly across the output channels in a round-robin fashion.

/**
     * Sets the partitioning of the {@link DataStream} so that the output elements are distributed
     * evenly to instances of the next operation in a round-robin fashion.
     *
     * @return The DataStream with rebalance partitioning set.
     */
    public DataStream<T> rebalance() {
        return setConnectionType(new RebalancePartitioner<T>());
    }

4.9. rescale

rescale sets the partitioning so that output elements are distributed evenly, round-robin, to a subset of instances of the next operation.

Which subset of downstream operations an upstream operation sends elements to depends on the parallelism of both the upstream and the downstream operation.

For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, one upstream operation distributes elements to two downstream operations while the other upstream operation distributes to the other two.

If, on the other hand, the downstream operation has parallelism 2 and the upstream operation has parallelism 4, two upstream operations distribute to one downstream operation while the other two upstream operations distribute to the other one.

When the parallelisms are not multiples of each other, one or more downstream operations will have a differing number of inputs from upstream operations.

/**
     * Sets the partitioning of the {@link DataStream} so that the output elements are distributed
     * evenly to a subset of instances of the next operation in a round-robin fashion.
     *
     * <p>The subset of downstream operations to which the upstream operation sends elements depends
     * on the degree of parallelism of both the upstream and downstream operation.
     *
     * For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then
     * one upstream operation would distribute elements to two downstream operations while the other
     * upstream operation would distribute to the other two downstream operations.
     *
     * If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism
     * 4 then two upstream operations will distribute to one downstream operation while the other
     * two upstream operations will distribute to the other downstream operations.
     *
     * <p>In cases where the different parallelisms are not multiples of each other one or several
     * downstream operations will have a differing number of inputs from upstream operations.
     *
     * @return The DataStream with rescale partitioning set.
     */
    @PublicEvolving
    public DataStream<T> rescale() {
        return setConnectionType(new RescalePartitioner<T>());
    }

4.10. global

global sends every output value to the first instance of the next processing operator.
Use this setting with care, since it can create a serious performance bottleneck in the application.

/**
     * Sets the partitioning of the {@link DataStream} so that the output values all go to the first
     * instance of the next processing operator. Use this setting with care since it might cause a
     * serious performance bottleneck in the application.
     *
     * @return The DataStream with shuffle partitioning set.
     */
    @PublicEvolving
    public DataStream<T> global() {
        return setConnectionType(new GlobalPartitioner<T>());
    }

4.11. iterate

iterate initiates an iterative part of the program that feeds back data streams.
The iterative part has to be closed by calling IterativeStream#closeWith(DataStream);
the transformation of this IterativeStream is the iteration head.
The data stream given to closeWith is fed back and used as the input of the iteration head.
The user can also use a feedback type different from the iteration input and treat the input and
feedback streams as ConnectedStreams by calling IterativeStream#withFeedbackType(TypeInformation).

A common usage pattern for streaming iterations is to use output splitting to send a part of the closing data stream to the head.

See ProcessFunction.Context#output(OutputTag, Object) for more information.

The iteration edge is partitioned the same way as the first input of the iteration head, unless it is changed in the IterativeStream#closeWith(DataStream) call.

By default a DataStream with an iteration never terminates, but the user can set a maximum waiting time for the iteration head via the maxWaitTime parameter. If no data is received within that time, the stream terminates.

/**
     * Initiates an iterative part of the program that feeds back data streams.
     *
     *
     * The iterative part  needs to be closed by calling {@link IterativeStream#closeWith(DataStream)}.
     *
     * The transformation of this IterativeStream will be the iteration head.
     *
     * The data stream given to the {@link IterativeStream#closeWith(DataStream)} method is the data stream that will be fed back and used as the input for the iteration head.
     *
     * The user can also use a feedback type different from the input of the iteration and
     * treat the input and feedback streams as a {@link ConnectedStreams} by calling {@link IterativeStream#withFeedbackType(TypeInformation)}
     *
     * <p>A common usage pattern for streaming iterations is to use output splitting to send a part
     * of the closing data stream to the head. Refer to {@link
     * ProcessFunction.Context#output(OutputTag, Object)} for more information.
     *
     * <p>The iteration edge will be partitioned the same way as the first input of the iteration
     * head unless it is changed in the {@link IterativeStream#closeWith(DataStream)} call.
     *
     * <p>By default a DataStream with iteration will never terminate, but the user can use the
     * maxWaitTime parameter to set a max waiting time for the iteration head. If no data received
     * in the set time, the stream terminates.
     *
     * @return The iterative data stream created.
     */
    @PublicEvolving
    public IterativeStream<T> iterate() {
        return new IterativeStream<>(this, 0);
    }
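
The canonical countdown example, sketched: values are decremented inside the loop, fed back while positive, and leave once they reach zero.

DataStream<Long> input = env.fromElements(5L, 10L);

// iteration head; terminate if no feedback data arrives for 5000 ms
IterativeStream<Long> iteration = input.iterate(5000);

DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
    @Override
    public Long map(Long value) {
        return value - 1;
    }
});

// elements still greater than zero are fed back into the iteration head
iteration.closeWith(minusOne.filter(v -> v > 0));

// elements that reached zero leave the loop
DataStream<Long> done = minusOne.filter(v -> v <= 0);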

4.12. map

map applies a Map transformation on a DataStream.
The transformation calls a MapFunction for each element of the stream.
Each MapFunction call returns exactly one element.

/**
     * Applies a Map transformation on a {@link DataStream}.
     *
     * The transformation calls a {@link MapFunction} for each element of the DataStream.
     *
     * Each MapFunction call returns exactly one element.
     *
     * The user can also extend {@link RichMapFunction} to gain access to other features
     * provided by the {@link org.apache.flink.api.common.functions.RichFunction} interface.
     *
     * @param mapper The MapFunction that is called for each element of the DataStream.
     * @param outputType {@link TypeInformation} for the result type of the function.
     * @param <R> output type
     * @return The transformed {@link DataStream}.
     */
    public <R> SingleOutputStreamOperator<R> map(
            MapFunction<T, R> mapper, TypeInformation<R> outputType) {
        return transform("Map", outputType, new StreamMap<>(clean(mapper)));
    }
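
A sketch of this two-argument overload; pinning the output type explicitly helps when type extraction cannot see through a lambda:

DataStream<String> words = env.fromElements("alpha", "beta");

DataStream<Integer> lengths = words.map(s -> s.length(), Types.INT);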

4.13. flatMap

flatMap applies a FlatMap transformation on a DataStream.
The transformation calls a FlatMapFunction for each element of the stream.
Each FlatMapFunction call can return any number of elements, including none.
The user can also extend RichFlatMapFunction to gain access to the org.apache.flink.api.common.functions.RichFunction interface.

/**
     * Applies a FlatMap transformation on a {@link DataStream}.
     *
     * The transformation calls a {@link FlatMapFunction} for each element of the DataStream.
     *
     * Each FlatMapFunction call can return any number of elements including none.
     *
     * The user can also extend {@link RichFlatMapFunction} to gain access to other features provided by the
     * {@link org.apache.flink.api.common.functions.RichFunction} interface.
     *
     * @param flatMapper The FlatMapFunction that is called for each element of the DataStream
     * @param outputType {@link TypeInformation} for the result type of the function.
     * @param <R> output type
     * @return The transformed {@link DataStream}.
     */
    public <R> SingleOutputStreamOperator<R> flatMap(
            FlatMapFunction<T, R> flatMapper, TypeInformation<R> outputType) {
        return transform("Flat Map", outputType, new StreamFlatMap<>(clean(flatMapper)));
    }
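
A sketch of the same overload for flatMap; one input element produces any number of outputs through the Collector:

DataStream<String> lines = env.fromElements("to be", "or not");

DataStream<String> words = lines.flatMap(
        (String line, Collector<String> out) -> {
            for (String w : line.split(" ")) {
                out.collect(w);
            }
        },
        Types.STRING);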

4.14. process

process applies the given ProcessFunction to the input stream, creating a transformed output stream.
The function is called for every element in the input stream and can produce zero or more output elements.

/**
     * Applies the given {@link ProcessFunction} on the input stream, thereby creating a transformed
     * output stream.
     *
     * <p>The function will be called for every element in the input streams and can produce zero or
     * more output elements.
     *
     * @param processFunction The {@link ProcessFunction} that is called for each element in the
     *     stream.
     * @param outputType {@link TypeInformation} for the result type of the function.
     * @param <R> The type of elements emitted by the {@code ProcessFunction}.
     * @return The transformed {@link DataStream}.
     */
    @Internal
    public <R> SingleOutputStreamOperator<R> process(
            ProcessFunction<T, R> processFunction, TypeInformation<R> outputType) {

        ProcessOperator<T, R> operator = new ProcessOperator<>(clean(processFunction));

        return transform("Process", outputType, operator);
    }
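
A sketch of a ProcessFunction that also exercises the side-output machinery from section 3.3 (lines is an assumed DataStream<String>, the tag name is made up):

// an OutputTag is created as an anonymous subclass so its type survives erasure
final OutputTag<String> rejected = new OutputTag<String>("rejected") {};

SingleOutputStreamOperator<Integer> parsed = lines.process(
        new ProcessFunction<String, Integer>() {
            @Override
            public void processElement(String value, Context ctx, Collector<Integer> out) {
                try {
                    out.collect(Integer.parseInt(value)); // main output
                } catch (NumberFormatException e) {
                    ctx.output(rejected, value); // routed to the side output
                }
            }
        });

// retrieval goes through requestedSideOutputs in SingleOutputStreamOperator
DataStream<String> bad = parsed.getSideOutput(rejected);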

4.15. filter

/**
     * Applies a Filter transformation on a {@link DataStream}. The transformation calls a {@link
     * FilterFunction} for each element of the DataStream and retains only those element for which
     * the function returns true. Elements for which the function returns false are filtered. The
     * user can also extend {@link RichFilterFunction} to gain access to other features provided by
     * the {@link org.apache.flink.api.common.functions.RichFunction} interface.
     *
     * @param filter The FilterFunction that is called for each element of the DataStream.
     * @return The filtered DataStream.
     */
    public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
        return transform("Filter", getType(), new StreamFilter<>(clean(filter)));
    }

public class StreamFilter<IN> extends AbstractUdfStreamOperator<IN, FilterFunction<IN>>
        implements OneInputStreamOperator<IN, IN> {

    private static final long serialVersionUID = 1L;

    public StreamFilter(FilterFunction<IN> filterFunction) {
        super(filterFunction);
        // set the chaining strategy: always chain to the predecessor when possible
        chainingStrategy = ChainingStrategy.ALWAYS;
    }

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // apply the user's predicate; forward the element only when it returns true
        if (userFunction.filter(element.getValue())) {
            output.collect(element);
        }
    }
}

4.16. project

/**
     * Initiates a Project transformation on a {@link Tuple} {@link DataStream}.<br>
     * <b>Note: Only Tuple DataStreams can be projected.</b>
     *
     * <p>The transformation projects each Tuple of the stream onto a (sub)set of fields.
     *
     * @param fieldIndexes The field indexes of the input tuples that are retained. The order of
     *     fields in the output tuple corresponds to the order of field indexes.
     * @return The projected DataStream
     * @see Tuple
     * @see DataStream
     */
    @PublicEvolving
    public <R extends Tuple> SingleOutputStreamOperator<R> project(int... fieldIndexes) {
        return new StreamProjection<>(this, fieldIndexes).projectTupleX();
    }
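
A sketch: projecting a Tuple3 onto two of its fields, in a chosen order:

DataStream<Tuple3<Integer, Double, String>> in = env.fromElements(
        Tuple3.of(1, 2.0, "a"), Tuple3.of(3, 4.0, "b"));

// keep fields 2 and 0, in that order, as a Tuple2<String, Integer>
DataStream<Tuple2<String, Integer>> out = in.project(2, 0);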

4.17. coGroup

/**
     * Creates a co-group operation.
     *
     * See {@link CoGroupedStreams} for an example of how the keys and window can be specified.
     */
    public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
        return new CoGroupedStreams<>(this, otherStream);
    }

4.18. join

/**
     * Creates a join operation. See {@link JoinedStreams} for an example of how the keys and window
     * can be specified.
     */
    public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream) {
        return new JoinedStreams<>(this, otherStream);
    }

4.19. countWindowAll

/**
     * Windows this {@code DataStream} into sliding count windows.
     *
     * <p>Note: This operation is inherently non-parallel since all elements have to pass through
     * the same operator instance.
     *
     * @param size The size of the windows in number of elements.
     * @param slide The slide interval in number of elements.
     */
    public AllWindowedStream<T, GlobalWindow> countWindowAll(long size, long slide) {
        return windowAll(GlobalWindows.create())
                .evictor(CountEvictor.of(size))
                .trigger(CountTrigger.of(slide));
    }
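
So a sliding count window is really a GlobalWindows assigner plus a CountEvictor and a CountTrigger; a usage sketch:

DataStream<Tuple2<String, Integer>> counts = env.fromElements(
        Tuple2.of("a", 1), Tuple2.of("b", 2));

// every 10 elements (the trigger), sum field 1 over the last 100 elements (the evictor)
counts.countWindowAll(100, 10).sum(1).print();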

4.20. windowAll

/**
     * Windows this data stream to a {@code AllWindowedStream}, which evaluates windows over a non key grouped stream.
     *
     *
     * Elements are put into windows by a {@link org.apache.flink.streaming.api.windowing.assigners.WindowAssigner}.
     *
     * The grouping of elements  is done by window.
     *
     * <p>A {@link org.apache.flink.streaming.api.windowing.triggers.Trigger} can be defined to
     * specify when windows are evaluated.
     *
     * However, {@code WindowAssigners} have a default {@code Trigger} that is used if a {@code Trigger} is not specified.
     *
     * <p>Note: This operation is inherently non-parallel since all elements have to pass through
     * the same operator instance.
     *
     * @param assigner The {@code WindowAssigner} that assigns elements to windows.
     * @return The trigger windows data stream.
     */
    @PublicEvolving
    public <W extends Window> AllWindowedStream<T, W> windowAll(
            WindowAssigner<? super T, W> assigner) {
        return new AllWindowedStream<>(this, assigner);
    }
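
A sketch with a concrete assigner; the windowAll operator always runs with parallelism 1:

env.fromElements(1L, 2L, 3L)
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .reduce((a, b) -> a + b) // sum per 5-second tumbling window
        .print();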

4.21. assignTimestampsAndWatermarks

/**
     * Assigns timestamps to the elements in the data stream and generates watermarks to signal
     * event time progress.
     *
     * The given {@link WatermarkStrategy} is used to create a {@link TimestampAssigner} and {@link WatermarkGenerator}.
     *
     * <p>For each event in the data stream, the {@link TimestampAssigner#extractTimestamp(Object,
     * long)} method is called to assign an event timestamp.
     *
     * <p>For each event in the data stream, the {@link WatermarkGenerator#onEvent(Object, long,
     * WatermarkOutput)} will be called.
     *
     * <p>Periodically (defined by the {@link ExecutionConfig#getAutoWatermarkInterval()}), the
     * {@link WatermarkGenerator#onPeriodicEmit(WatermarkOutput)} method will be called.
     *
     * <p>Common watermark generation patterns can be found as static methods in the {@link
     * org.apache.flink.api.common.eventtime.WatermarkStrategy} class.
     *
     * @param watermarkStrategy The strategy to generate watermarks based on event timestamps.
     * @return The stream after the transformation, with assigned timestamps and watermarks.
     */
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
            WatermarkStrategy<T> watermarkStrategy) {
        // closure-clean the WatermarkStrategy
        final WatermarkStrategy<T> cleanedStrategy = clean(watermarkStrategy);

        // match parallelism to input, to have a 1:1 source -> timestamps/watermarks relationship and chain
        final int inputParallelism = getTransformation().getParallelism();

        // build the TimestampsAndWatermarksTransformation
        final TimestampsAndWatermarksTransformation<T> transformation =
                new TimestampsAndWatermarksTransformation<>(
                        "Timestamps/Watermarks",
                        inputParallelism,
                        getTransformation(),
                        cleanedStrategy);
        // register the transformation with the environment
        getExecutionEnvironment().addOperator(transformation);

        // return a SingleOutputStreamOperator over it
        return new SingleOutputStreamOperator<>(getExecutionEnvironment(), transformation);
    }
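
A sketch of the usual strategy, with the event timestamp carried in field f1:

DataStream<Tuple2<String, Long>> events = env.fromElements(
        Tuple2.of("a", 1_000L), Tuple2.of("b", 2_000L));

// tolerate up to 5 seconds of out-of-orderness
DataStream<Tuple2<String, Long>> withWatermarks = events.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTs) -> event.f1));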

5. Sinks

5.1. print

print constructs a PrintSinkFunction and returns a DataStreamSink via addSink.

/**
     * Writes a DataStream to the standard output stream (stdout).
     *
     * <p>For each element of the DataStream the result of {@link Object#toString()} is written.
     *
     * <p>NOTE: This will print to stdout on the machine where the code is executed, i.e. the Flink
     * worker.
     *
     * @return The closed DataStream.
     */
    @PublicEvolving
    public DataStreamSink<T> print() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>();
        return addSink(printFunction).name("Print to Std. Out");
    }

    /**
     * Writes a DataStream to the standard output stream (stderr).
     *
     * <p>For each element of the DataStream the result of {@link Object#toString()} is written.
     *
     * <p>NOTE: This will print to stderr on the machine where the code is executed, i.e. the Flink
     * worker.
     *
     * @return The closed DataStream.
     */
    @PublicEvolving
    public DataStreamSink<T> printToErr() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(true);
        return addSink(printFunction).name("Print to Std. Err");
    }

    /**
     * Writes a DataStream to the standard output stream (stdout).
     *
     * <p>For each element of the DataStream the result of {@link Object#toString()} is written.
     *
     * <p>NOTE: This will print to stdout on the machine where the code is executed, i.e. the Flink
     * worker.
     *
     * @param sinkIdentifier The string to prefix the output with.
     * @return The closed DataStream.
     */
    @PublicEvolving
    public DataStreamSink<T> print(String sinkIdentifier) {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(sinkIdentifier, false);
        return addSink(printFunction).name("Print to Std. Out");
    }

    /**
     * Writes a DataStream to the standard output stream (stderr).
     *
     * <p>For each element of the DataStream the result of {@link Object#toString()} is written.
     *
     * <p>NOTE: This will print to stderr on the machine where the code is executed, i.e. the Flink
     * worker.
     *
     * @param sinkIdentifier The string to prefix the output with.
     * @return The closed DataStream.
     */
    @PublicEvolving
    public DataStreamSink<T> printToErr(String sinkIdentifier) {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(sinkIdentifier, true);
        return addSink(printFunction).name("Print to Std. Err");
    }

5.2. writeToSocket

/**
     * Writes the DataStream to a socket as a byte array. The format of the output is specified by a
     * {@link SerializationSchema}.
     *
     * @param hostName host of the socket
     * @param port port of the socket
     * @param schema schema for serialization
     * @return the closed DataStream
     */
    @PublicEvolving
    public DataStreamSink<T> writeToSocket(
            String hostName, int port, SerializationSchema<T> schema) {
        DataStreamSink<T> returnStream = addSink(new SocketClientSink<>(hostName, port, schema, 0));
        returnStream.setParallelism(
                1); // It would not work if multiple instances would connect to the same port
        return returnStream;
    }

5.3. addSink

/**
     * Adds the given sink to this DataStream. Only streams with sinks added will be executed once
     * the {@link StreamExecutionEnvironment#execute()} method is called.
     *
     * @param sinkFunction The object containing the sink's invoke function.
     * @return The closed DataStream.
     */
    public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction) {

        // read the output type of the input Transform to coax out errors about MissingTypeInfo
        transformation.getOutputType();

        // configure the type if needed
        if (sinkFunction instanceof InputTypeConfigurable) {
            ((InputTypeConfigurable) sinkFunction).setInputType(getType(), getExecutionConfig());
        }

        StreamSink<T> sinkOperator = new StreamSink<>(clean(sinkFunction));

        DataStreamSink<T> sink = new DataStreamSink<>(this, sinkOperator);

        getExecutionEnvironment().addOperator(sink.getTransformation());
        return sink;
    }
  • StreamSink#processElement
@Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        sinkContext.element = element;
        userFunction.invoke(element.getValue(), sinkContext);
    }
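
A tiny custom sink, just to show where invoke fits (assuming lines is a DataStream<String>):

lines.addSink(new SinkFunction<String>() {
    @Override
    public void invoke(String value, Context context) {
        // called once per record from StreamSink#processElement above
        System.err.println("sink got: " + value);
    }
});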

6. transform

/**
     * Method for passing user defined operators along with the type information that will transform
     * the DataStream.
     *
     * @param operatorName name of the operator, for logging purposes
     * @param outTypeInfo the output type of the operator
     * @param operator the object containing the transformation logic
     * @param <R> type of the return stream
     * @return the data stream constructed
     * @see #transform(String, TypeInformation, OneInputStreamOperatorFactory)
     */
    @PublicEvolving
    public <R> SingleOutputStreamOperator<R> transform(
            String operatorName,
            TypeInformation<R> outTypeInfo,
            OneInputStreamOperator<T, R> operator) {

        //  operatorName:  Flat Map
        //  outTypeInfo : PojoType<org.apache.flink.streaming.examples.socket.SocketWindowWordCount$WordWithCount, fields = [count: Long, word: String]>
        //  operator : StreamFlatMap
        return doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator));
    }

    protected <R> SingleOutputStreamOperator<R> doTransform(
            String operatorName,
            TypeInformation<R> outTypeInfo,
            StreamOperatorFactory<R> operatorFactory) {

        // read the output type of the input Transform to coax out errors about MissingTypeInfo
        transformation.getOutputType();

        // build a OneInputTransformation (a single-input transformation)
        OneInputTransformation<T, R> resultTransform =
                new OneInputTransformation<>(
                        this.transformation,
                        operatorName,
                        operatorFactory,
                        outTypeInfo,
                        environment.getParallelism());

        @SuppressWarnings({"unchecked", "rawtypes"})
        SingleOutputStreamOperator<R> returnStream =
                new SingleOutputStreamOperator(environment, resultTransform);

        // add it to the List<Transformation<?>> transformations kept by the StreamExecutionEnvironment
        getExecutionEnvironment().addOperator(resultTransform);

        return returnStream;
    }
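
Since filter() itself just calls transform("Filter", getType(), new StreamFilter<>(...)), the same wiring can be done by hand; a sketch (assuming lines is a DataStream<String>):

FilterFunction<String> startsWithA = s -> s.startsWith("a");

// equivalent to lines.filter(startsWithA), but passing the operator explicitly
SingleOutputStreamOperator<String> filtered =
        lines.transform("My Filter", Types.STRING, new StreamFilter<>(startsWithA));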