I have to say, the translation just doesn't carry the same flavor as the original; if you can, it's better to read it in English.


Table of Contents

  • What is Apache Flink? — Applications
  • Building Blocks for Streaming Applications
  • Streams
  • State
  • Time
  • Layered APIs
  • The ProcessFunctions
  • The DataStream API
  • SQL & Table API
  • Libraries


What is Apache Flink? — Applications

Apache Flink is a framework for stateful computations over unbounded and bounded data streams.

Flink provides multiple APIs at different levels of abstraction and offers dedicated libraries for common use cases.

Here, we present Flink’s easy-to-use and expressive APIs and libraries.

Building Blocks for Streaming Applications

The types of applications that can be built with and executed by a stream processing framework are defined by how well the framework controls streams, state, and time.

In the following, we describe these building blocks for stream processing applications and explain Flink’s approaches to handle them.

Streams

Obviously, streams are a fundamental aspect of stream processing.

However, streams can have different characteristics that affect how a stream can and should be processed.

Flink is a versatile processing framework that can handle any kind of stream.

Bounded and unbounded streams: Streams can be unbounded or bounded, i.e., fixed-sized data sets.

Flink has sophisticated features to process unbounded streams, but also dedicated operators to efficiently process bounded streams.

Real-time and recorded streams: All data are generated as streams.

There are two ways to process the data.

Processing it in real-time as it is generated, or persisting the stream to a storage system, e.g., a file system or object store, and processing it later.

Flink applications can process recorded or real-time streams.
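
A minimal sketch of the two kinds of sources; the host, port, and file path below are placeholders, not anything prescribed by Flink:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// an unbounded, real-time stream: events keep arriving over a socket
DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

// a bounded, recorded stream: a file has a fixed size and a known end
DataStream<String> bounded = env.readTextFile("hdfs:///clicks/2019-06-01");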

State

Every non-trivial streaming application is stateful, i.e., only applications that apply transformations on individual events do not require state.

Any application that runs basic business logic needs to remember events or intermediate results to access them at a later point in time, for example when the next event is received or after a specific time duration.

Application state is a first-class citizen in Flink.

You can see that by looking at all the features that Flink provides in the context of state handling.

  • Multiple State Primitives: Flink provides state primitives for different data structures, such as atomic values, lists, or maps. Developers can choose the state primitive that is most efficient based on the access pattern of the function (see the sketch after this list).
  • Pluggable State Backends: Application state is managed in and checkpointed by a pluggable state backend. Flink features different state backends that store state in memory or in RocksDB, an efficient embedded on-disk data store. Custom state backends can be plugged in as well.
  • Exactly-once state consistency: Flink’s checkpointing and recovery algorithms guarantee the consistency of application state in case of a failure. Hence, failures are transparently handled and do not affect the correctness of an application.
  • Very Large State: Flink is able to maintain application state of several terabytes in size due to its asynchronous and incremental checkpoint algorithm.
  • Scalable Applications: Flink supports scaling of stateful applications by redistributing the state to more or fewer workers.
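
A minimal sketch of the first point: each primitive is obtained from the runtime context through a descriptor, typically in a rich function's open() method. The state names here are illustrative, not from the original text:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;

// inside a rich function's open() method:

// a single atomic value per key
ValueState<Long> lastValue = getRuntimeContext().getState(
    new ValueStateDescriptor<>("lastValue", Long.class));

// an appendable list per key
ListState<Long> recentValues = getRuntimeContext().getListState(
    new ListStateDescriptor<>("recentValues", Long.class));

// a key/value map per key
MapState<String, Long> counts = getRuntimeContext().getMapState(
    new MapStateDescriptor<>("counts", String.class, Long.class));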

Time

Time is another important ingredient of streaming applications.

Most event streams have inherent time semantics because each event is produced at a specific point in time.

Moreover, many common stream computations are based on time, such as window aggregations, sessionization, pattern detection, and time-based joins.

An important aspect of stream processing is how an application measures time, i.e., the difference between event-time and processing-time.

Flink provides a rich set of time-related features.

  • Event-time Mode: Applications that process streams with event-time semantics compute results based on timestamps of the events. Thereby, event-time processing allows for accurate and consistent results regardless of whether recorded or real-time events are processed.
  • Watermark Support: Flink employs watermarks to reason about time in event-time applications. Watermarks are also a flexible mechanism to trade off the latency and completeness of results (a sketch follows this list).
  • Late Data Handling: When processing streams in event-time mode with watermarks, it can happen that a computation has been completed before all associated events have arrived. Such events are called late events. Flink features multiple options to handle late events, such as rerouting them via side outputs and updating previously completed results.
  • Processing-time Mode: In addition to its event-time mode, Flink also supports processing-time semantics which performs computations as triggered by the wall-clock time of the processing machine. The processing-time mode can be suitable for certain applications with strict low-latency requirements that can tolerate approximate results.
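
To make the watermark point concrete, here is a hedged sketch that assigns event-time timestamps and emits watermarks tolerating five seconds of out-of-order arrival. The Click type and its clicktime field (assumed to be epoch milliseconds) are borrowed from the examples below, not defined by the original text:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// assign event-time timestamps and emit watermarks that allow
// events to arrive up to five seconds out of order
DataStream<Click> withTimestamps = clicks.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor<Click>(Time.seconds(5)) {
    @Override
    public long extractTimestamp(Click click) {
      return click.clicktime; // assumed epoch milliseconds
    }
  });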

Layered APIs

Flink provides three layered APIs. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases.

We briefly present each API, discuss its applications, and show a code example.

The ProcessFunctions

ProcessFunctions are the most expressive function interfaces that Flink offers. Flink provides ProcessFunctions to process individual events from one or two input streams or events that were grouped in a window. ProcessFunctions provide fine-grained control over time and state. A ProcessFunction can arbitrarily modify its state and register timers that will trigger a callback function in the future. Hence, ProcessFunctions can implement complex per-event business logic as required for many stateful event-driven applications.

The following example shows a KeyedProcessFunction that operates on a KeyedStream and matches START and END events. When a START event is received, the function remembers its timestamp in state and registers a timer in four hours. If an END event is received before the timer fires, the function computes the duration between END and START event, clears the state, and returns the value. Otherwise, the timer just fires and clears the state.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Matches keyed START and END events and computes the difference between 
 * both elements' timestamps. The first String field is the key attribute, 
 * the second String attribute marks START and END events.
 */
public static class StartEndDuration
    extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, Long>> {

  private ValueState<Long> startTime;

  @Override
  public void open(Configuration conf) {
    // obtain state handle
    startTime = getRuntimeContext()
      .getState(new ValueStateDescriptor<Long>("startTime", Long.class));
  }

  /** Called for each processed event. */
  @Override
  public void processElement(
      Tuple2<String, String> in,
      Context ctx,
      Collector<Tuple2<String, Long>> out) throws Exception {

    switch (in.f1) {
      case "START":
        // set the start time if we receive a start event.
        startTime.update(ctx.timestamp());
        // register a timer in four hours from the start event.
        ctx.timerService()
          .registerEventTimeTimer(ctx.timestamp() + 4 * 60 * 60 * 1000);
        break;
      case "END":
        // emit the duration between start and end event
        Long sTime = startTime.value();
        if (sTime != null) {
          out.collect(Tuple2.of(in.f0, ctx.timestamp() - sTime));
          // clear the state
          startTime.clear();
        }
        break;
      default:
        // do nothing
    }
  }

  /** Called when a timer fires. */
  @Override
  public void onTimer(
      long timestamp,
      OnTimerContext ctx,
      Collector<Tuple2<String, Long>> out) {

    // Timeout interval exceeded. Cleaning up the state.
    startTime.clear();
  }
}

The example illustrates the expressive power of the KeyedProcessFunction but also highlights that it is a rather verbose interface.

The DataStream API

The DataStream API provides primitives for many common stream processing operations, such as windowing, record-at-a-time transformations, and enriching events by querying an external data store. The DataStream API is available for Java and Scala and is based on functions, such as map(), reduce(), and aggregate(). Functions can be defined by extending interfaces or as Java or Scala lambda functions.

The following example shows how to sessionize a clickstream and count the number of clicks per session.

// a stream of website clicks
DataStream<Click> clicks = ...

DataStream<Tuple2<String, Long>> result = clicks
  // project clicks to userId and add a 1 for counting
  .map(
    // define function by implementing the MapFunction interface.
    new MapFunction<Click, Tuple2<String, Long>>() {
      @Override
      public Tuple2<String, Long> map(Click click) {
        return Tuple2.of(click.userId, 1L);
      }
    })
  // key by userId (field 0)
  .keyBy(0)
  // define session window with 30 minute gap
  .window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
  // count clicks per session. Define function as lambda function.
  .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

SQL & Table API

Flink features two relational APIs, the Table API and SQL. Both APIs are unified APIs for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, recorded streams and produce the same results. The Table API and SQL leverage Apache Calcite for parsing, validation, and query optimization. They can be seamlessly integrated with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.

Flink’s relational APIs are designed to ease the definition of data analytics, data pipelining, and ETL applications.

The following example shows the SQL query to sessionize a clickstream and count the number of clicks per session. This is the same use case as in the example of the DataStream API.

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
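
As a hedged sketch of how such a query might be embedded in a program (the exact Table API entry points vary across Flink versions; the clicks stream and its userId and clicktime fields are the assumptions from the DataStream example above):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// register the DataStream as a table; ".rowtime" exposes the
// event-time attribute required by the SESSION window
tEnv.registerDataStream("clicks", clicks, "userId, clicktime.rowtime");

Table sessions = tEnv.sqlQuery(
  "SELECT userId, COUNT(*) AS cnt " +
  "FROM clicks " +
  "GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId");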

Libraries

Flink features several libraries for common data processing use cases. The libraries are typically embedded in an API and not fully self-contained. Hence, they can benefit from all features of the API and be integrated with other libraries.

  • Complex Event Processing (CEP): Pattern detection is a very common use case for event stream processing. Flink’s CEP library provides an API to specify patterns of events (think of regular expressions or state machines). The CEP library is integrated with Flink’s DataStream API, such that patterns are evaluated on DataStreams. Applications for the CEP library include network intrusion detection, business process monitoring, and fraud detection (a sketch follows this list).
  • DataSet API: The DataSet API is Flink’s core API for batch processing applications. The primitives of the DataSet API include map, reduce, (outer) join, co-group, and iterate. All operations are backed by algorithms and data structures that operate on serialized data in memory and spill to disk if the data size exceeds the memory budget. The data processing algorithms of Flink’s DataSet API are inspired by traditional database operators, such as hybrid hash-join or external merge-sort.
  • Gelly: Gelly is a library for scalable graph processing and analysis. Gelly is implemented on top of and integrated with the DataSet API. Hence, it benefits from its scalable and robust operators. Gelly features built-in algorithms, such as label propagation, triangle enumeration, and page rank, but also provides a Graph API that eases the implementation of custom graph algorithms.
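
To illustrate the CEP item above, here is a hedged sketch of a pattern definition. The Event type, its getType() accessor, and the events stream are hypothetical; only the CEP API calls themselves come from the library:

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// a "login" event followed by a "logout" event within ten minutes
Pattern<Event, ?> pattern = Pattern.<Event>begin("login")
  .where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event e) {
      return "login".equals(e.getType()); // getType() is hypothetical
    }
  })
  .followedBy("logout")
  .where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event e) {
      return "logout".equals(e.getType());
    }
  })
  .within(Time.minutes(10));

// evaluate the pattern on a DataStream<Event> called events
PatternStream<Event> matches = CEP.pattern(events, pattern);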