Using Window Functions

  • Window Functions
  • ReduceFunction
  • AggregateFunction
  • ProcessWindowFunction
  • ProcessWindowFunction with Incremental Aggregation
  • Incremental Window Aggregation with ReduceFunction
  • Incremental Window Aggregation with AggregateFunction
  • Using per-window state in ProcessWindowFunction
  • WindowFunction (Legacy)
  • Keyed Windows


Window Functions

ReduceFunction

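A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.

Example (the input type and the output type must be the same; the function below sums the second fields of the tuples):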

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }
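
To make "incremental" concrete, here is a minimal plain-Scala sketch with no Flink dependency. It only mimics what happens for one key in one window: a single running value is kept and each arriving element is folded into it. The names IncrementalReduceSketch and WindowState are hypothetical stand-ins, not Flink classes.

```scala
object IncrementalReduceSketch {
  // Same shape as the function passed to .reduce(...) above.
  val sum: ((String, Long), (String, Long)) => (String, Long) =
    (v1, v2) => (v1._1, v1._2 + v2._2)

  // Hypothetical stand-in for the state of one window: it holds at most one element.
  final class WindowState {
    private var current: Option[(String, Long)] = None

    // Fold the new element into the running value instead of buffering it.
    def add(value: (String, Long)): Unit =
      current = Some(current.fold(value)(sum(_, value)))

    def result: Option[(String, Long)] = current
  }

  def main(args: Array[String]): Unit = {
    val state = new WindowState
    Seq(("a", 1L), ("a", 2L), ("a", 3L)).foreach(state.add)
    println(state.result) // prints Some((a,6))
  }
}
```

Because the running value has the same type as the input, the state for a window stays constant in size no matter how many elements arrive.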

AggregateFunction

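An AggregateFunction is a generalized version of a ReduceFunction that has three types: an input type (IN), an accumulator type (ACC), and an output type (OUT). The input type is the type of the elements in the input stream, and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one, and for extracting an output (of type OUT) from an accumulator. The example below shows how this works.
As with ReduceFunction, Flink incrementally aggregates the input elements of a window as they arrive.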

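import org.apache.flink.api.common.functions.AggregateFunction

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  // Start from an empty (sum, count) pair.
  override def createAccumulator(): (Long, Long) = (0L, 0L)

  // Fold one input element into the accumulator.
  override def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double first so the average is not truncated by Long division.
  override def getResult(accumulator: (Long, Long)): Double =
    accumulator._1.toDouble / accumulator._2

  // Combine two partial (sum, count) pairs.
  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)
}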

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate)
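
The accumulator lifecycle described above (create, add, merge, extract) can be traced by hand. The following is a plain-Scala sketch with no Flink dependency; AverageAccumulatorSketch and its methods are illustrative names that mirror the interface, not the interface itself. It converts the sum to Double before dividing so that the fractional part of the average is kept.

```scala
object AverageAccumulatorSketch {
  type Acc = (Long, Long) // (running sum, element count)

  def createAccumulator(): Acc = (0L, 0L)

  def add(value: (String, Long), acc: Acc): Acc =
    (acc._1 + value._2, acc._2 + 1L)

  def merge(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  def getResult(acc: Acc): Double = acc._1.toDouble / acc._2

  def main(args: Array[String]): Unit = {
    // Two partial accumulators, as if the elements had been aggregated separately.
    val part1 = Seq(("a", 1L), ("a", 2L))
      .foldLeft(createAccumulator())((acc, v) => add(v, acc))
    val part2 = Seq(("a", 4L), ("a", 3L))
      .foldLeft(createAccumulator())((acc, v) => add(v, acc))

    // Merging gives the same accumulator as adding all four elements to one.
    println(getResult(merge(part1, part2))) // prints 2.5
  }
}
```

Only the two Long values of the accumulator are stored per window, regardless of how many elements contribute to the average.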

ProcessWindowFunction

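A ProcessWindowFunction receives an Iterable containing all the elements of the window, plus a Context object with access to time and state information, which gives it more flexibility than the other window functions. This comes at the cost of performance and resource consumption: elements cannot be aggregated incrementally and must instead be buffered internally until the window is considered ready for processing.
Note: the key parameter is the key extracted by the KeySelector that was specified in the keyBy() call. With tuple-index keys or string field references, this key type is always Tuple, and you have to cast it manually to a tuple of the correct size to extract the key field.

The signature of ProcessWindowFunction looks as follows: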

abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window] extends Function {

  /**
    * Evaluates the window and outputs none or several elements.
    *
    * @param key      The key for which this window is evaluated.
    * @param context  The context in which the window is being evaluated.
    * @param elements The elements in the window being evaluated.
    * @param out      A collector for emitting elements.
    * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
    */
  def process(
      key: KEY,
      context: Context,
      elements: Iterable[IN],
      out: Collector[OUT])

  /**
    * The context holding window metadata
    */
  abstract class Context {
    /**
      * Returns the window that is being evaluated.
      */
    def window: W

    /**
      * Returns the current processing time.
      */
    def currentProcessingTime: Long

    /**
      * Returns the current event-time watermark.
      */
    def currentWatermark: Long

    /**
      * State accessor for per-key and per-window state.
      */
    def windowState: KeyedStateStore

    /**
      * State accessor for per-key global state.
      */
    def globalState: KeyedStateStore
  }

}

A ProcessWindowFunction can be defined and used like this:

val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction())

/* ... */

class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]) = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}

This example shows a ProcessWindowFunction that counts the elements in a window. In addition, the window function adds information about the window to the output.

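Note: using ProcessWindowFunction for simple aggregates such as a count is quite inefficient. The next section shows how a ReduceFunction or AggregateFunction can be combined with a ProcessWindowFunction to get both incremental aggregation and the additional information available to a ProcessWindowFunction.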

ProcessWindowFunction with Incremental Aggregation

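A ProcessWindowFunction can be combined with either a ReduceFunction or an AggregateFunction to incrementally aggregate elements as they arrive in the window. When the window is closed, the ProcessWindowFunction is provided with the aggregated result. This makes it possible to compute windows incrementally while still having access to the additional window meta information of the ProcessWindowFunction.
Note that you can also use the legacy WindowFunction instead of ProcessWindowFunction for incremental window aggregation. However, WindowFunction has no Context, so it cannot access the window context information.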

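Incremental Window Aggregation with ReduceFunction

The following example combines an incremental ReduceFunction with a ProcessWindowFunction to return the smallest reading in a window together with the start time of the window.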

val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce(
    (r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
    ( key: String,
      context: ProcessWindowFunction[_, _, _, TimeWindow]#Context,
      minReadings: Iterable[SensorReading],
      out: Collector[(Long, SensorReading)] ) =>
      {
        val min = minReadings.iterator.next()
        out.collect((context.window.getStart, min))
      }
  )

Incremental Window Aggregation with AggregateFunction

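The following example combines an incremental AggregateFunction with a ProcessWindowFunction to compute the average and emit it together with the key. The AggregateFunction used here is org.apache.flink.api.common.functions.AggregateFunction; take care not to import org.apache.flink.table.functions.AggregateFunction, a different class with the same simple name. If the class cannot be resolved in your project, you can add the following dependency to your pom file: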

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
    <version>${flink.version}</version>
</dependency>

val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction())

// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double first so the average is not truncated by Long division.
  override def getResult(accumulator: (Long, Long)): Double = accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

class MyProcessWindowFunction extends ProcessWindowFunction[Double, (String, Double), String, TimeWindow] {

  def process(key: String, context: Context, averages: Iterable[Double], out: Collector[(String, Double)]) = {
    val average = averages.iterator.next()
    out.collect((key, average))
  }
}

Using per-window state in ProcessWindowFunction

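In addition to accessing keyed state (as any rich function can), a ProcessWindowFunction can also use keyed state that is scoped to the window the function is currently processing. In this context it is important to understand which window the per-window state refers to. There are different "windows" involved:

  • The window that was defined when specifying the windowed operation: this might be tumbling windows of 1 hour or sliding windows of 2 hours that slide by 1 hour.
  • An actual instance of a defined window for a given key: this might be the time window from 12:00 to 13:00 for user-id xyz. It is based on the window definition, and there will be many such windows, depending on the number of keys the job is currently processing and on the time slots the events fall into.

Per-window state is tied to the latter of the two. This means that if we process events for 1000 different keys and the events for all of them currently fall into the [12:00, 13:00) time window, there will be 1000 window instances, each with its own keyed per-window state.
The Context object that a process() invocation receives has two methods that give access to the two kinds of state:

  • globalState(), which gives access to keyed state that is not scoped to a window
  • windowState(), which gives access to keyed state that is also scoped to the window

This feature is helpful if you anticipate multiple firings for the same window, as can happen when you have late firings for data that arrives late, or when you have a custom trigger that does speculative early firings. In such a case you would store information about previous firings, or the number of firings, in per-window state.
When using windowed state it is important to also clean up that state when the window is cleared. This should happen in the clear() method.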
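
As an illustration of per-window state and its cleanup, the sketch below counts how many times each window instance has fired. It assumes the Flink Scala DataStream API used elsewhere in this article is on the classpath; the class name FiringCountFunction and the state name "firings" are made up for this example.

```scala
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical example class: counts the firings of each (key, window) instance.
class FiringCountFunction
    extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  // Descriptor for the per-window counter; the name "firings" is arbitrary.
  private val firingsDescriptor =
    new ValueStateDescriptor[java.lang.Integer]("firings", classOf[java.lang.Integer])

  override def process(
      key: String,
      context: Context,
      elements: Iterable[(String, Long)],
      out: Collector[String]): Unit = {
    // windowState is scoped to this key and this window instance.
    val firings = context.windowState.getState(firingsDescriptor)
    // value() returns null before the first update.
    val count = Option(firings.value()).map(_.intValue).getOrElse(0) + 1
    firings.update(count)
    out.collect(s"key=$key window=${context.window} firing=$count elements=${elements.size}")
  }

  // Called when the window is purged: drop the per-window state here.
  override def clear(context: Context): Unit =
    context.windowState.getState(firingsDescriptor).clear()
}
```

Reading the same descriptor through context.globalState instead would give one counter per key shared across all of that key's windows, and that state would not be tied to the lifetime of any single window.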

WindowFunction (Legacy)

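In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and lacks some advanced features, such as per-window keyed state. This interface will be deprecated at some point.

The signature of a WindowFunction looks as follows: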

trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

  /**
    * Evaluates the window and outputs none or several elements.
    *
    * @param key    The key for which this window is evaluated.
    * @param window The window that is being evaluated.
    * @param input  The elements in the window being evaluated.
    * @param out    A collector for emitting elements.
    * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
    */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}

It can be used like this:

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .apply(new MyWindowFunction())

Keyed Windows

stream
 .keyBy(…) <- keyed versus non-keyed windows
 .window(…) <- required: “assigner”
 [.trigger(…)] <- optional: “trigger” (else default trigger)
 [.evictor(…)] <- optional: “evictor” (else no evictor)
 [.allowedLateness(…)] <- optional: “lateness” (else zero)
 [.sideOutputLateData(…)] <- optional: “output tag” (else no side output for late data)
 .reduce/aggregate/fold/apply() <- required: “function”
 [.getSideOutput(…)] <- optional: “output tag”
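
The skeleton above can be filled in as follows. This is only a sketch against the Flink Scala DataStream API as it is used in this article (with Time-based window sizes); the object name, the method name sumPerKey, the tag name "late-data", and the concrete durations are assumptions chosen for illustration.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object KeyedWindowPipelineSketch {

  // Returns the per-window sums and, separately, the elements that arrived too late.
  def sumPerKey(input: DataStream[(String, Long)])
      : (DataStream[(String, Long)], DataStream[(String, Long)]) = {
    // Hypothetical tag identifying the side output for late data.
    val lateTag = OutputTag[(String, Long)]("late-data")

    val sums = input
      .keyBy(_._1)                                          // keyed windows
      .window(TumblingEventTimeWindows.of(Time.minutes(5))) // required: assigner
      .allowedLateness(Time.minutes(1))                     // optional: lateness
      .sideOutputLateData(lateTag)                          // optional: output tag
      .reduce((v1, v2) => (v1._1, v1._2 + v2._2))           // required: function

    (sums, sums.getSideOutput(lateTag))
  }
}
```

The trigger and evictor steps are left out here, so the defaults noted in the skeleton apply.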