Using Window Functions
- Window Functions
- ReduceFunction
- AggregateFunction
- ProcessWindowFunction
- ProcessWindowFunction with Incremental Aggregation
- Incremental Window Aggregation with ReduceFunction
- Incremental Window Aggregation with AggregateFunction
- Using per-window state in ProcessWindowFunction
- WindowFunction (Legacy)
- Keyed Windows
Window Functions
ReduceFunction
A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.
For example (note that the input type and the output type must be the same):
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }
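The snippet above uses documentation placeholders. A minimal runnable sketch, assuming a tumbling processing-time window of 10 seconds and a local socket source on port 9999 (both arbitrary choices for illustration, not prescribed by Flink), could look like this:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object ReduceWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: lines such as "sensor_1 7" read from a local socket.
    val input: DataStream[(String, Long)] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val Array(id, v) = line.split("\\s+")
        (id, v.toLong)
      }

    input
      .keyBy(_._1)                                                 // key by the id field
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // 10-second tumbling windows
      .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }               // incrementally sum the second field
      .print()

    env.execute("ReduceFunction window example")
  }
}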
AggregateFunction
An AggregateFunction is a generalised version of a ReduceFunction that has three types: an input type (IN), an accumulator type (ACC), and an output type (OUT). The input type is the type of elements in the input stream, and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one accumulator, and for extracting an output (of type OUT) from an accumulator. We will see how this works in the example below.
As with ReduceFunction, Flink will incrementally aggregate the input elements of a window as they arrive.
/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing so the average is not truncated by integer division.
  override def getResult(accumulator: (Long, Long)) = accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate)
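To make the accumulator life-cycle concrete, the calls that Flink performs for one window can be driven by hand (purely illustrative; in a real job Flink invokes these methods itself):

// One window of key "a" containing ("a", 2), ("a", 4) and ("a", 9):
val agg = new AverageAggregate
var acc = agg.createAccumulator()   // (0L, 0L)
acc = agg.add(("a", 2L), acc)       // (2L, 1L)  -> running sum = 2, count = 1
acc = agg.add(("a", 4L), acc)       // (6L, 2L)
acc = agg.add(("a", 9L), acc)       // (15L, 3L)
println(agg.getResult(acc))         // 5.0 = 15 / 3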
ProcessWindowFunction
A ProcessWindowFunction gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions. This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing.
Note: The key parameter is the key that is extracted via the KeySelector specified for the keyBy() call. In the case of tuple-index keys or string-field references this key type is always Tuple, and you have to manually cast it to a tuple of the correct size to extract the key fields (a sketch of this cast follows at the end of this section).
abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window] extends Function {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key      The key for which this window is evaluated.
   * @param context  The context in which the window is being evaluated.
   * @param elements The elements in the window being evaluated.
   * @param out      A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def process(
      key: KEY,
      context: Context,
      elements: Iterable[IN],
      out: Collector[OUT])

  /**
   * The context holding window metadata.
   */
  abstract class Context {
    /**
     * Returns the window that is being evaluated.
     */
    def window: W

    /**
     * Returns the current processing time.
     */
    def currentProcessingTime: Long

    /**
     * Returns the current event-time watermark.
     */
    def currentWatermark: Long

    /**
     * State accessor for per-key and per-window state.
     */
    def windowState: KeyedStateStore

    /**
     * State accessor for per-key global state.
     */
    def globalState: KeyedStateStore
  }
}
val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction())

/* ... */

class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]) = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}
This example shows a ProcessWindowFunction that counts the elements in a window. In addition, the window function adds information about the window to the output.
Note: Using a ProcessWindowFunction for simple aggregations such as counting is quite inefficient. The next section shows how a ReduceFunction or AggregateFunction can be combined with a ProcessWindowFunction to get both incremental aggregation and the added information of a ProcessWindowFunction.
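As noted above, when the key is specified by tuple position rather than a KeySelector, the key parameter arrives as a generic Tuple. A minimal sketch of the manual cast (the class name and counting logic are assumptions for illustration):

import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Intended for position-based keying, e.g. input.keyBy(0).window(...).process(new TupleKeyedCount()),
// where the KEY type parameter is the generic Tuple.
class TupleKeyedCount extends ProcessWindowFunction[(String, Long), String, Tuple, TimeWindow] {

  override def process(key: Tuple, context: Context,
                       elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
    val id = key.asInstanceOf[Tuple1[String]].f0   // cast to a tuple of the correct size to read the key field
    out.collect(s"key=$id count=${elements.size}")
  }
}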
ProcessWindowFunction with Incremental Aggregation
A ProcessWindowFunction can be combined with either a ReduceFunction or an AggregateFunction to incrementally aggregate elements as they arrive in the window. When the window is closed, the ProcessWindowFunction is provided with the aggregated result. This allows it to incrementally compute windows while also having access to the additional window meta information of the ProcessWindowFunction.
Note: you can also use the legacy WindowFunction instead of ProcessWindowFunction for incremental window aggregation, but a WindowFunction has no Context and therefore cannot access the window metadata.
Incremental Window Aggregation with ReduceFunction
val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce(
    (r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
    ( key: String,
      context: ProcessWindowFunction[_, _, _, TimeWindow]#Context,
      minReadings: Iterable[SensorReading],
      out: Collector[(Long, SensorReading)] ) =>
      {
        val min = minReadings.iterator.next()
        out.collect((context.window.getStart, min))
      }
  )
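The SensorReading type in this snippet is not defined in the text; a minimal hypothetical definition that makes the example compile could be:

// Hypothetical sensor record: sensor id, timestamp in milliseconds, measured value.
case class SensorReading(id: String, timestamp: Long, value: Double)

Because the ReduceFunction pre-aggregates each window down to its minimum reading, the Iterable handed to the process part contains exactly one element.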
Incremental Window Aggregation with AggregateFunction
The AggregateFunction used here is the one under org.apache.flink.api.common.functions.AggregateFunction. If the class cannot be resolved, you can add the following dependency to your pom file:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
    <version>${flink.version}</version>
</dependency>
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction())
// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing so the average is not truncated by integer division.
  override def getResult(accumulator: (Long, Long)) = accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

class MyProcessWindowFunction extends ProcessWindowFunction[Double, (String, Double), String, TimeWindow] {

  def process(key: String, context: Context, averages: Iterable[Double], out: Collector[(String, Double)]) = {
    val average = averages.iterator.next()
    out.collect((key, average))
  }
}
Using per-window state in ProcessWindowFunction
Besides accessing keyed state (as any rich function can), a ProcessWindowFunction can also use keyed state that is scoped to the window the function is currently processing. In this context it is important to understand what window the per-window state is referring to. There are different "windows" involved:
- The window that was defined when specifying the windowed operation: this might be a tumbling window of 1 hour, or a sliding window of 2 hours that slides by 1 hour.
- An actual instance of a defined window for a given key: this might be the time window from 12:00 to 13:00 for user-id xyz. This is based on the window definition, and there will be many such instances depending on the number of keys that the job is currently processing and on the time slots the events fall into.
Per-window state is tied to the latter. This means that if we process events for 1000 different keys and events for all of them currently fall into the [12:00, 13:00) time window, then there will be 1000 window instances, each with its own keyed per-window state.
There are two methods on the Context object that a process() call receives that allow access to the two types of state:
- globalState(), which allows access to keyed state that is not scoped to a window
- windowState(), which allows access to keyed state that is also scoped to the window
This feature is helpful if you anticipate multiple firings for the same window, which can happen when you have late firings for data that arrives late, or when you have a custom trigger that performs speculative early firings. In such a case you would store information about previous firings or the number of firings in per-window state.
When using windowed state, it is also important to clean up that state when a window is cleared. This should happen in the clear() method.
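A minimal sketch of per-window state, assuming a counter that records how often a window instance has fired (the class and state names are illustrative assumptions):

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Tags every emitted record with the number of times this window instance has fired,
// which is only meaningful when late or early firings can occur.
class CountPerFiring extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  // Descriptor for the per-key, per-window firing counter.
  private val firingsDesc =
    new ValueStateDescriptor[java.lang.Long]("firings", classOf[java.lang.Long])

  override def process(key: String, context: Context,
                       elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
    // windowState is scoped to this key AND this window instance;
    // globalState would be scoped to the key only.
    val firings = context.windowState.getState(firingsDesc)
    val n = Option(firings.value()).map(_.longValue()).getOrElse(0L) + 1
    firings.update(n)
    out.collect(s"key=$key firing=$n elements=${elements.size}")
  }

  // Called when the window is purged: release the per-window state explicitly.
  override def clear(context: Context): Unit =
    context.windowState.getState(firingsDesc).clear()
}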
WindowFunction (Legacy)
In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and does not have some advanced features, such as per-window keyed state. This interface will be deprecated at some point.
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key    The key for which this window is evaluated.
   * @param window The window that is being evaluated.
   * @param input  The elements in the window being evaluated.
   * @param out    A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .apply(new MyWindowFunction())
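MyWindowFunction is referenced above but not shown; a minimal sketch (the counting logic is an assumption for illustration) could be:

import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Emits one record per window with the key, the window start and the element count.
// Note that there is no Context here: only the window itself is available.
class MyWindowFunction extends WindowFunction[(String, Long), String, String, TimeWindow] {

  override def apply(key: String, window: TimeWindow,
                     input: Iterable[(String, Long)], out: Collector[String]): Unit = {
    out.collect(s"key=$key windowStart=${window.getStart} count=${input.size}")
  }
}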
Keyed Windows
stream
       .keyBy(...)                     <-  keyed versus non-keyed windows
       .window(...)                    <-  required: "assigner"
      [.trigger(...)]                  <-  optional: "trigger" (else default trigger)
      [.evictor(...)]                  <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]          <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)]       <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()  <-  required: "function"
      [.getSideOutput(...)]            <-  optional: "output tag"
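Putting the skeleton together, a minimal sketch of a complete keyed-window pipeline (the assigner, lateness and output-tag choices are illustrative assumptions, and timestamps/watermarks are assumed to be assigned already):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val input: DataStream[(String, Long)] = ...

// Side-output tag for records that arrive after the allowed lateness has passed.
val lateTag = OutputTag[(String, Long)]("late-data")

val counted = input
  .keyBy(_._1)                                            // keyed stream
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))   // required: assigner
  .allowedLateness(Time.minutes(1))                       // optional: accept 1 minute of lateness
  .sideOutputLateData(lateTag)                            // optional: route very late data aside
  .reduce { (a, b) => (a._1, a._2 + b._2) }               // required: window function

val lateData = counted.getSideOutput(lateTag)             // optional: retrieve the late records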