Using Window Functions
- Window Functions
- ReduceFunction
- AggregateFunction
- ProcessWindowFunction
- ProcessWindowFunction with Incremental Aggregation
- Incremental Window Aggregation with ReduceFunction
- Incremental Window Aggregation with AggregateFunction
- Using per-window state in ProcessWindowFunction
- WindowFunction (Legacy)
- Keyed Windows
Window Functions
ReduceFunction
A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type. Flink uses a ReduceFunction to incrementally aggregate the elements of a window.
For example (note that the input type and the output type must be the same):
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }
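The snippet above uses documentation placeholders. A minimal runnable sketch, assuming a tumbling processing-time window of 10 seconds and a local socket source on port 9999 (both arbitrary choices for illustration, not prescribed by Flink), could look like this:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object ReduceWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: lines such as "sensor_1 7" read from a local socket.
    val input: DataStream[(String, Long)] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val Array(id, v) = line.split("\\s+")
        (id, v.toLong)
      }

    input
      .keyBy(_._1)                                                 // key by the id field
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // 10-second tumbling windows
      .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }               // incrementally sum the second field
      .print()

    env.execute("ReduceFunction window example")
  }
}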
AggregateFunction
An AggregateFunction is a generalised version of a ReduceFunction that has three types: an input type (IN), an accumulator type (ACC), and an output type (OUT). The input type is the type of elements in the input stream, and the AggregateFunction has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one accumulator, and for extracting an output (of type OUT) from an accumulator. We will see how this works in the example below.
As with ReduceFunction, Flink will incrementally aggregate the input elements of a window as they arrive.
/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing so the average is not truncated by integer division.
  override def getResult(accumulator: (Long, Long)) = accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate)
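To make the accumulator life-cycle concrete, the calls that Flink performs for one window can be driven by hand (purely illustrative; in a real job Flink invokes these methods itself):

// One window of key "a" containing ("a", 2), ("a", 4) and ("a", 9):
val agg = new AverageAggregate
var acc = agg.createAccumulator()   // (0L, 0L)
acc = agg.add(("a", 2L), acc)       // (2L, 1L)  -> running sum = 2, count = 1
acc = agg.add(("a", 4L), acc)       // (6L, 2L)
acc = agg.add(("a", 9L), acc)       // (15L, 3L)
println(agg.getResult(acc))         // 5.0 = 15 / 3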
ProcessWindowFunction
A ProcessWindowFunction gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions. This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing.
Note: The key parameter is the key that is extracted via the KeySelector specified for the keyBy() call. In the case of tuple-index keys or string-field references this key type is always Tuple, and you have to manually cast it to a tuple of the correct size to extract the key fields (a sketch of this cast follows at the end of this section).
abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window] extends Function {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key      The key for which this window is evaluated.
   * @param context  The context in which the window is being evaluated.
   * @param elements The elements in the window being evaluated.
   * @param out      A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def process(
      key: KEY,
      context: Context,
      elements: Iterable[IN],
      out: Collector[OUT])

  /**
   * The context holding window metadata.
   */
  abstract class Context {
    /**
     * Returns the window that is being evaluated.
     */
    def window: W

    /**
     * Returns the current processing time.
     */
    def currentProcessingTime: Long

    /**
     * Returns the current event-time watermark.
     */
    def currentWatermark: Long

    /**
     * State accessor for per-key and per-window state.
     */
    def windowState: KeyedStateStore

    /**
     * State accessor for per-key global state.
     */
    def globalState: KeyedStateStore
  }
}
val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction())

/* ... */

class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]) = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}
This example shows a ProcessWindowFunction that counts the elements in a window. In addition, the window function adds information about the window to the output.
Note: Using a ProcessWindowFunction for simple aggregations such as counting is quite inefficient. The next section shows how a ReduceFunction or AggregateFunction can be combined with a ProcessWindowFunction to get both incremental aggregation and the added information of a ProcessWindowFunction.
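As noted above, when the key is specified by tuple position rather than a KeySelector, the key parameter arrives as a generic Tuple. A minimal sketch of the manual cast (the class name and counting logic are assumptions for illustration):

import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Intended for position-based keying, e.g. input.keyBy(0).window(...).process(new TupleKeyedCount()),
// where the KEY type parameter is the generic Tuple.
class TupleKeyedCount extends ProcessWindowFunction[(String, Long), String, Tuple, TimeWindow] {

  override def process(key: Tuple, context: Context,
                       elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
    val id = key.asInstanceOf[Tuple1[String]].f0   // cast to a tuple of the correct size to read the key field
    out.collect(s"key=$id count=${elements.size}")
  }
}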
ProcessWindowFunction with Incremental Aggregation
A ProcessWindowFunction can be combined with either a ReduceFunction or an AggregateFunction to incrementally aggregate elements as they arrive in the window. When the window is closed, the ProcessWindowFunction is provided with the aggregated result. This allows it to incrementally compute windows while also having access to the additional window meta information of the ProcessWindowFunction.
Note: you can also use the legacy WindowFunction instead of ProcessWindowFunction for incremental window aggregation, but a WindowFunction has no Context and therefore cannot access the window metadata.
Incremental Window Aggregation with ReduceFunction
val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce(
    (r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
    ( key: String,
      context: ProcessWindowFunction[_, _, _, TimeWindow]#Context,
      minReadings: Iterable[SensorReading],
      out: Collector[(Long, SensorReading)] ) =>
      {
        val min = minReadings.iterator.next()
        out.collect((context.window.getStart, min))
      }
  )
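The SensorReading type in this snippet is not defined in the text; a minimal hypothetical definition that makes the example compile could be:

// Hypothetical sensor record: sensor id, timestamp in milliseconds, measured value.
case class SensorReading(id: String, timestamp: Long, value: Double)

Because the ReduceFunction pre-aggregates each window down to its minimum reading, the Iterable handed to the process part contains exactly one element.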
Incremental Window Aggregation with AggregateFunction
The AggregateFunction used here is the one under org.apache.flink.api.common.functions.AggregateFunction. If the class cannot be resolved, you can add the following dependency to your pom file:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
    <version>${flink.version}</version>
</dependency>
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction())
// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing so the average is not truncated by integer division.
  override def getResult(accumulator: (Long, Long)) = accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

class MyProcessWindowFunction extends ProcessWindowFunction[Double, (String, Double), String, TimeWindow] {

  def process(key: String, context: Context, averages: Iterable[Double], out: Collector[(String, Double)]) = {
    val average = averages.iterator.next()
    out.collect((key, average))
  }
}
Using per-window state in ProcessWindowFunction
Besides accessing keyed state (as any rich function can), a ProcessWindowFunction can also use keyed state that is scoped to the window the function is currently processing. In this context it is important to understand what window the per-window state is referring to. There are different "windows" involved:
- The window that was defined when specifying the windowed operation: this might be a tumbling window of 1 hour, or a sliding window of 2 hours that slides by 1 hour.
- An actual instance of a defined window for a given key: this might be the time window from 12:00 to 13:00 for user-id xyz. This is based on the window definition, and there will be many such instances depending on the number of keys that the job is currently processing and on the time slots the events fall into.
Per-window state is tied to the latter. This means that if we process events for 1000 different keys and events for all of them currently fall into the [12:00, 13:00) time window, then there will be 1000 window instances, each with its own keyed per-window state.
There are two methods on the Context object that a process() call receives that allow access to the two types of state:
- globalState(), which allows access to keyed state that is not scoped to a window
- windowState(), which allows access to keyed state that is also scoped to the window
This feature is helpful if you anticipate multiple firings for the same window, which can happen when you have late firings for data that arrives late, or when you have a custom trigger that performs speculative early firings. In such a case you would store information about previous firings or the number of firings in per-window state.
When using windowed state, it is also important to clean up that state when a window is cleared. This should happen in the clear() method.
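A minimal sketch of per-window state, assuming a counter that records how often a window instance has fired (the class and state names are illustrative assumptions):

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Tags every emitted record with the number of times this window instance has fired,
// which is only meaningful when late or early firings can occur.
class CountPerFiring extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {

  // Descriptor for the per-key, per-window firing counter.
  private val firingsDesc =
    new ValueStateDescriptor[java.lang.Long]("firings", classOf[java.lang.Long])

  override def process(key: String, context: Context,
                       elements: Iterable[(String, Long)], out: Collector[String]): Unit = {
    // windowState is scoped to this key AND this window instance;
    // globalState would be scoped to the key only.
    val firings = context.windowState.getState(firingsDesc)
    val n = Option(firings.value()).map(_.longValue()).getOrElse(0L) + 1
    firings.update(n)
    out.collect(s"key=$key firing=$n elements=${elements.size}")
  }

  // Called when the window is purged: release the per-window state explicitly.
  override def clear(context: Context): Unit =
    context.windowState.getState(firingsDesc).clear()
}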
WindowFunction (Legacy)
In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and does not have some advanced features, such as per-window keyed state. This interface will be deprecated at some point.
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key    The key for which this window is evaluated.
   * @param window The window that is being evaluated.
   * @param input  The elements in the window being evaluated.
   * @param out    A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .apply(new MyWindowFunction())
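MyWindowFunction is referenced above but not shown; a minimal sketch (the counting logic is an assumption for illustration) could be:

import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Emits one record per window with the key, the window start and the element count.
// Note that there is no Context here: only the window itself is available.
class MyWindowFunction extends WindowFunction[(String, Long), String, String, TimeWindow] {

  override def apply(key: String, window: TimeWindow,
                     input: Iterable[(String, Long)], out: Collector[String]): Unit = {
    out.collect(s"key=$key windowStart=${window.getStart} count=${input.size}")
  }
}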
Keyed Windows
stream
       .keyBy(...)                     <-  keyed versus non-keyed windows
       .window(...)                    <-  required: "assigner"
      [.trigger(...)]                  <-  optional: "trigger" (else default trigger)
      [.evictor(...)]                  <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]          <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)]       <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()  <-  required: "function"
      [.getSideOutput(...)]            <-  optional: "output tag"
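Putting the skeleton together, a minimal sketch of a complete keyed-window pipeline (the assigner, lateness and output-tag choices are illustrative assumptions, and timestamps/watermarks are assumed to be assigned already):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val input: DataStream[(String, Long)] = ...

// Side-output tag for records that arrive after the allowed lateness has passed.
val lateTag = OutputTag[(String, Long)]("late-data")

val counted = input
  .keyBy(_._1)                                            // keyed stream
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))   // required: assigner
  .allowedLateness(Time.minutes(1))                       // optional: accept 1 minute of lateness
  .sideOutputLateData(lateTag)                            // optional: route very late data aside
  .reduce { (a, b) => (a._1, a._2 + b._2) }               // required: window function

val lateData = counted.getSideOutput(lateTag)             // optional: retrieve the late records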