Today let's look at a technical point worth noting in streaming SQL: different SQL queries produce different types of output.
Consider two queries: a GROUP BY with a window, and a GROUP BY without one. The two queries produce different types of output:
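- A windowed GROUP BY only needs to keep appending its results, because time keeps moving forward (ignoring the case where late data forces a correction to results that were already emitted);
- A non-windowed GROUP BY has to correct results it has already emitted whenever new data arrives for the same group (this also depends on how long the per-group state is retained).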
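
To make the difference concrete, here is a minimal sketch in plain Scala (no Flink dependency) that simulates both kinds of query over a small in-memory stream. The Click type, the clicks/uid/rowtime names in the SQL comments, and both function names are hypothetical and chosen only for illustration. Late data is ignored, as in the text above.

```scala
object GroupByOutputs {
  // Hypothetical input record: a user id and an event time in seconds.
  final case class Click(uid: String, ts: Long)

  // Windowed GROUP BY, roughly:
  //   SELECT uid, COUNT(*) FROM clicks
  //   GROUP BY TUMBLE(rowtime, INTERVAL '10' SECOND), uid
  // Each (window start, uid) pair yields exactly one final row, so the
  // output only ever grows. Assumes no late data.
  def windowedCounts(clicks: Seq[Click], sizeSec: Long): Seq[(Long, String, Int)] =
    clicks
      .groupBy(c => (c.ts / sizeSec * sizeSec, c.uid))
      .toSeq
      .map { case ((start, uid), cs) => (start, uid, cs.size) }
      .sortBy { case (start, uid, _) => (start, uid) }

  // Non-windowed GROUP BY, roughly:
  //   SELECT uid, COUNT(*) FROM clicks GROUP BY uid
  // Every input row changes the count of its group, so each emitted row
  // supersedes the row previously emitted for the same uid.
  def runningCounts(clicks: Seq[Click]): Seq[(String, Int)] = {
    val counts = scala.collection.mutable.Map.empty[String, Int]
    clicks.map { c =>
      val n = counts.getOrElse(c.uid, 0) + 1
      counts(c.uid) = n
      (c.uid, n)
    }
  }
}
```

For the input Click("a", 1), Click("b", 2), Click("a", 12) with a 10-second window, the windowed version yields one row per window and user, while the running version emits ("a", 1) and later ("a", 2): the second row is a correction of the first, not a new fact.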
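There are also several ways to make the correction:
- For results with a primary key, either update the original result in place, or delete the original result and write the new one;
- For results without a primary key, delete the original result and write the new one.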
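
The two correction styles can be sketched as change messages, again in plain Scala. This is an illustrative model, not Flink's implementation: the Change alias, the function names, and the (String, Int) row shape are assumptions. The Boolean flag follows the convention that true means "add or upsert" and false means "delete".

```scala
object ChangeEncodings {
  // A change message: flag = true adds or upserts a row, false deletes it.
  type Change = (Boolean, (String, Int))

  // Delete-then-write: withdraw the old row (if any), then add the new one.
  // No key is needed because the old row itself says what to remove.
  def asRetractions(old: Option[(String, Int)], updated: (String, Int)): Seq[Change] =
    old.map(o => (false, o)).toSeq :+ ((true, updated))

  // Update in place: a single message; the sink overwrites the row with the same key.
  def asUpsert(updated: (String, Int)): Seq[Change] = Seq((true, updated))

  // Apply retract-style messages to a keyless bag of rows.
  def applyRetract(rows: Vector[(String, Int)], changes: Seq[Change]): Vector[(String, Int)] =
    changes.foldLeft(rows) {
      case (acc, (true, row)) => acc :+ row
      case (acc, (false, row)) =>
        val i = acc.indexOf(row)
        if (i >= 0) acc.patch(i, Nil, 1) else acc
    }

  // Apply upsert-style messages to a table keyed by the first field.
  def applyUpsert(table: Map[String, Int], changes: Seq[Change]): Map[String, Int] =
    changes.foldLeft(table) {
      case (acc, (true, (k, v)))  => acc.updated(k, v)
      case (acc, (false, (k, _))) => acc - k
    }
}
```

Changing the count for "a" from 1 to 2 takes two messages in the first style and one in the second, which is the trade-off between the two: the keyed form is cheaper, but it only works when the result has a key.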
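Streaming SQL therefore needs different kinds of TableSink to support different queries. Taking Flink as an example, Flink has three streaming TableSinks:
- AppendStreamTableSink corresponds to the output type described above that only needs to keep appending;
- RetractStreamTableSink corresponds to the output type that can correct results without a primary key, or correct results with a primary key by deleting the original result and writing the new one;
- UpsertStreamTableSink corresponds to the output type that corrects results with a primary key by updating them in place.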
So which kind of SQL needs which output type? Again, let's look at Flink's implementation, StreamTableEnvironment#writeToSink:
AppendStreamTableSink

    case appendSink: AppendStreamTableSink[_] =>
      // optimize plan
      val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)

      // verify table is an insert-only (append-only) table
      if (!UpdatingPlanChecker.isAppendOnly(optimizedPlan)) {
        throw new TableException(
          "AppendStreamTableSink requires that Table has only insert changes.")
      }

      val outputType = sink.getOutputType
      val resultType = getResultType(table.getRelNode, optimizedPlan)

      // translate the Table into a DataStream and provide the type that the TableSink expects.
      val result: DataStream[T] =
        translate(
          optimizedPlan,
          resultType,
          streamQueryConfig,
          withChangeFlag = false)(outputType)

      // Give the DataStream to the TableSink to emit it.
      appendSink.asInstanceOf[AppendStreamTableSink[T]].emitDataStream(result)

Only SQL for which UpdatingPlanChecker#isAppendOnly returns true can use AppendStreamTableSink:

    /** Validates that the plan produces only append changes. */
    def isAppendOnly(plan: RelNode): Boolean = {
      val appendOnlyValidator = new AppendOnlyValidator
      appendOnlyValidator.go(plan)
      appendOnlyValidator.isAppendOnly
    }

    private class AppendOnlyValidator extends RelVisitor {
      var isAppendOnly = true

      override def visit(node: RelNode, ordinal: Int, parent: RelNode): Unit = {
        node match {
          case s: DataStreamRel if s.producesUpdates || s.producesRetractions =>
            isAppendOnly = false
          case _ =>
            super.visit(node, ordinal, parent)
        }
      }
    }

Only DataStreamGroupAggregate#producesUpdates is true. This is the case described above: the results of a non-windowed GROUP BY need to be updated and are not append-only.
And only DataStreamJoin#producesRetractions can possibly be true:

    // outer join will generate retractions
    override def producesRetractions: Boolean = joinType != JoinRelType.INNER

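In other words, the result of an outer join may need to be corrected. This has to do with how Flink implements streaming joins, which is not covered here.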
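
The append-only check boils down to a tree walk. Below is a minimal sketch on a toy plan tree, assuming a simplified Node type that stands in for Flink's DataStreamRel. The node names and the example plans are made up for illustration, and the recursion replaces the RelVisitor used in the real code.

```scala
object AppendOnlyCheck {
  // Simplified stand-in for a physical plan node; not Flink's DataStreamRel.
  final case class Node(
      name: String,
      producesUpdates: Boolean = false,
      producesRetractions: Boolean = false,
      inputs: Seq[Node] = Nil)

  // A plan is append-only if no node anywhere in the tree
  // produces updates or retractions.
  def isAppendOnly(plan: Node): Boolean =
    !plan.producesUpdates && !plan.producesRetractions &&
      plan.inputs.forall(isAppendOnly)

  // Hypothetical example plans.
  val scan = Node("Scan")
  val windowAgg = Node("GroupWindowAggregate", inputs = Seq(scan))
  val groupAgg = Node("GroupAggregate", producesUpdates = true, inputs = Seq(scan))
  val leftJoin = Node("LeftOuterJoin", producesRetractions = true, inputs = Seq(scan, scan))
  val calcOverAgg = Node("Calc", inputs = Seq(groupAgg))
}
```

Note that calcOverAgg is not append-only even though the Calc node itself produces no updates: one updating node anywhere below the root is enough to rule out an append sink.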
RetractStreamTableSink

    case retractSink: RetractStreamTableSink[_] =>
      // retraction sink can always be used
      val outputType = sink.getOutputType

      // translate the Table into a DataStream and provide the type that the TableSink expects.
      val result: DataStream[T] =
        translate(
          table,
          streamQueryConfig,
          updatesAsRetraction = true,
          withChangeFlag = true)(outputType)

      // Give the DataStream to the TableSink to emit it.
      retractSink.asInstanceOf[RetractStreamTableSink[Any]]
        .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])

As you can see, RetractStreamTableSink has no restrictions: any SQL can use it.
UpsertStreamTableSink

    case upsertSink: UpsertStreamTableSink[_] =>
      // optimize plan
      val optimizedPlan = optimize(table.getRelNode, updatesAsRetraction = false)

      // check for append only table
      val isAppendOnlyTable = UpdatingPlanChecker.isAppendOnly(optimizedPlan)
      upsertSink.setIsAppendOnly(isAppendOnlyTable)

      // extract unique key fields
      val tableKeys: Option[Array[String]] = UpdatingPlanChecker.getUniqueKeyFields(optimizedPlan)

      // check that we have keys if the table has changes (is not append-only)
      tableKeys match {
        case Some(keys) => upsertSink.setKeyFields(keys)
        case None if isAppendOnlyTable => upsertSink.setKeyFields(null)
        case None if !isAppendOnlyTable => throw new TableException(
          "UpsertStreamTableSink requires that Table has a full primary keys if it is updated.")
      }

      val outputType = sink.getOutputType
      val resultType = getResultType(table.getRelNode, optimizedPlan)

      // translate the Table into a DataStream and provide the type that the TableSink expects.
      val result: DataStream[T] =
        translate(
          optimizedPlan,
          resultType,
          streamQueryConfig,
          withChangeFlag = true)(outputType)

      // Give the DataStream to the TableSink to emit it.
      upsertSink.asInstanceOf[UpsertStreamTableSink[Any]]
        .emitDataStream(result.asInstanceOf[DataStream[JTuple2[JBool, Any]]])

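As you can see, SQL that is not append-only needs a primary key in order to use UpsertStreamTableSink. The primary key is obtained through UpdatingPlanChecker#getUniqueKeyFields. For example, for the non-windowed GROUP BY mentioned above, the grouping key is the primary key: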

    case a: DataStreamGroupAggregate =>
      // get grouping keys
      val groupKeys = a.getRowType.getFieldNames.take(a.getGroupings.length)
      Some(groupKeys.map(e => (e, e)))

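
As a small illustration of the excerpt above, the sketch below derives the key of a group aggregate from its output fields. It assumes, as the take(...) call does, that the grouping fields come first in the output row. The function name and the uid/cnt field names are hypothetical, and the sketch returns plain field names rather than the pairs built in the real code.

```scala
object GroupKeys {
  // The first `groupingCount` output fields are the grouping fields,
  // and together they identify one result row.
  def groupingKeys(outputFields: Seq[String], groupingCount: Int): Seq[String] =
    outputFields.take(groupingCount)
}
```

For a query shaped like SELECT uid, COUNT(*) AS cnt FROM clicks GROUP BY uid, the output fields would be (uid, cnt) with one grouping field, so the key is uid.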
The details are not covered here either; interested readers can read the code themselves :)
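
To recap, the three branches of writeToSink can be condensed into one compatibility check. This is a sketch of the decision logic only, with hypothetical names (SinkKind, check). The error messages are copied from the excerpts above, and everything else is a simplification rather than Flink's API.

```scala
object SinkCompatibility {
  sealed trait SinkKind
  case object Append extends SinkKind
  case object Retract extends SinkKind
  case object Upsert extends SinkKind

  // Returns Right(()) if the sink can consume the query result, or
  // Left(message) where the corresponding writeToSink branch would throw.
  def check(
      kind: SinkKind,
      isAppendOnly: Boolean,
      uniqueKeys: Option[Seq[String]]): Either[String, Unit] =
    kind match {
      case Append if !isAppendOnly =>
        Left("AppendStreamTableSink requires that Table has only insert changes.")
      case Upsert if !isAppendOnly && uniqueKeys.isEmpty =>
        Left("UpsertStreamTableSink requires that Table has a full primary keys if it is updated.")
      case _ =>
        // Retract sinks accept any query; the other sinks accept the remaining cases.
        Right(())
    }
}
```

For a non-windowed GROUP BY (not append-only, with a key), check rejects Append and accepts both Retract and Upsert. For an outer join without a unique key, only Retract is accepted.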