1、Environment

  • spark 2.4.0
  • scala 2.11.8
  • jdk 1.8
  • maven
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
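
For reference, a minimal Scala sketch (spark-shell style) of using this dependency as a streaming source; the broker address localhost:9092 and the topic name test are placeholder assumptions, not part of the original setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("output-mode-demo")   // illustrative application name
  .master("local[*]")
  .getOrCreate()

// Read a Kafka topic as an unbounded streaming DataFrame.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "test")                           // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")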

2、Source code

package org.apache.spark.sql.streaming;

import org.apache.spark.annotation.InterfaceStability;
import org.apache.spark.sql.catalyst.streaming.InternalOutputModes;

/**
 * OutputMode describes what data will be written to a streaming sink when there is
 * new data available in a streaming DataFrame/Dataset.
 *
 * @since 2.0.0
 */
@InterfaceStability.Evolving
public class OutputMode {

  /**
   * OutputMode in which only the new rows in the streaming DataFrame/Dataset will be
   * written to the sink. This output mode can only be used in queries that do not
   * contain any aggregation.
   *
   * @since 2.0.0
   */
  public static OutputMode Append() {
    return InternalOutputModes.Append$.MODULE$;
  }

  /**
   * OutputMode in which all the rows in the streaming DataFrame/Dataset will be written
   * to the sink every time there are some updates. This output mode can only be used in queries
   * that contain aggregations.
   *
   * @since 2.0.0
   */
  public static OutputMode Complete() {
    return InternalOutputModes.Complete$.MODULE$;
  }

  /**
   * OutputMode in which only the rows that were updated in the streaming DataFrame/Dataset will
   * be written to the sink every time there are some updates. If the query doesn't contain
   * aggregations, it will be equivalent to `Append` mode.
   *
   * @since 2.1.1
   */
  public static OutputMode Update() {
    return InternalOutputModes.Update$.MODULE$;
  }
}
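
These three factory methods are what DataStreamWriter.outputMode accepts; the equivalent lowercase strings "append", "complete" and "update" can be passed instead. A minimal sketch, assuming df is a streaming DataFrame (for example the Kafka source from section 1) and using the console sink purely for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

// Start a streaming query with an explicit OutputMode object.
def startWithMode(df: DataFrame, mode: OutputMode): StreamingQuery = {
  df.writeStream
    .outputMode(mode)      // OutputMode.Append(), OutputMode.Complete() or OutputMode.Update()
    .format("console")     // console sink, used here only for demonstration
    .start()
}

// The string form is equivalent, e.g. df.writeStream.outputMode("update")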

3、Differences

  • Append: the default mode. Each row is output exactly once (assuming a fault-tolerant sink); only the rows newly appended to the result table in the current batch are emitted. Works with queries built from select, where, map, flatMap, filter, join, and so on.
  • Complete: after every trigger, the entire up-to-date result table is emitted. Only supported for queries that contain aggregations.
  • Update (available since Spark 2.1.1): only the rows of the result table that were updated in the current batch are emitted; see the Scala sketch after this list for the three modes side by side.
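
A minimal sketch contrasting the three modes, assuming the Kafka-sourced df from section 1; the grouping column and the console sink are illustrative choices only:

import org.apache.spark.sql.streaming.OutputMode

// An aggregated result table: running counts per value.
// Append is not allowed for this query, but Complete and Update are.
val counts = df.groupBy("value").count()

// Complete: every trigger re-emits the whole up-to-date result table.
val completeQuery = counts.writeStream
  .outputMode(OutputMode.Complete())
  .format("console")
  .start()

// Update: every trigger would emit only the rows whose counts changed in that batch.
// val updateQuery = counts.writeStream.outputMode(OutputMode.Update()).format("console").start()

// Append: only valid without aggregation, e.g. a plain filter over the source rows.
// val appendQuery = df.where("value IS NOT NULL").writeStream
//   .outputMode(OutputMode.Append()).format("console").start()

completeQuery.awaitTermination()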

 

4、References

http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#output-modes