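Preface
As the title suggests, this article explains window joins. As usual, we start from the example on the official website.
GitHub repository: flink-learn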
Window Join
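A window join joins the elements of two streams that share a common key and lie in the same window. These windows can be defined with a window assigner and are evaluated on elements from both streams.
The elements from both sides are then passed to a user-defined JoinFunction or FlatJoinFunction, where the user can emit results that meet the join criteria.
The general usage can be summarized as follows: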
stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)
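Some notes on semantics:
- The creation of pairwise combinations of elements of the two streams behaves like an inner join: elements of one stream are not emitted if there is no corresponding element from the other stream to join with.
- Joined elements have as their timestamp the largest timestamp that still lies in the respective window. For example, a window with bounds [5, 10) results in joined elements having 9 as their timestamp.
The above is a translation of the official documentation. The key phrase is "behaves like an inner join", that is, what we usually call an Inner Join. In addition, the elements of a window join must belong to the same window; elements in different windows cannot be joined.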
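To make the two semantic notes concrete, here is a minimal plain-Java sketch that does not use Flink at all. It assumes non-negative timestamps and tumbling windows of the form [start, start + size); the class name, method names, and sample events are hypothetical and chosen only for illustration. It is a nested-loop simulation of the semantics, not how Flink implements the join.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of window join semantics (not Flink code).
final class TumblingJoinSketch {

    /** Start of the tumbling window [start, start + size) that contains a non-negative timestamp. */
    static long windowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    /** Timestamp given to joined results: the largest timestamp still inside the window. */
    static long maxTimestamp(long windowStart, long size) {
        return windowStart + size - 1;
    }

    /** Emits one result per pair that shares both the key and the window (inner join). */
    static List<String> join(String[] leftKeys, long[] leftTs,
                             String[] rightKeys, long[] rightTs, long size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < leftKeys.length; i++) {
            for (int j = 0; j < rightKeys.length; j++) {
                boolean sameKey = leftKeys[i].equals(rightKeys[j]);
                boolean sameWindow =
                        windowStart(leftTs[i], size) == windowStart(rightTs[j], size);
                if (sameKey && sameWindow) {
                    long ts = maxTimestamp(windowStart(leftTs[i], size), size);
                    out.add(leftKeys[i] + "@" + leftTs[i] + ","
                            + rightKeys[j] + "@" + rightTs[j] + " -> ts=" + ts);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample events (assumed): key plus event timestamp, window size 5.
        String[] leftKeys = {"tom", "tom", "bob"};
        long[] leftTs = {5, 7, 6};
        String[] rightKeys = {"tom", "tom", "alice"};
        long[] rightTs = {9, 12, 6};
        for (String line : join(leftKeys, leftTs, rightKeys, rightTs, 5)) {
            System.out.println(line);
        }
    }
}
```

With these sample events only the two "tom" pairs inside window [5, 10) are emitted, both with timestamp 9. "bob" and "alice" have no partner on the other side, and the "tom" event at 12 falls into a different window, so none of them produce output.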
Next, let's look at the official example: WindowJoin
// Define a type for the Grade source, a type for the Salary source, and a Person type for the output
case class Grade(name: String, grade: Int)
case class Salary(name: String, salary: Int)
case class Person(name: String, grade: Int, salary: Int)
/* e.g. Grade (job grade)
 * ((tom 3),
 *  (jerry 4),
 *  (alice 5),
 *  (bob 6),
 *  (john 2),
 *  (grace 3))
 */
/* e.g. Salary
 * ((tom 8000),
 *  (jerry 6000),
 *  (alice 5000),
 *  (bob 3000),
 *  (john 10000),
 *  (grace 8000))
 */
/* e.g. Person
 * Person(bob,6,3000)
 * Person(grace,3,8000)
 * Person(grace,3,8000)
 * Person(grace,3,8000)
 * Person(tom,3,8000)
 * Person(tom,3,8000)
 */
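- 1. Parameter setup
// Parse the arguments passed to the program
val params = ParameterTool.fromArgs(args)
// Window size in milliseconds
val windowSize = params.getLong("windowSize", 2000)
// Controls how fast elements are generated (see the ThrottledIterator class for details)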
val rate = params.getLong("rate", 3)
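- 2. Set up the Flink driver
// Obtain the execution environment and run this example in "ingestion time"
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
// Make the parameters available in the web interface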
env.getConfig.setGlobalJobParameters(params)
- 3. Create the data sources
// Create the data sources for grades and salaries
val grades = WindowJoinSampleData.getGradeSource(env, rate)
val salaries = WindowJoinSampleData.getSalarySource(env, rate)
WindowJoinSampleData
import java.io.Serializable
import java.util.Random
import WindowJoin.{Grade, Salary}
import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._
/**
* @Author: king
* @Date: 2019-01-14
* @Desc: TODO
*/
object WindowJoinSampleData {
  private[join] val NAMES = Array("tom", "jerry", "alice", "bob", "john", "grace")
  private[join] val GRADE_COUNT = 5
  private[join] val SALARY_MAX = 100000
  /**
   * Continuously generates (name, grade).
   */
  def getGradeSource(env: StreamExecutionEnvironment, rate: Long): DataStream[Grade] = {
    env.fromCollection(new ThrottledIterator(new GradeSource().asJava, rate).asScala)
  }
  /**
   * Continuously generates (name, salary).
   */
  def getSalarySource(env: StreamExecutionEnvironment, rate: Long): DataStream[Salary] = {
    // Note: this brings in a new class, ThrottledIterator
    env.fromCollection(new ThrottledIterator(new SalarySource().asJava, rate).asScala)
  }
  // --------------------------------------------------------------------------
  class GradeSource extends Iterator[Grade] with Serializable {
    // As you can see, the source data is generated randomly
    private[this] val rnd = new Random(hashCode())
    def hasNext: Boolean = true
    def next: Grade = {
      Grade(NAMES(rnd.nextInt(NAMES.length)), rnd.nextInt(GRADE_COUNT) + 1)
    }
  }
  class SalarySource extends Iterator[Salary] with Serializable {
    private[this] val rnd = new Random(hashCode())
    def hasNext: Boolean = true
    def next: Salary = {
      Salary(NAMES(rnd.nextInt(NAMES.length)), rnd.nextInt(SALARY_MAX) + 1)
    }
  }
}
The data source class in turn brings in one more class, ThrottledIterator (a rate-control class):
package com.king.learn.Flink.streaming.join;
import java.io.Serializable;
import java.util.Iterator;
/**
* @Author: king
* @Date: 2019-01-14
* @Desc: TODO
*/
public class ThrottledIterator<T> implements Iterator<T>, Serializable {
    private static final long serialVersionUID = -694284704712549217L;
    @SuppressWarnings("NonSerializableFieldInSerializableClass")
    private final Iterator<T> source;
    private final long sleepBatchSize;
    private final long sleepBatchTime;
    private long lastBatchCheckTime;
    private long num;
    public ThrottledIterator(Iterator<T> source, long elementsPerSecond) {
        this.source = source;
        if (!(source instanceof Serializable)) {
            throw new IllegalArgumentException("source must be java.io.Serializable");
        }
        if (elementsPerSecond >= 100) {
            // how many elements would we emit per 50ms
            this.sleepBatchSize = elementsPerSecond / 20;
            this.sleepBatchTime = 50;
        } else if (elementsPerSecond >= 1) {
            // how long does element take
            this.sleepBatchSize = 1;
            this.sleepBatchTime = 1000 / elementsPerSecond;
        } else {
            throw new IllegalArgumentException("'elements per second' must be positive and not zero");
        }
    }
    @Override
    public boolean hasNext() {
        return source.hasNext();
    }
    @Override
    public T next() {
        // delay if necessary
        if (lastBatchCheckTime > 0) {
            if (++num >= sleepBatchSize) {
                num = 0;
                final long now = System.currentTimeMillis();
                final long elapsed = now - lastBatchCheckTime;
                if (elapsed < sleepBatchTime) {
                    try {
                        Thread.sleep(sleepBatchTime - elapsed);
                    } catch (InterruptedException e) {
                        // restore interrupt flag and proceed
                        Thread.currentThread().interrupt();
                    }
                }
                lastBatchCheckTime = now;
            }
        } else {
            lastBatchCheckTime = System.currentTimeMillis();
        }
        return source.next();
    }
    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}
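The throttling logic is easier to follow once the two constructor branches are isolated. The sketch below copies only that arithmetic into a hypothetical helper (ThrottlePlan and plan are illustrative names, not part of the example project) and shows which batch size and sleep interval result from a given rate.

```java
// Hypothetical helper that mirrors the constructor branches of ThrottledIterator.
final class ThrottlePlan {

    /** Returns {sleepBatchSize, sleepBatchTimeMillis} for the requested rate. */
    static long[] plan(long elementsPerSecond) {
        if (elementsPerSecond >= 100) {
            // High rates: emit a batch of elements, then check the clock every 50 ms.
            return new long[] {elementsPerSecond / 20, 50};
        } else if (elementsPerSecond >= 1) {
            // Low rates: one element per batch, spaced 1000 / rate ms apart (integer division).
            return new long[] {1, 1000 / elementsPerSecond};
        }
        throw new IllegalArgumentException("'elements per second' must be positive and not zero");
    }

    public static void main(String[] args) {
        for (long rate : new long[] {3, 100, 2000}) {
            long[] p = plan(rate);
            System.out.println("rate=" + rate + " -> batchSize=" + p[0] + ", batchTimeMs=" + p[1]);
        }
    }
}
```

With the default rate of 3 used in this example, each element is its own batch and the iterator sleeps so that roughly 333 ms pass between elements.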
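Let's move on; don't let the two helper classes distract you.
- 4. Now that we have two data sources, we can start the join
// Join the two input streams by name on a window.
// For testability, this functionality is in a separate function.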
val joined = joinStreams(grades, salaries, windowSize)
// The joinStreams function
def joinStreams(grades: DataStream[Grade],
                salaries: DataStream[Salary], windowSize: Long) : DataStream[Person] = {
  grades.join(salaries)
    .where(_.name)
    .equalTo(_.name)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
    .apply { (g, s) => Person(g.name, g.grade, s.salary) }
}
- 5. Print the results and run the Flink program
// Print the results with a single thread, rather than in parallel
joined.print().setParallelism(1)
// Execute the program
env.execute("Windowed Join Example")
- 6. Output
Person(tom,2,64923)
Person(tom,5,64923)
Person(alice,2,97608)
Person(jerry,1,21367)
Person(jerry,3,21367)
Person(tom,1,558)
Person(tom,5,558)
Person(grace,2,72006)
Person(bob,5,29517)
Person(bob,2,46812)
Person(bob,2,98463)
Person(bob,4,46812)
Person(bob,4,98463)
Person(tom,5,43748)
Person(alice,1,8029)
...
Tumbling Window Join
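When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed on to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, elements of one stream that have no elements from the other stream in their tumbling window are not emitted!
As shown in the figure, we define a tumbling window with a size of 2 milliseconds, which results in windows of the form [0,1], [2,3], …. The image shows the pairwise combinations of all elements in each window, which are passed on to the JoinFunction. Note that nothing is emitted in the tumbling window [6,7], because no elements exist in the green stream to be joined with the orange elements ⑥ and ⑦.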
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    // 2 ms tumbling windows, matching the description above
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
Sliding Window Join
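When performing a sliding window join, all elements with a common key and a common sliding window are joined as pairwise combinations and passed on to the JoinFunction or FlatJoinFunction. Elements of one stream that have no elements from the other stream in the current sliding window are not emitted! Note that some elements might be joined in one sliding window but not in another.
In this example we use sliding windows with a size of two milliseconds and slide them by one millisecond, resulting in the sliding windows [-1, 0], [0,1], [1,2], [2,3], …. The joined elements below the x-axis are the ones that are passed to the JoinFunction for each sliding window. Here you can also see how orange ② is joined with green ③ in the window [2,3], but is not joined with anything in the window [1,2].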
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
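Why orange ② and green ③ meet only in window [2,3] becomes clear when the windows of each timestamp are listed. The following plain-Java sketch is an illustration of sliding window assignment with zero offset, written for this article rather than taken from Flink. Windows are printed with inclusive bounds [start, start + size - 1] to match the notation used above, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of sliding window assignment (not Flink code).
final class SlidingWindowsSketch {

    /** All sliding windows (as inclusive {start, end} pairs) that contain the timestamp. */
    static List<long[]> assignWindows(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size - 1});
        }
        return windows;
    }

    /** Windows that contain both timestamps, i.e. where the two elements can be joined. */
    static List<long[]> sharedWindows(long a, long b, long size, long slide) {
        List<long[]> shared = new ArrayList<>();
        for (long[] wa : assignWindows(a, size, slide)) {
            for (long[] wb : assignWindows(b, size, slide)) {
                if (wa[0] == wb[0]) {
                    shared.add(wa);
                }
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        for (long[] w : assignWindows(2, 2, 1)) {
            System.out.println("timestamp 2 is in [" + w[0] + "," + w[1] + "]");
        }
        for (long[] w : sharedWindows(2, 3, 2, 1)) {
            System.out.println("2 and 3 share [" + w[0] + "," + w[1] + "]");
        }
    }
}
```

Timestamp 2 belongs to both [2,3] and [1,2], while timestamp 3 belongs to [3,4] and [2,3], so the only window the two elements share is [2,3].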
Session Window Join
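When performing a session window join, all elements with the same key that, when "combined", fulfill the session criteria are joined in pairwise combinations and passed on to the JoinFunction or FlatJoinFunction. Again this performs an inner join, so if a session window contains elements from only one stream, no output is emitted!
Here we define a session window join where sessions are separated by a gap of at least 1 ms. There are three sessions. In the first two sessions, the joined elements from both streams are passed to the JoinFunction. In the third session there are no elements in the green stream, so ⑧ and ⑨ are not joined!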
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
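The session behaviour described above can be imitated in a few lines of plain Java. This is a simplified sketch under stated assumptions: all elements share one key, a new session starts when two consecutive timestamps are more than the gap apart (the exact boundary handling in Flink may differ), and the orange and green timestamps are made-up values that merely resemble the description. None of the names come from Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of a session window join for a single key (not Flink code).
final class SessionJoinSketch {

    /** Splits timestamps into sessions, returned as inclusive {first, last} pairs. */
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        if (sorted.length == 0) {
            return result;
        }
        long first = sorted[0];
        long last = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] - last > gap) {
                // Assumed rule: a distance larger than the gap closes the session.
                result.add(new long[] {first, last});
                first = sorted[i];
            }
            last = sorted[i];
        }
        result.add(new long[] {first, last});
        return result;
    }

    /** Pairs every orange element with every green element of the same session. */
    static List<String> join(long[] orange, long[] green, long gap) {
        long[] all = new long[orange.length + green.length];
        System.arraycopy(orange, 0, all, 0, orange.length);
        System.arraycopy(green, 0, all, orange.length, green.length);
        List<String> out = new ArrayList<>();
        for (long[] s : sessions(all, gap)) {
            for (long o : orange) {
                for (long g : green) {
                    if (o >= s[0] && o <= s[1] && g >= s[0] && g <= s[1]) {
                        out.add(o + "," + g);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] orange = {1, 2, 5, 6, 8, 9}; // assumed sample timestamps
        long[] green = {0, 4, 5};           // assumed sample timestamps
        System.out.println(join(orange, green, 1));
    }
}
```

With these sample values the merged timestamps form three sessions, [0,2], [4,6] and [8,9]. The first two yield pairs, and the third contains only orange elements, so 8 and 9 produce no output.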
Interval Join
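The interval join joins elements of two streams (we'll call them A and B for now) with a common key, where elements of stream B have timestamps that lie in a relative time interval to the timestamps of elements in stream A. This can also be expressed more formally as b.timestamp ∈ [a.timestamp + lowerBound; a.timestamp + upperBound] or a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound, where a and b are elements of A and B that share a common key. Both the lower and the upper bound can be either negative or positive, as long as the lower bound is always smaller than or equal to the upper bound. The interval join currently only performs inner joins. When a pair of elements is passed to the ProcessJoinFunction, it is assigned the larger timestamp of the two elements (which can be accessed via the ProcessJoinFunction.Context). Note that the interval join currently only supports event time.
In the example above, we join two streams 'orange' and 'green' with a lower bound of -2 milliseconds and an upper bound of +1 millisecond. By default, these boundaries are inclusive, but .lowerBoundExclusive() and .upperBoundExclusive() can be applied to change the behaviour. Using the more formal notation again, this translates to orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound, as indicated by the triangles.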
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            // emit the matched pair; left comes from orangeStream, right from greenStream
            out.collect(left + "," + right);
        }
    });
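The interval condition is simple enough to check by hand. The plain-Java sketch below evaluates the inclusive predicate orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound for a single key. The class, its methods, and the sample timestamps are assumptions made for illustration and do not come from Flink or from the figure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the interval join predicate for a single key (not Flink code).
final class IntervalJoinSketch {

    /** Inclusive interval check: a + lower <= b <= a + upper. */
    static boolean inInterval(long a, long b, long lower, long upper) {
        return a + lower <= b && b <= a + upper;
    }

    /** Timestamp assigned to a joined pair: the larger of the two element timestamps. */
    static long resultTimestamp(long a, long b) {
        return Math.max(a, b);
    }

    /** Emits "orange,green" for every pair that satisfies the interval condition. */
    static List<String> join(long[] orange, long[] green, long lower, long upper) {
        List<String> out = new ArrayList<>();
        for (long o : orange) {
            for (long g : green) {
                if (inInterval(o, g, lower, upper)) {
                    out.add(o + "," + g);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] orange = {0, 2, 3, 4, 5, 7}; // assumed sample timestamps
        long[] green = {0, 1, 6, 7};        // assumed sample timestamps
        System.out.println(join(orange, green, -2, 1));
        System.out.println("timestamp of pair (5,6): " + resultTimestamp(5, 6));
    }
}
```

With a lower bound of -2 and an upper bound of +1, orange 4 finds no green partner in [2,5] and is dropped, which is the inner join behaviour described above. The pair (5,6) would carry timestamp 6.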
As the official examples show, we have only implemented an Inner Join. What about a left join? That is a topic for next time.