1. Introduction

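In stream processing, an intermediate step that performs a time-consuming operation or exchanges data with an external system adds latency to the whole stream. Flink 1.2 introduced Asynchronous I/O, which supports asynchronous operations and thereby improves the performance and throughput of Flink's interaction with external data systems.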
[Figure: synchronous vs. asynchronous I/O. Image source: the official Flink website.]

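The brown bars in the figure represent waiting time; network waits clearly limit throughput and add latency. To avoid the cost of synchronous access, asynchronous mode handles multiple requests and responses concurrently. That is, you can send the requests for users a, b, c, and so on to the database back to back and handle whichever response comes back first, so consecutive requests never block waiting on one another, as shown on the right side of the figure. This is the principle behind Async I/O.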
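
The idea of "handle whichever response comes back first" can be illustrated with a small sketch in plain Java. It is not Flink code: the class and method names are hypothetical, and the "database" is simulated by completing the futures by hand in a chosen order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration, not part of Flink.
class AsyncRequestSketch {

    /**
     * Sends every request up front without waiting, then lets the simulated
     * backend answer in completionOrder. Returns the replies in the order
     * in which their callbacks ran.
     */
    static List<String> handleInCompletionOrder(List<String> requests, List<String> completionOrder) {
        Map<String, CompletableFuture<String>> inFlight = new LinkedHashMap<>();
        List<String> handled = new ArrayList<>();

        // Issue all requests back to back; each one only registers a callback.
        for (String user : requests) {
            CompletableFuture<String> reply = new CompletableFuture<>();
            reply.thenAccept(handled::add);
            inFlight.put(user, reply);
        }

        // The backend answers in its own order; each completion fires its callback.
        for (String user : completionOrder) {
            inFlight.get(user).complete("reply-" + user);
        }
        return handled;
    }

    public static void main(String[] args) {
        // Requests go out as a, b, c; replies come back as c, a, b.
        System.out.println(handleInCompletionOrder(List.of("a", "b", "c"), List.of("c", "a", "b")));
        // prints [reply-c, reply-a, reply-b]
    }
}
```

No request waits for the previous one to finish, which is the property the right-hand side of the figure describes.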

2. How It Works

2.1 API

/**
 * An implementation of the 'AsyncFunction' that sends requests and sets the callback.
 */
class AsyncDatabaseRequest extends AsyncFunction[String, (String, String)] {
    /** The database specific client that can issue concurrent requests with callbacks */
    lazy val client: DatabaseClient = new DatabaseClient(host, port, credentials)
    /** The context used for the future callbacks */
    implicit lazy val executor: ExecutionContext = ExecutionContext.fromExecutor(Executors.directExecutor())
    override def asyncInvoke(str: String, resultFuture: ResultFuture[(String, String)]): Unit = {
        // issue the asynchronous request, receive a future for the result
        val resultFutureRequested: Future[String] = client.query(str)
        // set the callback to be executed once the request by the client is complete
        // the callback simply forwards the result to the result future
        // (foreach is used here because onSuccess is deprecated since Scala 2.12)
        resultFutureRequested.foreach { result =>
            resultFuture.complete(Iterable((str, result)))
        }
    }
}
// create the original stream
val stream: DataStream[String] = ...
// apply the async I/O transformation: choose the output mode (unordered here), the timeout, and the maximum number of in-flight async requests
val resultStream: DataStream[(String, String)] =
    AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100)

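AsyncDataStream provides two methods, orderedWait and unorderedWait, which correspond to the ordered and unordered output modes.
Two output modes exist because the completion time of an asynchronous request is not deterministic: a request issued earlier may finish later than one issued after it. In ordered mode, results are emitted in exactly the order in which the records arrived. In unordered mode, results are emitted in the order in which the requests complete, so the result of whichever request finishes first is emitted first.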

Note

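When event time is used, unordered mode still handles watermarks correctly: the results of records between two watermarks may be emitted out of order, but a record that arrived after a watermark is never emitted before a record that arrived before that watermark.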

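Because the completion time of an asynchronous request is not deterministic, you must set a timeout for requests and configure the maximum number of asynchronous requests that may be in flight at the same time.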
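
To make the two settings concrete, here is a minimal sketch in plain Java (Java 9+ for orTimeout). It is an assumption-laden model, not Flink's implementation: BoundedAsyncSketch, track, and the fallback value are hypothetical names, a Semaphore stands in for the capacity limit, and any failure (including a timeout) is mapped to a fallback value. In Flink itself the timeout is delivered through the AsyncFunction.timeout callback.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not part of Flink.
class BoundedAsyncSketch {
    private final Semaphore permits;     // one permit per in-flight request
    private final long timeoutMillis;

    BoundedAsyncSketch(int capacity, long timeoutMillis) {
        this.permits = new Semaphore(capacity);
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Admits the request if fewer than `capacity` requests are in flight.
     * Returns empty when the limit is reached, so the caller has to wait
     * (the analogue of backpressure). An admitted request yields "fallback"
     * if it fails or does not finish within the timeout.
     */
    Optional<CompletableFuture<String>> track(CompletableFuture<String> request) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        return Optional.of(request
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .exceptionally(error -> "fallback")
            // free the slot only after the request has finished one way or another
            .whenComplete((result, error) -> permits.release()));
    }

    public static void main(String[] args) {
        BoundedAsyncSketch client = new BoundedAsyncSketch(1, 100);
        CompletableFuture<String> slow = new CompletableFuture<>();   // never answered
        CompletableFuture<String> tracked = client.track(slow).orElseThrow();
        System.out.println(client.track(new CompletableFuture<>()).isPresent()); // limit reached
        System.out.println(tracked.join());                                     // after the timeout
    }
}
```

The permit is released in whenComplete, so a timed-out request frees its slot just like a successful one.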

AsyncDataStream.orderedWait

def orderedWait[IN, OUT: TypeInformation](
    input: DataStream[IN],
    asyncFunction: AsyncFunction[IN, OUT],
    timeout: Long, // timeout
    timeUnit: TimeUnit,
    capacity: Int) // maximum number of in-flight async requests
  : DataStream[OUT] = {
  val javaAsyncFunction = wrapAsJavaAsyncFunction(asyncFunction)
  val outType : TypeInformation[OUT] = implicitly[TypeInformation[OUT]]
  asScalaStream(JavaAsyncDataStream.orderedWait[IN, OUT](
    input.javaStream,
    javaAsyncFunction,
    timeout,
    timeUnit,
    capacity).returns(outType))
}

AsyncDataStream.unorderedWait

def unorderedWait[IN, OUT: TypeInformation](
    input: DataStream[IN],
    asyncFunction: AsyncFunction[IN, OUT],
    timeout: Long, // timeout
    timeUnit: TimeUnit,
    capacity: Int) // maximum number of in-flight async requests
  : DataStream[OUT] = {
  val javaAsyncFunction = wrapAsJavaAsyncFunction(asyncFunction)
  val outType : TypeInformation[OUT] = implicitly[TypeInformation[OUT]]
  asScalaStream(JavaAsyncDataStream.unorderedWait[IN, OUT](
    input.javaStream,
    javaAsyncFunction,
    timeout,
    timeUnit,
    capacity).returns(outType))
}

2.2 Implementation

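At runtime, AsyncDataStream is translated into the AsyncWaitOperator operator, which is a subclass of AbstractUdfStreamOperator. Let's look at how AsyncWaitOperator is implemented. AsyncWaitOperator uses a StreamElementQueue to guarantee the output order of records. It has two implementations: OrderedStreamElementQueue and UnorderedStreamElementQueue.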

Basic principle

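The biggest difference between AsyncWaitOperator and other operators is that its input and output are not synchronous. AsyncWaitOperator therefore uses a producer-consumer model internally, with a queue decoupling the asynchronous computation from the emission of results. StreamElementQueue provides the queue abstraction: a consumer thread, the Emitter, takes completed results from it and emits them to the downstream operator, while the asynchronous requests act as the queue's producers. Note that in the version of the code shown below, completed results are emitted through the MailboxExecutor rather than by a dedicated thread. The basic processing logic is shown in the figure below.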
[Figure: basic processing logic of AsyncWaitOperator]

public class AsyncWaitOperator<IN, OUT>
      extends AbstractUdfStreamOperator<OUT, AsyncFunction<IN, OUT>>
      implements OneInputStreamOperator<IN, OUT> {
   private static final long serialVersionUID = 1L;
   private static final String STATE_NAME = "_async_wait_operator_state_";
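   // maximum capacity of the queue
   private final int capacity;
   // output mode: ordered or unordered
   private final AsyncDataStream.OutputMode outputMode;
   // timeout
   private final long timeout;
   // serializer for the input stream elements stored in snapshots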
   private transient StreamElementSerializer<IN> inStreamElementSerializer;
   /** Recovered input stream elements. */
   private transient ListState<StreamElement> recoveredStreamElements;
   // output queue
   private transient StreamElementQueue<OUT> queue;
   /** Mailbox executor used to yield while waiting for buffers to empty. */
   private final transient MailboxExecutor mailboxExecutor;
   private transient TimestampedCollector<OUT> timestampedCollector;
   public AsyncWaitOperator(
         @Nonnull AsyncFunction<IN, OUT> asyncFunction,
         long timeout,
         int capacity,
         @Nonnull AsyncDataStream.OutputMode outputMode,
         @Nonnull MailboxExecutor mailboxExecutor) {
      super(asyncFunction);
      // TODO this is a temporary fix for the problems described under FLINK-13063 at the cost of breaking chains for
      //  AsyncOperators.
      setChainingStrategy(ChainingStrategy.HEAD);
      Preconditions.checkArgument(capacity > 0, "The number of concurrent async operation should be greater than 0.");
      this.capacity = capacity;
      this.outputMode = Preconditions.checkNotNull(outputMode, "outputMode");
      this.timeout = timeout;
      this.mailboxExecutor = mailboxExecutor;
   }
   @Override
   public void setup(StreamTask<?, ?> containingTask, StreamConfig config, Output<StreamRecord<OUT>> output) {
      super.setup(containingTask, config, output);
      this.inStreamElementSerializer = new StreamElementSerializer<>(
         getOperatorConfig().<IN>getTypeSerializerIn1(getUserCodeClassloader()));
      // choose between ordered and unordered mode
      // AsyncWaitOperator uses a StreamElementQueue to guarantee output order; it has two implementations: OrderedStreamElementQueue and UnorderedStreamElementQueue.
      switch (outputMode) {
         case ORDERED:
            queue = new OrderedStreamElementQueue<>(capacity);
            break;
         case UNORDERED:
            queue = new UnorderedStreamElementQueue<>(capacity);
            break;
         default:
            throw new IllegalStateException("Unknown async mode: " + outputMode + '.');
      }
      this.timestampedCollector = new TimestampedCollector<>(output);
   }
   @Override
   public void open() throws Exception {
      super.open();
      if (recoveredStreamElements != null) {
         for (StreamElement element : recoveredStreamElements.get()) {
            if (element.isRecord()) {
               processElement(element.<IN>asRecord());
            }
            else if (element.isWatermark()) {
               processWatermark(element.asWatermark());
            }
            else if (element.isLatencyMarker()) {
               processLatencyMarker(element.asLatencyMarker());
            }
            else {
               throw new IllegalStateException("Unknown record type " + element.getClass() +
                  " encountered while opening the operator.");
            }
         }
         recoveredStreamElements = null;
      }
   }
   @Override
   public void processElement(StreamRecord<IN> element) throws Exception {
      // add the element to the queue
      final ResultFuture<OUT> entry = addToWorkQueue(element);
      // when the async I/O finishes, the user function calls resultHandler.complete() to hand the results to the ResultHandler
      final ResultHandler resultHandler = new ResultHandler(element, entry);
      // register the timeout timer
      if (timeout > 0L) {
         final long timeoutTimestamp = timeout + getProcessingTimeService().getCurrentProcessingTime();
         final ScheduledFuture<?> timeoutTimer = getProcessingTimeService().registerTimer(
            timeoutTimestamp,
            timestamp -> userFunction.timeout(element.getValue(), resultHandler));
         resultHandler.setTimeoutTimer(timeoutTimer);
      }
      // invoke the async I/O
      userFunction.asyncInvoke(element.getValue(), resultHandler);
   }
   @Override
   public void processWatermark(Watermark mark) throws Exception {
      addToWorkQueue(mark);
      // watermarks are always completed
      // if there is no prior element, we can directly emit them
      // this also avoids watermarks being held back until the next element has been processed
      outputCompletedElement();
   }
   @Override
   public void snapshotState(StateSnapshotContext context) throws Exception {
      super.snapshotState(context);
      ListState<StreamElement> partitionableState =
         getOperatorStateBackend().getListState(new ListStateDescriptor<>(STATE_NAME, inStreamElementSerializer));
      partitionableState.clear();
      try {
         partitionableState.addAll(queue.values());
      } catch (Exception e) {
         partitionableState.clear();
         throw new Exception("Could not add stream element queue entries to operator state " +
            "backend of operator " + getOperatorName() + '.', e);
      }
   }
   @Override
   public void initializeState(StateInitializationContext context) throws Exception {
      super.initializeState(context);
      recoveredStreamElements = context
         .getOperatorStateStore()
         .getListState(new ListStateDescriptor<>(STATE_NAME, inStreamElementSerializer));
   }
   @Override
   public void close() throws Exception {
      try {
         waitInFlightInputsFinished();
      }
      finally {
         super.close();
      }
   }
   
// complete() method of ResultHandler
@Override
public void complete(Collection<OUT> results) {
	Preconditions.checkNotNull(results, "Results must not be null, use empty collection to emit nothing");
	// guard: only the first completion takes effect
	if (!completed.compareAndSet(false, true)) {
		return;
	}
	// hand the results over to be emitted downstream
	processInMailbox(results);
}
private void processInMailbox(Collection<OUT> results) {
		// results are emitted from the mailbox thread; processResults() does the processing
	mailboxExecutor.execute(
		() -> processResults(results),
		"Result in AsyncWaitOperator of input %s", results);
}
private void processResults(Collection<OUT> results) {
	// a result has arrived, so cancel the timeout timer
	if (timeoutTimer != null) {
		// canceling in mailbox thread avoids https://issues.apache.org/jira/browse/FLINK-13635
		timeoutTimer.cancel(true);
	}
	// update the entry in the queue
	resultFuture.complete(results);
	// emit the completed results from the queue
	outputCompletedElement();
}
// emit the results downstream
private void outputCompletedElement() {
	if (queue.hasCompletedElements()) {
		// emit only one element to not block the mailbox thread unnecessarily
		queue.emitCompletedElement(timestampedCollector);
		// if there are more completed elements, emit them with subsequent mails
		if (queue.hasCompletedElements()) {
			mailboxExecutor.execute(this::outputCompletedElement, "AsyncWaitOperator#outputCompletedElement");
		}
	}
}

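In processResults() above, resultFuture.complete() updates the entry in the queue. outputCompletedElement() then emits the elements in the queue that have already completed.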
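
Two details of the code above are worth isolating: the compareAndSet guard that makes only the first completion count, and the hand-off of the actual work to a single mailbox thread. The following plain-Java sketch models both under simplifying assumptions. MailboxSketch, Handler, and drain are hypothetical names, and a ConcurrentLinkedQueue of Runnables stands in for Flink's MailboxExecutor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration, not part of Flink.
class MailboxSketch {
    // Callbacks on any thread may add mails; only the draining thread runs them.
    private final Queue<Runnable> mailbox = new ConcurrentLinkedQueue<>();
    // Touched only while draining, so it needs no synchronization.
    final List<String> emitted = new ArrayList<>();

    final class Handler {
        private final AtomicBoolean completed = new AtomicBoolean(false);

        /** Safe to call from any thread; every call after the first is ignored. */
        void complete(String result) {
            if (!completed.compareAndSet(false, true)) {
                return;
            }
            // Do not emit here: enqueue the work for the mailbox thread instead.
            mailbox.add(() -> emitted.add(result));
        }
    }

    /** Runs all pending mails on the calling thread and returns how many ran. */
    int drain() {
        int processed = 0;
        Runnable mail;
        while ((mail = mailbox.poll()) != null) {
            mail.run();
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        MailboxSketch operator = new MailboxSketch();
        MailboxSketch.Handler handler = operator.new Handler();
        handler.complete("result");
        handler.complete("late duplicate, e.g. from a timeout");  // ignored
        System.out.println(operator.drain());   // prints 1
        System.out.println(operator.emitted);   // prints [result]
    }
}
```

Because emission happens only while draining, a result and its timeout can race freely without two threads ever writing to the output at once.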

Ordered: OrderedStreamElementQueue

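OrderedStreamElementQueue implements ordered output. Internally it is backed by a java.util.Queue, and an element is emitted only when the element at the head of the queue has completed.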

@Override
public boolean hasCompletedElements() {
   // the element at the head of the queue has completed and can be emitted
   return !queue.isEmpty() && queue.peek().isDone();
}
// emit an element
@Override
public void emitCompletedElement(TimestampedCollector<OUT> output) {
   // check whether the head element can be emitted
   if (hasCompletedElements()) {
      final StreamElementQueueEntry<OUT> head = queue.poll();
      head.emitResult(output);
   }
}
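
The head-of-queue rule can be tried out with a simplified stand-in. The sketch below is plain Java rather than Flink's class: OrderedQueueSketch, Entry, and emitAllCompleted are hypothetical names, and results are plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration, not part of Flink.
class OrderedQueueSketch {
    static final class Entry {
        final String input;
        String result;                       // null until the request completes
        Entry(String input) { this.input = input; }
        boolean isDone() { return result != null; }
    }

    private final Queue<Entry> queue = new ArrayDeque<>();

    /** Entries are queued in arrival order. */
    Entry add(String input) {
        Entry entry = new Entry(input);
        queue.add(entry);
        return entry;
    }

    /** True only if the OLDEST entry is done; later completions do not count. */
    boolean hasCompletedElements() {
        return !queue.isEmpty() && queue.peek().isDone();
    }

    /** Emits from the head for as long as the head is done. */
    List<String> emitAllCompleted() {
        List<String> out = new ArrayList<>();
        while (hasCompletedElements()) {
            out.add(queue.poll().result);
        }
        return out;
    }

    public static void main(String[] args) {
        OrderedQueueSketch q = new OrderedQueueSketch();
        Entry a = q.add("a");
        Entry b = q.add("b");
        Entry c = q.add("c");
        c.result = "C";
        System.out.println(q.emitAllCompleted()); // prints []  (a is still pending)
        a.result = "A";
        System.out.println(q.emitAllCompleted()); // prints [A]
        b.result = "B";
        System.out.println(q.emitAllCompleted()); // prints [B, C]
    }
}
```

The cost of ordered mode is visible here: C finished first but had to wait behind A and B.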

Unordered: UnorderedStreamElementQueue

UnorderedStreamElementQueue implements unordered output, using a single piece of logic to cover both the processing-time and the event-time case.

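Unordered processing means that the order in which records enter the operator has no necessary relation to the order in which the processed results flow to the next operator.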

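  • With processing time: the application is not sensitive to record order, so processing can be unordered in the strict sense.
  • With event time: the application is sensitive to record order, and the order strongly affects its results. The application periodically generates watermarks, which flow between tasks and operators. Reordering records between two watermarks does not harm the results, but if records on either side of a watermark are sent downstream in a different order from the one in which they were received, the results are likely to be wrong. Unordered processing in this mode therefore means that records between watermarks are unordered, while records before a watermark must be sent downstream before that watermark and records after it must be sent downstream after it.
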
static class Segment<OUT> {
   /** Unfinished input elements. */
   private final Set<StreamElementQueueEntry<OUT>> incompleteElements;
   /** Undrained finished elements. */
   private final Queue<StreamElementQueueEntry<OUT>> completedElements;
}
public final class UnorderedStreamElementQueue<OUT> implements StreamElementQueue<OUT> {
   private static final Logger LOG = LoggerFactory.getLogger(UnorderedStreamElementQueue.class);
   /** Capacity of this queue. */
   private final int capacity;
   /** Queue of queue entries segmented by watermarks. */
   private final Deque<Segment<OUT>> segments;
   
   // check whether the first segment in segments has completed elements
   @Override
    public boolean hasCompletedElements() {
       return !this.segments.isEmpty() && this.segments.getFirst().hasCompleted();
    }
}

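A Segment keeps its unfinished entries in a set and its finished entries in a queue; UnorderedStreamElementQueue wraps the segments in an outer queue.
This outer double-ended queue (Deque) is what handles both the processing-time and the event-time unordered case.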

Processing-time unordered: no watermarks arrive, so segments only ever contains a single Segment and all elements go into that one segment.

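Event-time unordered: each time a watermark is added, a new empty Segment is appended to the segments queue, and the elements added afterwards go into that new segment. This ensures that elements are unordered only between watermarks.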
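
The segment idea can be modelled in a few lines of plain Java. The sketch below is a simplification and not Flink's class: UnorderedQueueSketch and its methods are hypothetical, results are strings, and, as an assumption of this sketch, each watermark is placed in a segment of its own before a fresh segment is opened for the records that follow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration, not part of Flink.
class UnorderedQueueSketch {
    static final class Segment {
        final Set<Entry> incomplete = new HashSet<>();
        final Queue<String> completed = new ArrayDeque<>();
    }

    static final class Entry {
        final String input;
        final Segment segment;   // the segment this entry was added to
        Entry(String input, Segment segment) { this.input = input; this.segment = segment; }
    }

    private final Deque<Segment> segments = new ArrayDeque<>();

    Entry addRecord(String input) {
        if (segments.isEmpty()) {
            segments.addLast(new Segment());
        }
        Entry entry = new Entry(input, segments.getLast());
        entry.segment.incomplete.add(entry);
        return entry;
    }

    void addWatermark(long timestamp) {
        Segment watermarkSegment = new Segment();
        watermarkSegment.completed.add("WM(" + timestamp + ")");  // watermarks are always done
        segments.addLast(watermarkSegment);
        segments.addLast(new Segment());   // later records land behind the watermark
    }

    void complete(Entry entry, String result) {
        if (entry.segment.incomplete.remove(entry)) {
            entry.segment.completed.add(result);
        }
    }

    /** Emits only from the first segment, moving on once it is fully drained. */
    List<String> emitAllCompleted() {
        List<String> out = new ArrayList<>();
        while (!segments.isEmpty()) {
            Segment head = segments.getFirst();
            while (!head.completed.isEmpty()) {
                out.add(head.completed.poll());
            }
            if (head.incomplete.isEmpty() && segments.size() > 1) {
                segments.removeFirst();
            } else {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        UnorderedQueueSketch q = new UnorderedQueueSketch();
        Entry a = q.addRecord("a");
        Entry b = q.addRecord("b");
        q.addWatermark(10);
        Entry c = q.addRecord("c");
        q.complete(c, "C");
        System.out.println(q.emitAllCompleted()); // prints []  (c is behind the watermark)
        q.complete(b, "B");
        System.out.println(q.emitAllCompleted()); // prints [B] (b overtakes a)
        q.complete(a, "A");
        System.out.println(q.emitAllCompleted()); // prints [A, WM(10), C]
    }
}
```

Without any watermark the deque holds a single segment, which reduces to fully unordered output as in the processing-time case.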

Fault tolerance

@Override
public void snapshotState(StateSnapshotContext context) throws Exception {
   super.snapshotState(context);
   ListState<StreamElement> partitionableState =
      getOperatorStateBackend().getListState(new ListStateDescriptor<>(STATE_NAME, inStreamElementSerializer));
   partitionableState.clear();
   try {
      // simply store the elements in the queue in the state
      partitionableState.addAll(queue.values());
   } catch (Exception e) {
      partitionableState.clear();
      throw new Exception("Could not add stream element queue entries to operator state " +
         "backend of operator " + getOperatorName() + '.', e);
   }
}

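The snapshotState function saves the operator's state, which is the basis for state consistency.
Taking a snapshot of AsyncWaitOperator is very simple. As the code shows, it performs the following steps:

  • Clear the previous contents of the state store.
  • Take all the entries out of the queue and put them into the state store.
  • Take the snapshot.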
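
The steps above, together with the recovery loop in open(), can be sketched in plain Java. This is a simplified model under stated assumptions, not Flink code: SnapshotSketch and its methods are hypothetical, a List stands in for ListState, and inputs are strings that are assumed to be distinct.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not part of Flink.
class SnapshotSketch {
    // Inputs whose asynchronous requests have not been emitted yet.
    private final Deque<String> pendingInputs = new ArrayDeque<>();
    // Stand-in for the operator's ListState.
    private final List<String> state = new ArrayList<>();

    void process(String input) {
        pendingInputs.addLast(input);   // the real operator would also fire the async request
    }

    void markEmitted(String input) {
        pendingInputs.remove(input);
    }

    /** Clear the old state, then store every element still in the queue. */
    List<String> snapshot() {
        state.clear();
        state.addAll(pendingInputs);
        return new ArrayList<>(state);
    }

    /** On recovery the stored inputs are simply processed again. */
    static SnapshotSketch restore(List<String> snapshot) {
        SnapshotSketch restored = new SnapshotSketch();
        for (String input : snapshot) {
            restored.process(input);
        }
        return restored;
    }

    List<String> pending() {
        return new ArrayList<>(pendingInputs);
    }

    public static void main(String[] args) {
        SnapshotSketch operator = new SnapshotSketch();
        operator.process("a");
        operator.process("b");
        operator.process("c");
        operator.markEmitted("b");                     // b finished before the checkpoint
        List<String> checkpoint = operator.snapshot();
        System.out.println(checkpoint);                // prints [a, c]
        System.out.println(restore(checkpoint).pending()); // prints [a, c]
    }
}
```

Storing the inputs rather than the results is what lets recovery reissue the requests that were still in flight when the checkpoint was taken.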

3. Example

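import java.util.concurrent.{ExecutorService, Executors, ThreadLocalRandom, TimeUnit}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.{AsyncDataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala.async.{ResultFuture, RichAsyncFunction}
import org.apache.flink.util.ExecutorUtils

object AsyncIOExample {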
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(10)
    import org.apache.flink.api.scala._
    val  inputStream = env.addSource(new CustomNonParallelSourceFunction)
    
    val result1 = AsyncDataStream.orderedWait(inputStream,new SampleAsyncFunction,1000,TimeUnit.MILLISECONDS,20)
    val result2 = AsyncDataStream.unorderedWait(inputStream,new SampleAsyncFunction,1000,TimeUnit.MILLISECONDS,20)
    result1.print("result1")
    result2.print("result2")
    env.execute("AsyncIOExample")
  }
  class CustomNonParallelSourceFunction extends SourceFunction[Long] {
    var count = 0L
    var isRunning = true
    override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
      while (isRunning){
        sourceContext.collect(count)
        count +=1
        Thread.sleep(1000)
      }
    }
    override def cancel(): Unit = {
      isRunning = false
    }
  }
  val executorService:ExecutorService = Executors.newFixedThreadPool(30)
  class SampleAsyncFunction extends RichAsyncFunction[Long,String] {
    val failRatio = 0.001f
    val sleepFactor = 1000L
    val shutdownWaitTS = 20000L
    override def open(parameters: Configuration): Unit = {
      super.open(parameters)
    }
    override def close(): Unit = {
      super.close()
      ExecutorUtils.gracefulShutdown(shutdownWaitTS, TimeUnit.MILLISECONDS, executorService)
    }
    override def asyncInvoke(input: Long, resultFuture: ResultFuture[String]): Unit = {
      executorService.submit(new Runnable {
        override def run(): Unit = {
          val sleep = (ThreadLocalRandom.current().nextFloat() * sleepFactor).toLong
          try {
            Thread.sleep(sleep)
            if(ThreadLocalRandom.current().nextFloat() < failRatio){
              resultFuture.completeExceptionally(new Exception("lilili"))
            }else resultFuture.complete(List("key-" + input))
          }catch {
            case e:Exception=>{
              resultFuture.complete(List())
              e.printStackTrace()
            }
          }
        }
      })
    }
  }
}

References

http://wuchong.me/blog/2017/05/17/flink-internals-async-io/
