Differences between Flink and Flink CDC; Flink ORC

Reprint

mob64ca140eb362 2024-05-04 09:30:40


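The underlying RPC framework is implemented on top of Akka

Introduction to Akka

Akka is a framework for building concurrent, fault-tolerant, and scalable applications. It is an implementation of the Actor Model and closely resembles Erlang's concurrency model. In the actor model, every entity is an independent actor, and actors communicate with one another by sending asynchronous messages. The strength of the model comes from this asynchrony. An actor can also explicitly wait for a response, which makes synchronous operations possible, but synchronous messaging is strongly discouraged because it limits the scalability of the system. Each actor has a mailbox in which the messages it receives are stored, and each actor maintains its own isolated state.
An actor processes its mailbox on a single thread at a time: it keeps polling messages from the mailbox and handles them one after another. As a result of processing a message, an actor can change its internal state, send new messages, or spawn new actors. Although a single actor is inherently sequential, a system made up of many actors is highly concurrent and scalable, because the processing threads are shared among all actors. This is also why blocking calls should not be made from an actor's thread: such a call blocks the thread and prevents it from processing messages for other actors.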
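CPU clock speeds can only be raised so far; the remaining way to improve performance is to add more cores. Exploiting multiple cores requires concurrent execution, but multithreading tends to introduce many problems and makes debugging considerably harder.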

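Why is the actor model a solution to concurrency problems?

The usual concern in concurrent programming is how to keep shared data consistent and correct.

In general, concurrent threads can communicate in two ways: shared data and message passing.

The biggest problem with shared-data concurrency is the data race, and dealing with locks is notoriously painful. Compared with shared data, the main advantage of message passing is that it does not produce data races. Message passing comes in two common flavors: channel-based and actor-based.
What sets actors apart is that an actor's state cannot be read or modified directly, and its methods cannot be called directly. An actor communicates with the outside world only through messages. Every actor has an address that represents it, and the only thing others can do with that address is send messages to it. Message passing is asynchronous. Every actor has a mailbox that receives and buffers the messages sent by other actors, and the actor works through them via the mailbox's queue. An actor processes only one message at a time; while it is processing, it can still receive (enqueue) new messages but does nothing else.
Actors are fully independent and can run their operations concurrently. Each actor is a computational entity that, in response to a received message, can: send a finite number of messages to other actors, create a finite number of new actors, and designate the behavior to use for the next message it receives. These three actions have no fixed order and may be carried out concurrently; the actor handles each message according to its content.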
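The mailbox idea described above can be sketched in plain Java without Akka. This is only an illustration under stated assumptions: `MiniActor`, `tell`, and `stopAndGet` are hypothetical names, and the actor gets a dedicated thread purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of the actor idea in plain Java; this is not Akka's API.
final class MiniActorDemo {

    /** An actor: private state plus a mailbox that exactly one thread drains. */
    static final class MiniActor {
        private static final Object STOP = new Object();   // poison pill that ends the loop
        private final BlockingQueue<Object> mailbox = new LinkedBlockingQueue<>();
        private final List<String> processed = new ArrayList<>(); // state, touched only by the worker
        private final Thread worker = new Thread(this::loop);

        MiniActor() {
            worker.start();
        }

        /** Asynchronous send: enqueue the message and return immediately. */
        void tell(Object message) {
            mailbox.add(message);
        }

        private void loop() {
            try {
                while (true) {
                    Object message = mailbox.take();        // waits while the mailbox is empty
                    if (message == STOP) {
                        return;
                    }
                    processed.add(String.valueOf(message)); // one message at a time, so no lock
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Stops the actor once the mailbox is drained and returns what it processed. */
        List<String> stopAndGet() {
            mailbox.add(STOP);
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return processed;
        }
    }

    static List<String> run() {
        MiniActor actor = new MiniActor();
        actor.tell("a");
        actor.tell("b");
        actor.tell("c");
        return actor.stopAndGet();
    }

    public static void main(String[] args) {
        System.out.println(run()); // [a, b, c]
    }
}
```

Callers never touch `processed` while the actor is running; they can only enqueue messages, which is what keeps the state free of data races.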

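An actor system contains a set of unprocessed tasks, and each task is identified by three attributes:

tag: distinguishes the task from the other tasks in the system
target: the address to which the communication is delivered
communication: the information available to the actor at the target address when it processes the task
For simplicity, a task can be regarded as a message: actors pass each other messages that carry the values of these three attributes.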

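The actor model has two ways of scheduling tasks: thread-based and event-based.

Thread-based scheduling
Each actor is assigned its own thread. When the actor tries to receive a message and its mailbox is empty, that thread blocks. Thread-based scheduling is simple to implement, but the number of threads is limited by the operating system, so current actor implementations generally do not use it.
Event-based scheduling
An event can be understood as the arrival of a task or message; only at that point is a thread assigned to the actor to execute the task.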
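Everything in a system can therefore be modeled as an actor:

An actor's input is the messages it receives
After receiving a message, the actor processes the task defined in it
After finishing the task, the actor can send messages to other actors
A large task can be broken down into smaller ones that multiple actors process concurrently, which reduces the overall completion time.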
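Event-based scheduling can be sketched as follows. This is a plain-Java approximation, not how Akka's dispatcher is actually written: `EventActor` is a hypothetical class that occupies a thread from a shared pool only while its mailbox has messages, and an atomic flag keeps two pool threads from running the same actor at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical sketch of event-based actor scheduling on a shared thread pool.
final class EventActorDemo {

    static final class EventActor<T> {
        private final Queue<T> mailbox = new ConcurrentLinkedQueue<>();
        private final AtomicBoolean scheduled = new AtomicBoolean(false);
        private final ExecutorService pool;
        private final Consumer<T> behavior;

        EventActor(ExecutorService pool, Consumer<T> behavior) {
            this.pool = pool;
            this.behavior = behavior;
        }

        /** The arrival of a message is the event that may schedule the actor. */
        void tell(T message) {
            mailbox.add(message);
            trySchedule();
        }

        private void trySchedule() {
            // Submit at most one drain task at a time for this actor.
            if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
                pool.execute(this::drain);
            }
        }

        private void drain() {
            try {
                T message;
                while ((message = mailbox.poll()) != null) {
                    behavior.accept(message); // messages of one actor are handled sequentially
                }
            } finally {
                scheduled.set(false);
                trySchedule();                // a message may have arrived after the last poll
            }
        }
    }

    /** Runs many actors on just two threads and returns the sum of their counters. */
    static int run(int actorCount, int messagesPerActor) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        int[] counters = new int[actorCount]; // counters[i] is the private state of actor i
        CountDownLatch done = new CountDownLatch(actorCount * messagesPerActor);
        List<EventActor<Integer>> actors = new ArrayList<>();
        for (int i = 0; i < actorCount; i++) {
            final int id = i;
            actors.add(new EventActor<>(pool, delta -> {
                counters[id] += delta;
                done.countDown();
            }));
        }
        for (int m = 0; m < messagesPerActor; m++) {
            for (EventActor<Integer> actor : actors) {
                actor.tell(1);
            }
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        int total = 0;
        for (int counter : counters) {
            total += counter;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(100, 50)); // 5000
    }
}
```

One hundred actors share two threads here, which is the point of the event-based approach: the number of actors is no longer tied to the number of threads.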

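Another benefit of the actor model is that it eliminates shared state: because an actor processes only one message at a time, it can safely manipulate its own state without any locking.
An actor consists of three parts: state, behavior, and mailbox.

State: the actor's variables. State is managed by the actor itself, which avoids locking and memory-atomicity issues in a concurrent environment.
Behavior: the computation logic of the actor, which changes the actor's state in response to the messages it receives.
Mailbox: the communication bridge between actors. Internally the mailbox is a FIFO message queue: senders put messages into it and the receiving actor takes messages out of it.
Although many actors run at the same time, a single actor processes messages strictly in sequence. If other actors send it several messages, it handles them one at a time. To process several messages in parallel, send them to several actors.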

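Creating an Akka system

The core of an Akka system is the ActorSystem and its actors. To build an Akka system, first create an ActorSystem and then use it to create actors. (Note: Akka does not allow you to instantiate an actor directly with new; actors can only be created or looked up through Akka's APIs, typically ActorSystem#actorOf and ActorContext#actorOf.) In addition, you can only communicate with an actor through its ActorRef, a reference that encapsulates the underlying actor instance so that outside code cannot modify its internal state at will.
Communicating with an actor

tell
tell sends a message to an actor asynchronously, without waiting for a response and without blocking the code that follows.
ask
When a response is needed from the actor, use ask. It wraps the reply in a scala.concurrent.Future, and the result is obtained through an asynchronous callback.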
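The difference between the two styles can be illustrated without Akka. In the sketch below, `CounterActor` is a hypothetical class, a single-thread executor stands in for the actor's mailbox thread, and `java.util.concurrent.CompletableFuture` stands in for the `scala.concurrent.Future` that Akka's ask returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an actor; tell and ask here only mimic the Akka patterns.
final class TellAskDemo {

    static final class CounterActor {
        private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
        private int count; // only ever touched from mainThread

        /** Fire and forget: no reply, and the caller is not blocked. */
        void tell(int delta) {
            mainThread.execute(() -> count += delta);
        }

        /** Request and reply: the answer arrives later through the returned future. */
        CompletableFuture<Integer> ask() {
            return CompletableFuture.supplyAsync(() -> count, mainThread);
        }

        void stop() {
            mainThread.shutdown();
        }
    }

    static int run() {
        CounterActor actor = new CounterActor();
        actor.tell(2);
        actor.tell(3);
        // Register a callback instead of waiting for the reply.
        CompletableFuture<Integer> reply = actor.ask().thenApply(n -> n * 10);
        int result = reply.join(); // blocking here only so the demo can return a value
        actor.stop();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run()); // 50
    }
}
```

Because the executor runs tasks in submission order, the ask sees both earlier tells, mirroring the FIFO mailbox.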
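RpcGateway
Flink's RPC protocol is defined through RpcGateway. As noted above, communicating with a remote actor requires its address (IP and port). In Flink-on-YARN mode, for example, the JobMaster starts its ActorSystem first, at a point where the TaskExecutor's container has not been allocated yet; to communicate with the TaskExecutor later, the TaskExecutor must supply its address.
RpcEndpoint
Each RpcEndpoint corresponds to a path (determined jointly by endpointId and the ActorSystem), and each path corresponds to one actor; the endpoint implements the RpcGateway interface. RpcEndpoint also defines methods such as runAsync(Runnable) and callAsync(Callable, Time) for running code inside the endpoint.
RpcService
The interface of the RPC service. Its main responsibilities are:

Start an RpcServer (actor) for a given RpcEndpoint;
Connect to an RpcServer at a given address and return an RpcGateway;
Schedule a Runnable or Callable, either immediately or after a delay;
Stop an RpcServer (actor) or the service itself.
In Flink its implementation class is AkkaRpcService.
AkkaRpcService
AkkaRpcService wraps an ActorSystem and keeps a mapping from ActorRef to RpcEndpoint. When an RpcEndpoint is constructed, the RpcServer for that endpoint is started: depending on the endpoint type (FencedRpcEndpoint or not), a different actor is created (FencedAkkaRpcActor or AkkaRpcActor), the RpcEndpoint and the actor's ActorRef are stored, and the RpcServer is then created as a dynamic proxy.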
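The dynamic-proxy step can be illustrated with the JDK's `java.lang.reflect.Proxy`. The sketch below is not Flink code: `GreeterGateway` and `Invocation` are hypothetical, and the handler merely records each call as a message where a real handler would send it to an actor.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a gateway interface backed by a JDK dynamic proxy.
final class GatewayProxyDemo {

    /** Hypothetical gateway; real Flink gateways extend RpcGateway. */
    interface GreeterGateway {
        void greet(String name);
        void shutDown();
    }

    /** What would travel to the actor: the method name plus its arguments. */
    static final class Invocation {
        final String methodName;
        final Object[] args;

        Invocation(String methodName, Object[] args) {
            this.methodName = methodName;
            this.args = (args == null) ? new Object[0] : args; // the proxy passes null for no-arg methods
        }

        @Override
        public String toString() {
            return methodName + Arrays.toString(args);
        }
    }

    static List<String> run() {
        List<String> sent = new ArrayList<>();
        // Every call on the proxy lands here instead of in a concrete implementation.
        InvocationHandler handler = (proxy, method, args) -> {
            sent.add(new Invocation(method.getName(), args).toString());
            return null; // both gateway methods are void
        };
        GreeterGateway gateway = (GreeterGateway) Proxy.newProxyInstance(
                GreeterGateway.class.getClassLoader(),
                new Class<?>[] {GreeterGateway.class},
                handler);
        gateway.greet("flink");
        gateway.shutDown();
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [greet[flink], shutDown[]]
    }
}
```

The caller only ever sees the interface, so whatever transport sits behind the handler stays an implementation detail.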

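Once the RpcServer has been started, the corresponding actor (note that the actor is still in the stopped state at this point) and the dynamic proxy object exist. RpcEndpoint#start must be called to start the actor. The startup sequence of an RpcEndpoint is as follows (taking a non-fenced RpcEndpoint as an example):

1. RpcEndpoint#start is called;
2. The call is delegated to RpcServer#start;
3. The dynamic proxy's AkkaInvocationHandler#invoke is called; it sees that the method is StartStoppable#start and makes a direct local method call;
4. AkkaInvocationHandler#start is called;
5. A message is sent to the corresponding actor via ActorRef#tell: rpcEndpoint.tell(ControlMessages.START, ActorRef.noSender());
6. AkkaRpcActor#handleControlMessage handles the control message;
7. The actor switches its own state to Started in the main thread.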
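The effect of the stopped and started states can be shown with a small state machine. This is a simplified, single-threaded sketch and not Flink's AkkaRpcActor: `EndpointActor` and `Control` are hypothetical names, and the assumption is that an actor that has not been started discards ordinary messages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an actor that only serves messages once it has been started.
final class StartStopDemo {

    enum Control { START, STOP }

    static final class EndpointActor {
        private boolean started; // a freshly created actor is in the stopped state
        private final List<String> log = new ArrayList<>();

        void receive(Object message) {
            if (message instanceof Control) {
                handleControlMessage((Control) message);
            } else if (started) {
                log.add("handled " + message);
            } else {
                log.add("dropped " + message); // not started yet, or already stopped
            }
        }

        private void handleControlMessage(Control control) {
            started = (control == Control.START);
            log.add(started ? "started" : "stopped");
        }

        List<String> log() {
            return log;
        }
    }

    static List<String> run() {
        EndpointActor actor = new EndpointActor();
        actor.receive("ping");        // arrives before START
        actor.receive(Control.START);
        actor.receive("ping");
        actor.receive(Control.STOP);
        actor.receive("ping");        // arrives after STOP
        return actor.log();
    }

    public static void main(String[] args) {
        System.out.println(run());
        // [dropped ping, started, handled ping, stopped, dropped ping]
    }
}
```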

Executing code
Code can be executed directly in the actor by calling methods such as runAsync/callAsync.
*AkkaInvocationHandler#invoke -> AkkaInvocationHandler#scheduleRunAsync;

*AkkaRpcActor#handleMessage -> AkkaRpcActor#handleRpcMessage, where handleRpcMessage is as follows:

protected void handleRpcMessage(Object message) {
		if (message instanceof RunAsync) {
			handleRunAsync((RunAsync) message);
		} else if (message instanceof CallAsync) {
			handleCallAsync((CallAsync) message);
		} else if (message instanceof RpcInvocation) {
			handleRpcInvocation((RpcInvocation) message);
		} else {
			log.warn(
				"Received message of unknown type {} with value {}. Dropping this message!",
				message.getClass().getName(),
				message);

			sendErrorIfSender(new AkkaUnknownMessageException("Received unknown message " + message +
				" of type " + message.getClass().getSimpleName() + '.'));
		}
	}

AkkaRpcActor#handleRunAsync, whose code is as follows:

private void handleRunAsync(RunAsync runAsync) {
	final long timeToRun = runAsync.getTimeNanos();
	final long delayNanos;

	if (timeToRun == 0 || (delayNanos = timeToRun - System.nanoTime()) <= 0) {
		// run immediately
		try {
			runAsync.getRunnable().run();
		} catch (Throwable t) {
			log.error("Caught exception while executing runnable in main thread.", t);
			ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
		}
	}
	else {
		// schedule for later. send a new message after the delay, which will then be immediately executed
		FiniteDuration delay = new FiniteDuration(delayNanos, TimeUnit.NANOSECONDS);
		RunAsync message = new RunAsync(runAsync.getRunnable(), timeToRun);

		final Object envelopedSelfMessage = envelopeSelfMessage(message);

		getContext().system().scheduler().scheduleOnce(delay, getSelf(), envelopedSelfMessage,
				getContext().dispatcher(), ActorRef.noSender());
	}
}

If the scheduled time has not yet arrived, the actor sends the message to itself again after the remaining delay.
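The run-now-or-redeliver-later logic can be reproduced with a `ScheduledExecutorService`. The sketch below makes some assumptions: `TimerActor` is hypothetical, its single scheduler thread plays the role of the actor's main thread, and the `RunAsync` class only borrows its name and shape from the message in the listing above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "run now, or deliver the message to yourself again later".
final class RunAsyncDemo {

    static final class RunAsync {
        final Runnable runnable;
        final long timeToRunNanos; // 0 means "run immediately"

        RunAsync(Runnable runnable, long timeToRunNanos) {
            this.runnable = runnable;
            this.timeToRunNanos = timeToRunNanos;
        }
    }

    static final class TimerActor {
        private final ScheduledExecutorService mainThread =
                Executors.newSingleThreadScheduledExecutor();

        void tell(RunAsync message) {
            mainThread.execute(() -> handleRunAsync(message));
        }

        private void handleRunAsync(RunAsync message) {
            long delayNanos;
            if (message.timeToRunNanos == 0
                    || (delayNanos = message.timeToRunNanos - System.nanoTime()) <= 0) {
                message.runnable.run(); // due: run in the main thread right away
            } else {
                // Not due yet: deliver the same message again once the delay has passed.
                mainThread.schedule(() -> handleRunAsync(message), delayNanos, TimeUnit.NANOSECONDS);
            }
        }

        void stop() {
            mainThread.shutdown();
        }
    }

    static List<String> run() {
        TimerActor actor = new TimerActor();
        List<String> order = new ArrayList<>(); // written only by the actor's thread
        CountDownLatch done = new CountDownLatch(2);
        long later = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(200);
        actor.tell(new RunAsync(() -> { order.add("delayed"); done.countDown(); }, later));
        actor.tell(new RunAsync(() -> { order.add("immediate"); done.countDown(); }, 0));
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        actor.stop();
        return order;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The delayed runnable is sent first but runs second, because its handler reschedules it instead of blocking the thread while it waits.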
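When a method not implemented by AkkaInvocationHandler itself is called, an RPC request is made.
The flow for handling an RPC call is analyzed below.

AkkaInvocationHandler#invokeRpc is as follows: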

private Object invokeRpc(Method method, Object[] args) throws Exception {
        // Extract the method's metadata
      String methodName = method.getName();
      Class<?>[] parameterTypes = method.getParameterTypes();
      Annotation[][] parameterAnnotations = method.getParameterAnnotations();
      Time futureTimeout = extractRpcTimeout(parameterAnnotations, args, timeout);
 
        // Create the RpcInvocation message (either LocalRpcInvocation or RemoteRpcInvocation)
      final RpcInvocation rpcInvocation = createRpcInvocationMessage(methodName, parameterTypes, args);

  Class<?> returnType = method.getReturnType();

  final Object result;

    // No return value: use tell
  if (Objects.equals(returnType, Void.TYPE)) {
      tell(rpcInvocation);

      result = null;
  } else {
      // execute an asynchronous call
        // A return value is expected: use ask
      CompletableFuture<?> resultFuture = ask(rpcInvocation, futureTimeout);

      CompletableFuture<?> completableFuture = resultFuture.thenApply((Object o) -> {
            // Deserialize the result once the call returns
          if (o instanceof SerializedValue) {
              try {
                  return  ((SerializedValue<?>) o).deserializeValue(getClass().getClassLoader());
              } catch (IOException | ClassNotFoundException e) {
                  throw new CompletionException(
                      new RpcException("Could not deserialize the serialized payload of RPC method : "
                          + methodName, e));
              }
          } else {
                // Return the value as is
              return o;
          }
      });

        // If the return type is CompletableFuture, hand the future back directly
      if (Objects.equals(returnType, CompletableFuture.class)) {
          result = completableFuture;
      } else {
          try {
                // Otherwise block until the CompletableFuture yields the result
              result = completableFuture.get(futureTimeout.getSize(), futureTimeout.getUnit());
          } catch (ExecutionException ee) {
              throw new RpcException("Failure while obtaining synchronous RPC result.", ExceptionUtils.stripExecutionException(ee));
          }
      }
  }

  return result;

}
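The three-way decision on the method's return type (void, CompletableFuture, or anything else) can be isolated in a small self-contained sketch. It is only an approximation of the idea: `CalcGateway` and `Server` are hypothetical, and a single-thread executor stands in for the target actor.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of choosing tell or ask based on the gateway method's return type.
final class ReturnTypeDemo {

    interface CalcGateway {
        void reset();                                  // void: fire and forget
        CompletableFuture<Integer> squareAsync(int x); // future: hand it back as is
        int square(int x);                             // plain value: wait for the reply
    }

    /** Stands in for the remote actor that executes the calls. */
    static final class Server {
        int resets;

        Object handle(String methodName, Object[] args) {
            if (methodName.equals("reset")) {
                resets++;
                return null;
            }
            int x = (Integer) args[0];
            return x * x;
        }
    }

    static String run() {
        Server server = new Server();
        ExecutorService actorThread = Executors.newSingleThreadExecutor();
        InvocationHandler handler = (proxy, method, args) -> {
            Class<?> returnType = method.getReturnType();
            if (returnType == void.class) {
                actorThread.execute(() -> server.handle(method.getName(), args)); // "tell"
                return null;
            }
            CompletableFuture<Object> reply = CompletableFuture.supplyAsync(
                    () -> server.handle(method.getName(), args), actorThread);     // "ask"
            if (returnType == CompletableFuture.class) {
                return reply;                      // the caller decides when to wait
            }
            return reply.get(1, TimeUnit.SECONDS); // synchronous result with a timeout
        };
        CalcGateway gateway = (CalcGateway) Proxy.newProxyInstance(
                CalcGateway.class.getClassLoader(), new Class<?>[] {CalcGateway.class}, handler);

        gateway.reset();
        int blocking = gateway.square(3);
        int viaFuture = gateway.squareAsync(4).join();
        actorThread.shutdown();
        return server.resets + "," + blocking + "," + viaFuture;
    }

    public static void main(String[] args) {
        System.out.println(run()); // 1,9,16
    }
}
```

Only the plain-value method blocks the caller, which is why gateway methods that return a CompletableFuture are the friendlier choice for asynchronous code.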

AkkaRpcActor#handleRpcInvocation, whose code is as follows:

private void handleRpcInvocation(RpcInvocation rpcInvocation) {
      Method rpcMethod = null;
 
      try {
          // Extract the method's metadata
          String methodName = rpcInvocation.getMethodName();
          Class<?>[] parameterTypes = rpcInvocation.getParameterTypes();
 
          // Look up the specified method on the RpcEndpoint
          rpcMethod = lookupRpcMethod(methodName, parameterTypes);
      } catch (ClassNotFoundException e) {
          log.error("Could not load method arguments.", e);
 
          // Report the failure back to the sender
          RpcConnectionException rpcException = new RpcConnectionException("Could not load method arguments.", e);
          getSender().tell(new Status.Failure(rpcException), getSelf());
      } catch (IOException e) {
          log.error("Could not deserialize rpc invocation message.", e);
          // Report the failure back to the sender
          RpcConnectionException rpcException = new RpcConnectionException("Could not deserialize rpc invocation message.", e);
          getSender().tell(new Status.Failure(rpcException), getSelf());
      } catch (final NoSuchMethodException e) {
          log.error("Could not find rpc method for rpc invocation.", e);
          // Report the failure back to the sender
          RpcConnectionException rpcException = new RpcConnectionException("Could not find rpc method for rpc invocation.", e);
          getSender().tell(new Status.Failure(rpcException), getSelf());
      }
 
      if (rpcMethod != null) {
          try {
              // this supports declaration of anonymous classes
              rpcMethod.setAccessible(true);
 
              // No return value: just invoke the method
              if (rpcMethod.getReturnType().equals(Void.TYPE)) {
                  // No return value to send back
                  rpcMethod.invoke(rpcEndpoint, rpcInvocation.getArgs());
              }
              else {
                  final Object result;
                  try {
                      result = rpcMethod.invoke(rpcEndpoint, rpcInvocation.getArgs());
                  }
                  catch (InvocationTargetException e) {
                      log.debug("Reporting back error thrown in remote procedure {}", rpcMethod, e);
 
                      // tell the sender about the failure
                      getSender().tell(new Status.Failure(e.getTargetException()), getSelf());
                      return;
                  }
 
                  final String methodName = rpcMethod.getName();
 
                  // The method returned a CompletableFuture
                  if (result instanceof CompletableFuture) {
                      final CompletableFuture<?> responseFuture = (CompletableFuture<?>) result;
                      // Send the result (uses Patterns to send it to the caller; the result is serialized and its size is validated)
                      sendAsyncResponse(responseFuture, methodName);
                  } else {
                      // Not a CompletableFuture: send the result (same path: Patterns, serialization, size validation)
                      sendSyncResponse(result, methodName);
                  }
              }
          } catch (Throwable e) {
              log.error("Error while executing remote procedure call {}.", rpcMethod, e);
              // tell the sender about the failure
              getSender().tell(new Status.Failure(e), getSelf());
          }
      }
  }
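The receiving side boils down to a reflective lookup followed by an invoke. The sketch below is a stripped-down illustration rather than Flink's implementation: `EchoEndpoint` and `dispatch` are hypothetical, and a failed CompletableFuture stands in for the Status.Failure message sent back to the caller.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of dispatching an invocation message to an endpoint via reflection.
final class ReflectiveDispatchDemo {

    static final class EchoEndpoint {
        public String echo(String text) {
            return "echo:" + text;
        }

        public int add(int a, int b) {
            return a + b;
        }
    }

    /** Looks the method up by name and parameter types, invokes it, and reports the outcome. */
    static CompletableFuture<Object> dispatch(
            Object endpoint, String methodName, Class<?>[] parameterTypes, Object[] args) {
        CompletableFuture<Object> reply = new CompletableFuture<>();
        try {
            Method method = endpoint.getClass().getMethod(methodName, parameterTypes);
            method.setAccessible(true); // the endpoint class itself is not public
            reply.complete(method.invoke(endpoint, args));
        } catch (NoSuchMethodException e) {
            reply.completeExceptionally(
                    new IllegalStateException("Could not find rpc method " + methodName, e));
        } catch (InvocationTargetException e) {
            reply.completeExceptionally(e.getTargetException()); // error thrown by the endpoint
        } catch (IllegalAccessException e) {
            reply.completeExceptionally(e);
        }
        return reply;
    }

    static String run() {
        EchoEndpoint endpoint = new EchoEndpoint();
        CompletableFuture<Object> ok = dispatch(
                endpoint, "add", new Class<?>[] {int.class, int.class}, new Object[] {1, 2});
        CompletableFuture<Object> missing = dispatch(
                endpoint, "unknown", new Class<?>[0], new Object[0]);
        return ok.join() + "," + missing.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println(run()); // 3,true
    }
}
```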

The result is returned to the caller, AkkaInvocationHandler#ask;

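These steps complete an RPC call (local or remote). As shown, the underlying communication still goes through Akka's tell/ask methods.

In older versions of Flink, each node implemented the FlinkActor interface, received actor messages, and ran different business logic depending on the message type. That approach coupled the business flow to the concrete communication component and made it hard to replace the component later (for example with Netty). Flink therefore introduced RPC calls: nodes call each other through gateways, which hides the details of the communication component and decouples the two.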

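Main RPC interfaces
RpcEndpoint
RpcService
RpcGateway
RpcEndpoint: the base class for remote procedure calls
RpcEndpoint is the base class of Flink's RPC endpoints. Every distributed component that offers remote procedure calls must extend RpcEndpoint, and its functionality is backed by an RpcService.

The subclasses of RpcEndpoint are essentially four components: Dispatcher, JobMaster, ResourceManager, and TaskExecutor. In other words, these are the components in Flink that need, and have, RPC capability.

These are also Flink's four major components, and communication among them relies on RPC.
RpcGateway: the gateway for RPC calls
The main sub-interfaces of RpcGateway are FencedRpcGateway and TaskExecutorGateway, and the gateways of Flink's four major components extend them in turn, so Dispatcher, JobMaster, ResourceManager, and TaskExecutor can each be called over RPC through their respective gateways.

RpcGateway is the RPC gateway interface for all RPC components and defines each component's RPC interface
Common examples are JobMasterGateway, DispatcherGateway, ResourceManagerGateway, and TaskExecutorGateway
Each component class holds, as member variables, the gateways of the other components it needs to talk to, which makes RPC calls convenient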
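The decoupling that gateways provide can be sketched as follows. All names are hypothetical (`WorkerGateway`, `Master`); the point is only that a component coded against a gateway interface does not change when the transport behind the gateway changes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: a component that depends only on another component's gateway.
final class GatewayDecouplingDemo {

    interface WorkerGateway {
        CompletableFuture<String> submitTask(String task);
    }

    /** Holds the gateway of the component it talks to as a member variable. */
    static final class Master {
        private final WorkerGateway worker;

        Master(WorkerGateway worker) {
            this.worker = worker;
        }

        String runJob(String job) {
            return worker.submitTask(job)
                    .thenApply(result -> "job " + job + " -> " + result)
                    .join();
        }
    }

    static String run() {
        // Transport 1: answers directly in the calling thread.
        WorkerGateway direct = task -> CompletableFuture.completedFuture("done(" + task + ")");
        // Transport 2: answers from another thread, as a remote call would.
        ExecutorService workerThread = Executors.newSingleThreadExecutor();
        WorkerGateway async =
                task -> CompletableFuture.supplyAsync(() -> "done(" + task + ")", workerThread);

        String first = new Master(direct).runJob("wordcount");
        String second = new Master(async).runJob("wordcount");
        workerThread.shutdown();
        return first.equals(second) ? first : "mismatch";
    }

    public static void main(String[] args) {
        System.out.println(run()); // job wordcount -> done(wordcount)
    }
}
```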
