Hadoop RPC consists of four parts

  • Serialization layer: converts structured objects into byte streams so they can be sent over the network or written to persistent storage. In the RPC framework it is mainly used to turn the parameters of a user request, or the response, into byte streams for cross-machine transfer.
  • Function-call layer: locates the function to be invoked and executes it; Hadoop RPC implements function calls using Java reflection and dynamic proxies (see the sketch after this list).
  • Network transport layer: describes how messages travel between Client and Server; Hadoop RPC uses a TCP/IP-based Socket mechanism.
  • Server-side processing framework: can be abstracted as a network I/O model; it describes how client and server exchange information, and its design directly determines the server's concurrent processing capacity. Hadoop RPC adopts an event-driven I/O model based on the Reactor design pattern.
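To make the function-call layer concrete, here is a minimal self-contained sketch (not Hadoop's actual code) of the dynamic-proxy mechanism it relies on: java.lang.reflect.Proxy intercepts a seemingly local call so that the handler can, in a real RPC client, serialize the method name and arguments and ship them over the network. The CalculatorProtocol interface and the printing handler are invented for illustration.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical interface; in Hadoop this would be an RPC protocol.
interface CalculatorProtocol {
    int add(int a, int b);
}

public class DynamicProxyDemo {
    public static void main(String[] args) {
        CalculatorProtocol proxy = (CalculatorProtocol) Proxy.newProxyInstance(
            CalculatorProtocol.class.getClassLoader(),
            new Class<?>[]{CalculatorProtocol.class},
            new InvocationHandler() {
                @Override
                public Object invoke(Object p, Method method, Object[] a) {
                    // A real RPC client would serialize method.getName() and
                    // the arguments here and send them to the server; we just
                    // compute the result locally to show the interception.
                    System.out.println("intercepted call: " + method.getName());
                    return (Integer) a[0] + (Integer) a[1];
                }
            });
        System.out.println(proxy.add(1, 2)); // looks like an ordinary local call
    }
}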

Hadoop RPC Framework Analysis

Basic RPC Concepts

RPC usually follows a client/server model: the requesting program is the client and the service provider is the server. A typical RPC framework consists of the following parts:

  • Communication module: implements the request-response protocol without processing the payload itself. The request-response protocol can be implemented synchronously or asynchronously: in synchronous mode the client blocks until the server's response arrives, while in asynchronous mode it does not block and simply waits for the server to notify it. Under high concurrency, the asynchronous mode can reduce access latency and improve bandwidth utilization.
  • Stub program: can be seen as a proxy that makes remote function calls transparent to user programs. On the client side it sends the request message to the server through the communication module and decodes the result when the server's response arrives; on the server side the stub decodes the parameters in the request message, invokes the corresponding service procedure, and encodes the return value into a response.
  • Dispatcher: receives request messages from the communication module and selects a stub to handle each one based on an identifier in the message; when client request concurrency is high, a thread pool is usually employed to improve throughput.
  • Client program / service procedure: the issuer of the request and the procedure being called. In a single-machine environment the client program could invoke the service procedure directly as a function call, but in a distributed environment network communication must be taken into account, so the communication module and stub are added (the stub preserving the transparency of the function call).

A single RPC request goes through the following steps from being sent to obtaining the result (a toy end-to-end sketch follows the list):

  1. The client program makes a local call to the system-generated stub.
  2. The stub packs the call information into a message according to the network communication module's requirements and hands it to the communication module, which sends it to the remote server.
  3. The remote server receives the message and passes it to the corresponding stub.
  4. The stub unpacks the message into the form the called procedure expects and invokes the corresponding function.
  5. The called function executes with the supplied parameters and returns its result to the stub.
  6. The stub packs the result into a message and passes it, stage by stage through the network communication module, back to the client program.
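The following toy sketch walks these six steps end to end. It is illustrative only (plain TCP plus tab-separated text instead of a real serialization layer), with TinyRpcDemo and its echo method invented for the example: the client-side echo() plays the stub, the socket is the communication module, and the string comparison on the server is a one-line dispatcher.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TinyRpcDemo {

    // Server side: steps 3-6 (receive, dispatch, execute, reply).
    static void serve(ServerSocket ss) throws IOException {
        try (Socket s = ss.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String[] msg = in.readLine().split("\t");   // unpack the request
            if (msg[0].equals("echo")) {                // dispatch by method name
                out.println("echoed: " + msg[1]);       // execute and reply
            } else {
                out.println("ERROR: unknown method " + msg[0]);
            }
        }
    }

    // Client-side stub: steps 1-2 (pack and send), then unpack the reply.
    static String echo(String host, int port, String arg) throws IOException {
        try (Socket s = new Socket(host, port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println("echo\t" + arg);                // send the message
            return in.readLine();                       // decode the response
        }
    }

    public static void main(String[] args) throws Exception {
        ServerSocket ss = new ServerSocket(0);          // any free port
        new Thread(() -> {
            try { serve(ss); } catch (IOException ignored) { }
        }).start();
        System.out.println(echo("localhost", ss.getLocalPort(), "hi"));
        ss.close();
    }
}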

Basic Hadoop RPC Framework

Using Hadoop RPC

Two interfaces

  • public static VersionedProtocol getProxy()/waitForProxy(): constructs a client-side proxy object used to send RPC requests to the server.
  • public static Server getServer(): constructs a server object for a protocol instance, used to handle requests sent by clients.

Steps for using Hadoop RPC

  • Step 1: Define the RPC protocol. The RPC protocol is the communication interface between client and server; it defines the service interface the server exposes.
/**
 * Superclass of all protocols that use Hadoop RPC.
 * Subclasses of this interface are also supposed to have a static final long versionID field,
 * i.e. every subclass must carry a version number.
 */
public interface VersionedProtocol {}

All custom RPC interfaces in Hadoop must extend the VersionedProtocol interface, which describes the protocol's version information.
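A hedged, minimal sketch of such a protocol (the EchoProtocol name, its versionID value, and the echo method are all invented for illustration):

import java.io.IOException;
import org.apache.hadoop.ipc.VersionedProtocol;

// Hypothetical protocol for illustration; real Hadoop protocols such as
// ClientProtocol follow exactly the same pattern.
public interface EchoProtocol extends VersionedProtocol {
    // The version number that client and server compare when connecting.
    public static final long versionID = 1L;

    // The remote procedure the server will expose.
    String echo(String msg) throws IOException;
}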
For a real example, consider ClientProtocol:

public interface ClientProtocol extends VersionedProtocol {
  /* 
   *Changing the versionID to 2L since the getTaskCompletionEvents method has
   *changed.
   *Changed to 4 since killTask(String,boolean) is added
   *Version 4: added jobtracker state to ClusterStatus
   *Version 5: max_tasks in ClusterStatus is replaced by
   * max_map_tasks and max_reduce_tasks for HADOOP-1274
   * Version 6: change the counters representation for HADOOP-2248
   * Version 7: added getAllJobs for HADOOP-2487
   * Version 8: change {job|task}id's to use corresponding objects rather that strings.
   * Version 9: change the counter representation for HADOOP-1915
   * Version 10: added getSystemDir for HADOOP-3135
   * Version 11: changed JobProfile to include the queue name for HADOOP-3698
   * Version 12: Added getCleanupTaskReports and 
   *             cleanupProgress to JobStatus as part of HADOOP-3150
   * Version 13: Added getJobQueueInfos and getJobQueueInfo(queue name)
   *             and getAllJobs(queue) as a part of HADOOP-3930
   * Version 14: Added setPriority for HADOOP-4124
   * Version 15: Added KILLED status to JobStatus as part of HADOOP-3924            
   * Version 16: Added getSetupTaskReports and 
   *             setupProgress to JobStatus as part of HADOOP-4261           
   * Version 17: getClusterStatus returns the amount of memory used by 
   *             the server. HADOOP-4435
   * Version 18: Added blacklisted trackers to the ClusterStatus 
   *             for HADOOP-4305
   * Version 19: Modified TaskReport to have TIP status and modified the
   *             method getClusterStatus() to take a boolean argument
   *             for HADOOP-4807
   * Version 20: Modified ClusterStatus to have the tasktracker expiry
   *             interval for HADOOP-4939
   * Version 21: Modified TaskID to be aware of the new TaskTypes                                 
   * Version 22: Added method getQueueAclsForCurrentUser to get queue acls info
   *             for a user
   * Version 23: Modified the JobQueueInfo class to inlucde queue state.
   *             Part of HADOOP-5913.  
   * Version 24: Modified ClusterStatus to include BlackListInfo class which 
   *             encapsulates reasons and report for blacklisted node.          
   * Version 25: Added fields to JobStatus for HADOOP-817.   
   * Version 26: Added properties to JobQueueInfo as part of MAPREDUCE-861.
   *              added new api's getRootQueues and
   *              getChildQueues(String queueName)
   * Version 27: Changed protocol to use new api objects. And the protocol is 
   *             renamed from JobSubmissionProtocol to ClientProtocol.
   * Version 28: Added getJobHistoryDir() as part of MAPREDUCE-975.
   * Version 29: Added reservedSlots, runningTasks and totalJobSubmissions
   *             to ClusterMetrics as part of MAPREDUCE-1048.
   * Version 30: Job submission files are uploaded to a staging area under
   *             user home dir. JobTracker reads the required files from the
   *             staging area using user credentials passed via the rpc.
   * Version 31: Added TokenStorage to submitJob      
   * Version 32: Added delegation tokens (add, renew, cancel)
   * Version 33: Added JobACLs to JobStatus as part of MAPREDUCE-1307
   * Version 34: Modified submitJob to use Credentials instead of TokenStorage.
   * Version 35: Added the method getQueueAdmins(queueName) as part of
   *             MAPREDUCE-1664.
   * Version 36: Added the method getJobTrackerStatus() as part of
   *             MAPREDUCE-2337.
   * Version 37: More efficient serialization format for framework counters
   *             (MAPREDUCE-901)
   * Version 38: Added getLogFilePath(JobID, TaskAttemptID) as part of 
   *             MAPREDUCE-3146
   */
  public static final long versionID = 37L;
}
  • Step 2: Implement the RPC protocol. A Hadoop RPC protocol is usually a Java interface, and the user needs to implement that interface.
    For example, Hadoop itself ships implementations of ClientProtocol; as a stand-in, a sketch of implementing the hypothetical EchoProtocol from step 1 is shown below.
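A hedged sketch, again using the invented EchoProtocol. One caveat: in Hadoop's actual source, VersionedProtocol is not empty as shown above; it declares at least getProtocolVersion(String, long), which every implementation must provide (newer versions add further methods such as getProtocolSignature).

import java.io.IOException;

// Hypothetical implementation of the EchoProtocol defined in step 1.
public class EchoProtocolImpl implements EchoProtocol {

    // The actual service procedure that clients invoke via RPC.
    @Override
    public String echo(String msg) throws IOException {
        return "echoed: " + msg;
    }

    // Required by VersionedProtocol: lets the client check that the server
    // speaks a compatible version of the protocol.
    @Override
    public long getProtocolVersion(String protocol, long clientVersion)
            throws IOException {
        return EchoProtocol.versionID;
    }
}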

  • Step 3: Construct and start the RPC Server. Use the static method getServer() to construct an RPC Server, then call start() to launch it.
server = RPC.getServer(new ClientProtocolImpl(), serverHost, serverPort, numHandlers, false, conf);
server.start();

Here serverHost and serverPort identify the server's host and listening port, and numHandlers is the number of threads the server uses to handle requests. At this point the server is in the listening state, waiting for client requests to arrive.

  • Step 4: Construct the RPC Client and send RPC requests. Use the static method getProxy() to construct a client-side proxy object and invoke methods on the remote server directly through that proxy, as in the sketch below.
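A hedged client sketch for the EchoProtocol from the earlier steps (the getProxy signature used here is the classic Hadoop 1.x one; the server address localhost:9000 is assumed and must match wherever the step 3 server was started):

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class EchoClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Build a proxy that implements EchoProtocol on the client side.
        EchoProtocol proxy = (EchoProtocol) RPC.getProxy(
                EchoProtocol.class,                       // protocol interface
                EchoProtocol.versionID,                   // client's version
                new InetSocketAddress("localhost", 9000), // assumed server address
                conf);

        // The call looks local but is executed on the remote server.
        System.out.println(proxy.echo("hello"));

        RPC.stopProxy(proxy);                             // release the connection
    }
}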

To be continued.
This part is still a bit fuzzy to me; I'll come back to it once I have a concrete use case~