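1. DataTransferProtocol
DataTransferProtocol is the streaming interface used to write data to, or read data from, a DataNode (DN for short). It defines the following data transfer methods:
readBlock(): reads a data block from the current DN.
writeBlock(): writes a data block to the current DN and on through the DataNode pipeline.
transferBlock(): copies a data block from the current DN to another DN. Used to create a new replica when a data block is in an abnormal state.
copyBlock(): copies a data block from the current DN. Used for balancing and migration.
replaceBlock(): moves the source block to another DN and deletes the source block. Used for balancing (keeping storage utilization even across the cluster) and migration (moving files that violate their storage policy onto DNs that satisfy it).
blockChecksum(): gets the checksum of a data block.
blockGroupChecksum(): gets the checksum of a striped block group. Used to verify erasure-coded (EC) blocks.
requestShortCircuitShm(): requests a shared memory segment used for short-circuit reads.
requestShortCircuitFds(): requests the file descriptors of a data block for short-circuit reads.
releaseShortCircuitFds(): releases the file descriptors previously requested for a short-circuit read.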
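2. Method call flow in DataTransferProtocol
DataTransferProtocol is an interface with two implementing classes: Sender and Receiver.
Sender wraps the calling side of DataTransferProtocol and is used to issue streaming requests;
Receiver wraps the executing side of DataTransferProtocol and is used to respond to streaming requests.
When DFSClient issues a request such as readBlock(), it calls Sender, which serializes the request and sends it to the Receiver. The Receiver deserializes the request and performs the read. As shown in the figure below:
Note that this differs from an RPC call. An RPC call invokes a remote service in order to get the processed result back; a Sender request exists to make the remote DN perform the corresponding operation, so what matters is the operation itself.
In a real deployment, a DataXceiverServer object listens for the requests that clients send through Sender and creates a DataXceiver object to handle each one.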
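3. Sender
The Sender class issues DataTransferProtocol requests. It first serializes the arguments with ProtoBuf, then uses the enum Op to identify which method is being called, and finally sends the Op together with the serialized arguments to the receiving side.
The code below uses readBlock() as an example to show how Sender works: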
@Override
public void readBlock(final ExtendedBlock blk,
    final Token<BlockTokenIdentifier> blockToken,
    final String clientName,
    final long blockOffset,
    final long length,
    final boolean sendChecksum,
    final CachingStrategy cachingStrategy) throws IOException {
  // Serialize all arguments
  OpReadBlockProto proto = OpReadBlockProto.newBuilder()
      .setHeader(DataTransferProtoUtil.buildClientHeader(blk, clientName,
          blockToken))
      .setOffset(blockOffset)
      .setLen(length)
      .setSendChecksums(sendChecksum)
      .setCachingStrategy(getCachingStrategy(cachingStrategy))
      .build();
  // Send the serialized arguments together with the opcode
  send(out, Op.READ_BLOCK, proto);
}
private static void op(final DataOutput out, final Op op) throws IOException {
  out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
  op.write(out);
}
private static void send(final DataOutputStream out, final Op opcode,
    final Message proto) throws IOException {
  LOG.trace("Sending DataTransferOp {}: {}",
      proto.getClass().getSimpleName(), proto);
  // Write the version number and the opcode
  op(out, opcode);
  // Write the serialized arguments
  proto.writeDelimitedTo(out);
  out.flush();
}
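Op is an enum type that uses a byte field named code to hold the opcode. Each opcode corresponds to one method of the DataTransferProtocol interface; for example, opcode 80 corresponds to DataTransferProtocol.writeBlock().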
WRITE_BLOCK((byte)80),
READ_BLOCK((byte)81),
READ_METADATA((byte)82),
REPLACE_BLOCK((byte)83),
COPY_BLOCK((byte)84),
BLOCK_CHECKSUM((byte)85),
TRANSFER_BLOCK((byte)86),
REQUEST_SHORT_CIRCUIT_FDS((byte)87),
RELEASE_SHORT_CIRCUIT_FDS((byte)88),
REQUEST_SHORT_CIRCUIT_SHM((byte)89),
BLOCK_GROUP_CHECKSUM((byte)90),
CUSTOM((byte)127);
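To make the mapping between a byte code and an enum constant concrete, here is a minimal sketch. SketchOp, fromCode() and roundTrip() are hypothetical names invented for this illustration, and only three of the opcodes listed above are included; it is not the Hadoop Op class itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for Op: each constant carries a one-byte code. */
enum SketchOp {
  WRITE_BLOCK((byte) 80),
  READ_BLOCK((byte) 81),
  COPY_BLOCK((byte) 84);

  final byte code;

  SketchOp(byte code) {
    this.code = code;
  }

  /** Maps a raw byte back to its constant; rejects codes this sketch does not know. */
  static SketchOp fromCode(byte code) {
    for (SketchOp op : values()) {
      if (op.code == code) {
        return op;
      }
    }
    throw new IllegalArgumentException("Unknown op code " + code);
  }

  /** Writes the single opcode byte to the stream. */
  void write(DataOutputStream out) throws IOException {
    out.writeByte(code);
  }

  /** Reads a single opcode byte from the stream and resolves it. */
  static SketchOp read(DataInputStream in) throws IOException {
    return fromCode(in.readByte());
  }

  /** Sends one opcode through an in-memory stream and reads it back. */
  static SketchOp roundTrip(SketchOp op) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      op.write(new DataOutputStream(buf));
      return read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling SketchOp.roundTrip(SketchOp.READ_BLOCK) returns READ_BLOCK again, which is all the receiving side needs in order to decide which method to run.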
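When a readBlock() operation is issued through Sender, Sender sends the read request to the remote DataNode over an IO stream. When the DataNode receives the request, it calls the corresponding method of the Receiver class to execute readBlock().
Format of a read-block request:
First comes a short holding the DataTransferProtocol version number, then a byte holding the Op opcode, and finally the readBlock() request parameters serialized with ProtoBuf.
| short: DataTransferProtocol version | byte: Op opcode (OpCode) | serialized method parameters |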
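The layout above can be reproduced with plain java.io classes. The following is a rough sketch, not Hadoop code: RequestFrameSketch and its methods are hypothetical, the payload is an arbitrary byte array standing in for the serialized ProtoBuf message, and the varint length prefix imitates what ProtoBuf's writeDelimitedTo() puts in front of a message. The version value 28 is the one this article quotes for Hadoop 3.2.0.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RequestFrameSketch {
  static final short VERSION = 28;    // DATA_TRANSFER_VERSION as quoted in this article
  static final byte READ_BLOCK = 81;  // opcode taken from the Op listing

  /** Writes an unsigned varint: 7 bits per byte, high bit set on all but the last byte. */
  static void writeVarint(DataOutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }

  /** Builds version + opcode + length-prefixed payload as one byte array. */
  static byte[] frame(byte opcode, byte[] payload) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeShort(VERSION);           // 2 bytes, big-endian
      out.writeByte(opcode);             // 1 byte
      writeVarint(out, payload.length);  // length prefix of the payload
      out.write(payload);                // stand-in for the serialized parameters
      out.flush();
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

For a three-byte payload, frame(READ_BLOCK, payload) yields seven bytes: two for the version, one for the opcode, one for the length prefix, and three for the payload.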
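4. Receiver
Receiver is an abstract class. It provides readOp(), which parses the opcode of a Sender request, and processOp(), which handles the request.
Receiver wraps the executing side of DataTransferProtocol and runs the streaming requests issued by remote nodes. Its subclass DataXceiver is what actually implements the DataTransferProtocol methods.
readOp() reads the version number, validates it, and returns the opcode.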
/** Read an Op. It also checks protocol version. */
protected final Op readOp() throws IOException {
  // Read the DataTransferProtocol version from the stream and compare it with the current version
  final short version = in.readShort();
  // Version check; in Hadoop 3.2.0 the version is 28
  if (version != DataTransferProtocol.DATA_TRANSFER_VERSION) {
    throw new IOException( "Version Mismatch (Expected: " +
        DataTransferProtocol.DATA_TRANSFER_VERSION +
        ", Received: " + version + " )");
  }
  // Then read the Op from the stream and return it
  return Op.read(in);
}
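The version check can be tried out in isolation. The sketch below is an illustration under assumptions: HeaderReaderSketch, readOpcode() and parse() are hypothetical names, the opcode is returned as a raw byte rather than an Op constant, and the expected version is hard-coded to the value 28 mentioned in the comment above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class HeaderReaderSketch {
  static final short EXPECTED_VERSION = 28; // value quoted above for Hadoop 3.2.0

  /** Reads the 2-byte version, rejects a mismatch, then returns the opcode byte. */
  static byte readOpcode(DataInputStream in) throws IOException {
    short version = in.readShort();
    if (version != EXPECTED_VERSION) {
      throw new IOException("Version Mismatch (Expected: " + EXPECTED_VERSION
          + ", Received: " + version + ")");
    }
    return in.readByte();
  }

  /** Convenience wrapper: returns the opcode, or -1 if the header is rejected or truncated. */
  static int parse(byte[] header) {
    try {
      return readOpcode(new DataInputStream(new ByteArrayInputStream(header)));
    } catch (IOException e) {
      return -1;
    }
  }
}
```

Because the version is read before anything else, a peer speaking a different protocol version is rejected before any opcode or payload is interpreted.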
processOp() takes the Op parsed by readOp() as its argument. It is called repeatedly in a loop inside DataXceiver.run() and invokes the method that matches each opcode.
// Dispatch to a different handler for each opcode
protected final void processOp(Op op) throws IOException {
  switch(op) {
  case READ_BLOCK:
    opReadBlock();
    break;
  case WRITE_BLOCK:
    opWriteBlock(in);
    break;
  case REPLACE_BLOCK:
    opReplaceBlock(in);
    break;
  case COPY_BLOCK:
    opCopyBlock(in);
    break;
  case BLOCK_CHECKSUM:
    opBlockChecksum(in);
    break;
  case BLOCK_GROUP_CHECKSUM:
    opStripedBlockChecksum(in);
    break;
  case TRANSFER_BLOCK:
    opTransferBlock(in);
    break;
  case REQUEST_SHORT_CIRCUIT_FDS:
    opRequestShortCircuitFds(in);
    break;
  case RELEASE_SHORT_CIRCUIT_FDS:
    opReleaseShortCircuitFds(in);
    break;
  case REQUEST_SHORT_CIRCUIT_SHM:
    opRequestShortCircuitShm(in);
    break;
  default:
    throw new IOException("Unknown op " + op + " in data stream");
  }
}
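The split between an abstract class that dispatches and a subclass that does the work can be shown in a few lines. This is only a sketch of the pattern: ReceiverSketch and XceiverSketch are hypothetical classes, only two operations are modeled, and the handlers record a string instead of touching any block data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstract dispatcher: owns the switch, leaves the operations abstract. */
abstract class ReceiverSketch {
  enum Op { READ_BLOCK, WRITE_BLOCK }

  /** Routes an opcode to the matching abstract operation. */
  final void processOp(Op op) {
    switch (op) {
      case READ_BLOCK:
        readBlock();
        break;
      case WRITE_BLOCK:
        writeBlock();
        break;
      default:
        throw new IllegalArgumentException("Unknown op " + op);
    }
  }

  abstract void readBlock();

  abstract void writeBlock();
}

/** Hypothetical concrete subclass: supplies the behavior for each operation. */
final class XceiverSketch extends ReceiverSketch {
  final List<String> log = new ArrayList<>();

  @Override
  void readBlock() {
    log.add("read");
  }

  @Override
  void writeBlock() {
    log.add("write");
  }
}
```

With this arrangement the decoding and dispatch logic lives in one place, while the subclass only has to say what each operation does.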
Take opReadBlock() as an example: it parses and deserializes the parameters, then passes them to readBlock(). The main code is as follows:
private void opReadBlock() throws IOException {
  OpReadBlockProto proto = OpReadBlockProto.parseFrom(vintPrefixed(in));
  TraceScope traceScope = continueTraceSpan(proto.getHeader(),
      proto.getClass().getSimpleName());
  try {
    readBlock(PBHelperClient.convert(proto.getHeader().getBaseHeader().getBlock()),
        PBHelperClient.convert(proto.getHeader().getBaseHeader().getToken()),
        proto.getHeader().getClientName(),
        proto.getOffset(),
        proto.getLen(),
        proto.getSendChecksums(),
        (proto.hasCachingStrategy() ?
            getCachingStrategy(proto.getCachingStrategy()) :
            CachingStrategy.newDefaultStrategy()));
  } finally {
    if (traceScope != null) traceScope.close();
  }
}
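The call to vintPrefixed(in) is the counterpart of writeDelimitedTo() on the sending side: the message is preceded by a varint that gives its length. The sketch below shows that decoding step under assumptions: VarintPrefixSketch and its methods are hypothetical, and it returns the raw payload bytes instead of handing a bounded stream to a ProtoBuf parser as the real helper presumably does.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class VarintPrefixSketch {
  /** Reads an unsigned varint: 7 payload bits per byte, high bit means "more bytes follow". */
  static int readVarint(InputStream in) throws IOException {
    int result = 0;
    for (int shift = 0; shift < 32; shift += 7) {
      int b = in.read();
      if (b < 0) {
        throw new EOFException("Stream ended inside the length prefix");
      }
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
    }
    throw new IOException("Malformed varint");
  }

  /** Decodes only the length prefix at the start of the given bytes. */
  static int decodeLength(byte[] wire) {
    try {
      return readVarint(new ByteArrayInputStream(wire));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads the length prefix, then exactly that many payload bytes; trailing bytes are left alone. */
  static byte[] readDelimited(byte[] wire) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
      byte[] payload = new byte[readVarint(in)];
      in.readFully(payload);
      return payload;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Reading exactly the prefixed number of bytes is what lets several requests follow one another on the same connection without the parser running into the next request.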
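5. DataXceiver
DataXceiver is mainly responsible for responding to streaming requests. It is created by DataXceiverServer, and its run() method reads streaming requests in a loop until the connection is closed or the keepalive timeout expires. The run() method is as follows: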
@Override
public void run() {
  int opsProcessed = 0;
  Op op = null;
  try {
    synchronized(this) {
      xceiver = Thread.currentThread();
    }
    dataXceiverServer.addPeer(peer, Thread.currentThread(), this);
    peer.setWriteTimeout(datanode.getDnConf().socketWriteTimeout);
    InputStream input = socketIn; // get the underlying input stream
    try {
      IOStreamPair saslStreams = datanode.saslServer.receive(peer, socketOut,
          socketIn, datanode.getXferAddress().getPort(),
          datanode.getDatanodeId());
      input = new BufferedInputStream(saslStreams.in,
          smallBufferSize); // wrap the input stream in a buffer
      socketOut = saslStreams.out; // get the underlying output stream
    } catch (InvalidMagicNumberException imne) {
      if (imne.isHandshake4Encryption()) {
        LOG.info("Failed to read expected encryption handshake from client " +
            "at {}. Perhaps the client " +
            "is running an older version of Hadoop which does not support " +
            "encryption", peer.getRemoteAddressString(), imne);
      } else {
        LOG.info("Failed to read expected SASL data transfer protection " +
            "handshake from client at {}" +
            ". Perhaps the client is running an older version of Hadoop " +
            "which does not support SASL data transfer protection",
            peer.getRemoteAddressString(), imne);
      }
      return;
    }
    super.initialize(new DataInputStream(input)); // initialize the Receiver with the input stream
    // We process requests in a loop, and stay around for a short timeout.
    // This optimistic behaviour allows the other end to reuse connections.
    // Setting keepalive timeout to 0 disable this behavior.
    do {
      updateCurrentThreadName("Waiting for operation #" + (opsProcessed + 1));
      try {
        if (opsProcessed != 0) {
          assert dnConf.socketKeepaliveTimeout > 0;
          peer.setReadTimeout(dnConf.socketKeepaliveTimeout);
        } else {
          peer.setReadTimeout(dnConf.socketTimeout);
        }
        op = readOp(); // parse the opcode
      } catch (InterruptedIOException ignored) {
        // Time out while we wait for client rpc
        break;
      } catch (EOFException | ClosedChannelException e) {
        // Since we optimistically expect the next op, it is quite normal to
        // get EOF here.
        LOG.debug("Cached {} closing after {} ops. " +
            "This message is usually benign.", peer, opsProcessed);
        break;
      } catch (IOException err) {
        incrDatanodeNetworkErrors();
        throw err;
      }
      // restore normal timeout
      if (opsProcessed != 0) {
        peer.setReadTimeout(dnConf.socketTimeout);
      }
      opStartTime = monotonicNow();
      processOp(op); // handle the streaming request
      ++opsProcessed;
    } while ((peer != null) &&
        (!peer.isClosed() && dnConf.socketKeepaliveTimeout > 0));
  } catch (Throwable t) {
    String s = datanode.getDisplayName() + ":DataXceiver error processing "
        + ((op == null) ? "unknown" : op.name()) + " operation "
        + " src: " + remoteAddress + " dst: " + localAddress;
    if (op == Op.WRITE_BLOCK && t instanceof ReplicaAlreadyExistsException) {
      // For WRITE_BLOCK, it is okay if the replica already exists since
      // client and replication may write the same block to the same datanode
      // at the same time.
      if (LOG.isTraceEnabled()) {
        LOG.trace(s, t);
      } else {
        LOG.info("{}; {}", s, t.toString());
      }
    } else if (op == Op.READ_BLOCK && t instanceof SocketTimeoutException) {
      String s1 =
          "Likely the client has stopped reading, disconnecting it";
      s1 += " (" + s + ")";
      if (LOG.isTraceEnabled()) {
        LOG.trace(s1, t);
      } else {
        LOG.info("{}; {}", s1, t.toString());
      }
    } else if (t instanceof InvalidToken ||
        t.getCause() instanceof InvalidToken) {
      // The InvalidToken exception has already been logged in
      // checkAccess() method and this is not a server error.
      LOG.trace(s, t);
    } else {
      LOG.error(s, t);
    }
  } finally {
    collectThreadLocalStates();
    LOG.debug("{}:Number of active connections is: {}",
        datanode.getDisplayName(), datanode.getXceiverCount());
    updateCurrentThreadName("Cleaning up");
    if (peer != null) {
      dataXceiverServer.closePeer(peer);
      IOUtils.closeStream(in);
    }
  }
}
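Stripped of timeouts, SASL negotiation and logging, the heart of run() is a loop that keeps reading request headers from one connection and treats end-of-stream as a normal exit. The following sketch illustrates only that idea: KeepAliveLoopSketch and serve() are hypothetical, the connection is simulated with a byte array, and each request consists of just the version and the opcode with no parameters.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class KeepAliveLoopSketch {
  static final short VERSION = 28; // value this article quotes for Hadoop 3.2.0

  /** Handles back-to-back requests on one stream and returns the opcodes it saw. */
  static List<Integer> serve(byte[] connectionBytes) {
    List<Integer> handled = new ArrayList<>();
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(connectionBytes));
    while (true) {
      try {
        short version = in.readShort();    // start of the next request
        if (version != VERSION) {
          throw new IllegalStateException("Version mismatch: " + version);
        }
        handled.add((int) in.readByte());  // stand-in for processOp(op)
      } catch (EOFException e) {
        break;                             // peer closed the connection: normal exit
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return handled;
  }
}
```

Feeding it two headers in a row returns both opcodes, which mirrors how a client can reuse one connection for several operations.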
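That completes the walkthrough of how DataXceiver responds to streaming requests. The methods that actually implement the DataTransferProtocol interface are more complex and will be analyzed in a separate article.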