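In HDFS, uploading, opening, and reading a file all come down to interaction among three main classes: the client-side DFSClient, the NameNode, and the DataNode.
The flow for uploading a file to HDFS is as follows.
1. First, DistributedFileSystem.create is called. Its implementation is: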
    public FSDataOutputStream create(Path f, FsPermission permission,
        boolean overwrite,
        int bufferSize, short replication, long blockSize,
        Progressable progress) throws IOException {
      return new FSDataOutputStream(
          dfs.create(getPathName(f), permission,
                     overwrite, replication, blockSize, progress, bufferSize),
          statistics);
    }
2. Here dfs is of type DFSClient. Tracing into DFSClient leads to:
    public OutputStream create(String src, FsPermission permission,
        boolean overwrite, short replication, long blockSize,
        Progressable progress, int buffersize) throws IOException {}
The main call inside this function is:
    OutputStream result = new DFSOutputStream(src, masked, overwrite, replication, …);
Also in DFSClient, we find the constructor:
    DFSOutputStream(String src, FsPermission masked, boolean overwrite,
        boolean createParent, short replication, long blockSize, Progressable progress,
        int buffersize, int bytesPerChecksum) throws IOException {}
The constructor mainly makes these two calls:
    namenode.create(src, masked, clientName, overwrite, false, replication, blockSize);
    streamer.start(); // starts the streamer thread, which sets up a pipeline for writing the data
The namenode field refers to the NameNode (on the client side it is an RPC proxy for it), so next we trace the create() function in the NameNode class.
3. The create() function in the NameNode class:
    public void create(String src, FsPermission masked, String clientName,
        boolean overwrite, boolean createParent, short replication, long blockSize) throws IOException { }
It contains this important call:
    namesystem.startFile(src, new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(), null, masked), …);
namesystem is of type FSNamesystem. Following it into the FSNamesystem class, startFile delegates to:
    private synchronized void startFileInternal(String src,
        PermissionStatus permissions, String holder, String clientMachine, boolean overwrite,
        boolean append, short replication, long blockSize) throws IOException { }
This function contains:
    INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
        replication, blockSize, holder, clientMachine, clientNode, genstamp);
    // Creates a new file in the under-construction state; no data block is associated with it yet
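To make the effect of step 3 concrete, here is a minimal sketch of what "a file under construction with no blocks" means. It is a toy model, not NameNode code: ToyNamespace, FileEntry and the unchecked exception are hypothetical names chosen for illustration, and leases, permissions and the edit log are all left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for the NameNode namespace; every name here is hypothetical. */
final class ToyNamespace {
    /** A file entry; a new one starts under construction with no blocks. */
    static final class FileEntry {
        final String holder;          // client that holds the file open for writing
        final short replication;
        final long blockSize;
        boolean underConstruction = true;
        final List<Long> blocks = new ArrayList<>();

        FileEntry(String holder, short replication, long blockSize) {
            this.holder = holder;
            this.replication = replication;
            this.blockSize = blockSize;
        }
    }

    private final Map<String, FileEntry> files = new HashMap<>();

    /** Registers a new file; refuses to replace an existing path unless overwrite is set. */
    synchronized FileEntry startFile(String src, String holder, boolean overwrite,
                                     short replication, long blockSize) {
        if (files.containsKey(src) && !overwrite) {
            throw new IllegalStateException("file already exists: " + src);
        }
        FileEntry entry = new FileEntry(holder, replication, blockSize);
        files.put(src, entry);
        return entry;
    }
}
```

The point of the sketch is only that creating the file touches metadata: blocks are attached later, when the client asks for them.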
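4. The client then writes data into the newly created file, usually through the write function of FSDataOutputStream, which ultimately calls the writeChunk function of DFSOutputStream. DFSOutputStream is an inner class of DFSClient.
By design, HDFS writes block data in a pipeline: the data is split into packets, and if three replicas are needed they are written to DataNode1, DataNode2 and DataNode3 as follows:
First, packet1 is written to DataNode1.
Then DataNode1 forwards packet1 to DataNode2, while the client can already write packet2 to DataNode1.
Then DataNode2 forwards packet1 to DataNode3, while the client can write packet3 to DataNode1 and DataNode1 forwards packet2 to DataNode2.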
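The overlap in the three steps above can be tabulated with a small simulation. This is a sketch under a simplifying assumption that every hop takes exactly one time step; PipelineSchedule and schedule are hypothetical names, not part of HDFS.

```java
/** Toy model of pipelined replication; not HDFS code. */
final class PipelineSchedule {
    /**
     * Entry [t][d] is the 1-based packet number arriving at DataNode d+1
     * during step t+1, or 0 if that node receives nothing in that step.
     */
    static int[][] schedule(int packets, int replicas) {
        int steps = packets + replicas - 1;
        int[][] table = new int[steps][replicas];
        for (int t = 0; t < steps; t++) {
            for (int d = 0; d < replicas; d++) {
                int p = t - d; // packet index that reaches node d at step t
                table[t][d] = (p >= 0 && p < packets) ? p + 1 : 0;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[][] table = schedule(3, 3);
        for (int t = 0; t < table.length; t++) {
            StringBuilder line = new StringBuilder("step " + (t + 1) + ":");
            for (int d = 0; d < table[t].length; d++) {
                if (table[t][d] != 0) {
                    line.append(" packet").append(table[t][d])
                        .append("->DataNode").append(d + 1);
                }
            }
            System.out.println(line);
        }
    }
}
```

With three packets and three replicas the transfer finishes in 5 steps instead of the 9 that strictly sequential copying would need, which is the reason for the pipeline design.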
The writeChunk function itself looks like this:
    protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException {
      …
      synchronized (dataQueue) {
        // If queue is full, then wait till we can create enough space
        while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) {
          try {
            dataQueue.wait(); // wait
          } catch (InterruptedException e) {
          }
        }
        isClosed();
        if (currentPacket == null) {
          currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock);
          if (LOG.isDebugEnabled()) {
            LOG.debug("DFSClient writeChunk allocating new packet seqno=" + currentPacket.seqno);
          }
        }
        currentPacket.writeChecksum(checksum, 0, cklen);
        currentPacket.writeData(b, offset, len);
        currentPacket.numChunks++;
        bytesCurBlock += len;
        // If packet is full, enqueue it for transmission
        if (currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("DFSClient writeChunk packet full seqno=" + currentPacket.seqno + ", src=" + src +
                      ", bytesCurBlock=" + bytesCurBlock +
                      ", blockSize=" + blockSize +
                      ", appendChunk=" + appendChunk);
          }
          // if we allocated a new packet because we encountered a block boundary, reset bytesCurBlock.
          if (bytesCurBlock == blockSize) {
            currentPacket.lastPacketInBlock = true;
            bytesCurBlock = 0;
            lastFlushOffset = 0;
          }
          enqueueCurrentPacket();
          // If this was the first write after reopening a file, then the above write
          // filled up any partial chunk. Tell the summer to generate full crc chunks from now on.
          if (appendChunk) {
            appendChunk = false;
            resetChecksumChunk(bytesPerChecksum);
          }
          int psize = Math.min((int)(blockSize - bytesCurBlock), writePacketSize);
          computePacketChunkSize(psize, bytesPerChecksum);
        }
      }
    }
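The packing rule in writeChunk (enqueue a packet when it holds its maximum number of chunks, or when the block boundary is reached) can be isolated in a small single-threaded sketch. ChunkPacker is a hypothetical class; it keeps the field names from the excerpt but drops checksums, the ackQueue and the blocking wait on a full queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy version of the chunk-to-packet logic; hypothetical, not DFSClient code. */
final class ChunkPacker {
    static final class Packet {
        final int maxChunks;
        int numChunks;
        boolean lastPacketInBlock;

        Packet(int maxChunks) {
            this.maxChunks = maxChunks;
        }
    }

    final Deque<Packet> dataQueue = new ArrayDeque<>();
    private final int chunksPerPacket;
    private final long blockSize;
    private long bytesCurBlock;
    private Packet currentPacket;

    ChunkPacker(int chunksPerPacket, long blockSize) {
        this.chunksPerPacket = chunksPerPacket;
        this.blockSize = blockSize;
    }

    /** Adds one chunk of len bytes to the current packet, enqueueing it when full. */
    void writeChunk(int len) {
        if (currentPacket == null) {
            currentPacket = new Packet(chunksPerPacket);
        }
        currentPacket.numChunks++;
        bytesCurBlock += len;
        // Packet full or block boundary reached: hand the packet over for transmission.
        if (currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) {
            if (bytesCurBlock == blockSize) {
                currentPacket.lastPacketInBlock = true;
                bytesCurBlock = 0; // the next chunk starts a new block
            }
            dataQueue.addLast(currentPacket);
            currentPacket = null;
        }
    }
}
```

For example, with 2 chunks per packet and a block of five 512-byte chunks, five calls to writeChunk(512) produce three packets; the last one carries a single chunk and is flagged as the last packet in the block.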
Now back to the streamer.start() call mentioned earlier. streamer is of type DataStreamer, which is also an inner class of DFSClient.
Its method public void run() {…} makes the following calls, among others:
    blockStream.write(buf.array(), buf.position(), buf.remaining());
    // uses the write stream created earlier to write data into the block on the DataNode
    blockStream.writeInt(0); // marks the end of the write; blockStream is of type DataOutputStream
    nodes = nextBlockOutputStream(src); // the NameNode allocates a block, and a write stream pointing to that block is created
In the DataStreamer class: private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException { }
It contains:
    lb = locateFollowingBlock(startTime);
    // the NameNode allocates DataNodes and a block for the file
Tracing further, the inner class DFSOutputStream has:
    private LocatedBlock locateFollowingBlock(long start,
        DatanodeInfo[] excludedNodes) throws IOException {…}
The most important line in it is:
    return namenode.addBlock(src, clientName, excludedNodes);
5. Trace the addBlock(…) function in the NameNode class:
    public LocatedBlock addBlock(String src,
        String clientName) throws IOException {…}
It contains:
    LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);
    return locatedBlock;
This in turn leads to getAdditionalBlock in the FSNamesystem class:
    public LocatedBlock getAdditionalBlock(String src, String clientName, List<Node> excludedNodes) throws IOException {…}
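The role of excludedNodes in step 5 can be shown with a deliberately naive allocator. This is only a sketch: ToyBlockAllocator and its first-fit selection are assumptions made for illustration, whereas the real NameNode applies its own replica placement policy that is not covered in this walkthrough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy block allocator; hypothetical names, first-fit instead of real placement. */
final class ToyBlockAllocator {
    static final class LocatedBlock {
        final long blockId;
        final List<String> targets; // DataNodes forming the write pipeline, in order

        LocatedBlock(long blockId, List<String> targets) {
            this.blockId = blockId;
            this.targets = targets;
        }
    }

    private final List<String> liveNodes;
    private long nextBlockId = 1;

    ToyBlockAllocator(List<String> liveNodes) {
        this.liveNodes = new ArrayList<>(liveNodes);
    }

    /** Picks up to `replication` live nodes, skipping the ones the client excluded. */
    synchronized LocatedBlock addBlock(int replication, Set<String> excludedNodes) {
        List<String> targets = new ArrayList<>();
        for (String node : liveNodes) {
            if (targets.size() == replication) {
                break;
            }
            if (!excludedNodes.contains(node)) {
                targets.add(node);
            }
        }
        if (targets.isEmpty()) {
            throw new IllegalStateException("no DataNode available");
        }
        return new LocatedBlock(nextBlockId++, targets);
    }
}
```

The returned target list is what the client later walks through when it opens the pipeline: it connects to the first entry and passes the rest along.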
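6. Once DataNodes and a block have been allocated for the client, createBlockOutputStream in the inner class DFSOutputStream opens the connection used to write the data:
    private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) { … }
It contains: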
    {
      LOG.debug("Connecting to " + nodes[0].getName());
      // create a socket and connect to the DataNode
      InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
      s = socketFactory.createSocket();
      timeoutValue = 3000 * nodes.length + socketTimeout;
      NetUtils.connect(s, target, timeoutValue);
      s.setSoTimeout(timeoutValue);
      s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
      LOG.debug("Send buf size " + s.getSendBufferSize());
      long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                          datanodeWriteTimeout;
      // Xmit header info to datanode
      DataOutputStream out = new DataOutputStream(
          new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                                   DataNode.SMALL_BUFFER_SIZE));
      blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_WRITE_BLOCK );
      out.writeLong( block.getBlockId() );
      out.writeLong( block.getGenerationStamp() );
      out.writeInt( nodes.length );
      out.writeBoolean( recoveryFlag ); // recovery flag
      Text.writeString( out, client );
      out.writeBoolean(false); // Not sending src node information
      out.writeInt( nodes.length - 1 );
      for (int i = 1; i < nodes.length; i++) {
        nodes[i].write(out);
      }
      accessToken.write(out);
      checksum.writeHeader( out );
      out.flush();
      // receive ack for connect
      pipelineStatus = blockReplyStream.readShort();
      firstBadLink = Text.readString(blockReplyStream);
      if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
        if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
          throw new InvalidBlockTokenException(
              "Got access token error for connect ack with firstBadLink as "
              + firstBadLink);
        } else {
          throw new IOException("Bad connect ack with firstBadLink as "
              + firstBadLink);
        }
      }
      blockStream = out;
      result = true; // success
    }
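After the client has created the write stream in DataStreamer's run function, it calls blockStream.write to send the data to the DataNode.
7. Finally, the block has to be written to disk. According to some material I have read, this happens in the DataNode:
In the DataNode's DataXceiver, receiving the DataTransferProtocol.OP_WRITE_BLOCK opcode triggers a call to the writeBlock function.
How the DataXceiver class is tied to the DataNode class is still unclear to me, but DataXceiver does contain this writeBlock() function, and the comments in the source also say that it writes a block to disk: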
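The header exchange in step 6 can be tried out with plain java.io streams. The sketch below writes the leading fields in the order shown in the excerpt and reads them back the way a receiver would. It rests on several assumptions: the constant values are placeholders, writeUTF stands in for Hadoop's Text.writeString, node entries are plain strings rather than DatanodeInfo, and the access token and checksum header are omitted. WriteBlockHeaderDemo and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Round-trips a simplified write-block header; not the real wire format. */
final class WriteBlockHeaderDemo {
    static final short VERSION = 1;       // placeholder, not the real DATA_TRANSFER_VERSION
    static final byte OP_WRITE_BLOCK = 1; // placeholder opcode value

    /** Client side: nodes[0] is the node being connected to, the rest are forwarded. */
    static byte[] encode(long blockId, long genStamp, String client, String[] nodes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(VERSION);
            out.write(OP_WRITE_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeInt(nodes.length);     // size of the whole pipeline
            out.writeBoolean(false);        // recovery flag
            out.writeUTF(client);           // stand-in for Text.writeString
            out.writeBoolean(false);        // no source node information
            out.writeInt(nodes.length - 1); // number of downstream targets
            for (int i = 1; i < nodes.length; i++) {
                out.writeUTF(nodes[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Receiver side: skips the leading fields and returns the downstream targets. */
    static String[] decodeTargets(byte[] header) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            in.readShort();   // version
            in.readByte();    // opcode
            in.readLong();    // block id
            in.readLong();    // generation stamp
            in.readInt();     // pipeline size
            in.readBoolean(); // recovery flag
            in.readUTF();     // client name
            in.readBoolean(); // has source node info
            int numTargets = in.readInt();
            if (numTargets < 0) {
                throw new IOException("Mislabelled incoming datastream.");
            }
            String[] targets = new String[numTargets];
            for (int i = 0; i < numTargets; i++) {
                targets[i] = in.readUTF();
            }
            return targets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Encoding a header for the pipeline dn1, dn2, dn3 and decoding it yields the targets dn2 and dn3: the first node is reached by the socket itself, so only the remaining nodes travel in the header.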
    private void writeBlock(DataInputStream in) throws IOException {
      DatanodeInfo srcDataNode = null;
      LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() +
                " tcp nodelay " + s.getTcpNoDelay());
      // Read in the header
      Block block = new Block(in.readLong(),
          dataXceiverServer.estimateBlockSize, in.readLong());
      LOG.info("Receiving block " + block +
               " src: " + remoteAddress +
               " dest: " + localAddress);
      int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
      boolean isRecovery = in.readBoolean(); // is this part of recovery?
      String client = Text.readString(in); // working on behalf of this client
      boolean hasSrcDataNode = in.readBoolean(); // is src node info present
      if (hasSrcDataNode) {
        srcDataNode = new DatanodeInfo();
        srcDataNode.readFields(in);
      }
      int numTargets = in.readInt();
      if (numTargets < 0) {
        throw new IOException("Mislabelled incoming datastream.");
      }
      DatanodeInfo targets[] = new DatanodeInfo[numTargets];
      for (int i = 0; i < targets.length; i++) {
        DatanodeInfo tmp = new DatanodeInfo();
        tmp.readFields(in);
        targets[i] = tmp;
      }
      Token<BlockTokenIdentifier> accessToken = new Token<BlockTokenIdentifier>();
      accessToken.readFields(in);
      DataOutputStream replyOut = null; // stream to prev target
      replyOut = new DataOutputStream(
          NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
      if (datanode.isBlockTokenEnabled) {
        try {
          datanode.blockTokenSecretManager.checkAccess(accessToken, null, block,
              BlockTokenSecretManager.AccessMode.WRITE);
        } catch (InvalidToken e) {
          try {
            if (client.length() != 0) {
              replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN);
              Text.writeString(replyOut, datanode.dnRegistration.getName());
              replyOut.flush();
            }
            throw new IOException("Access token verification failed, for client "
                + remoteAddress + " for OP_WRITE_BLOCK for block " + block);
          } finally {
            IOUtils.closeStream(replyOut);
          }
        }
      }
      DataOutputStream mirrorOut = null;  // stream to next target
      DataInputStream mirrorIn = null;    // reply from next target
      Socket mirrorSock = null;           // socket to next target
      BlockReceiver blockReceiver = null; // responsible for data handling
      String mirrorNode = null;           // the name:port of next target
      String firstBadLink = "";           // first datanode that failed in connection setup
      short mirrorInStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
      try {
        // open a block receiver and check if the block does not exist
        blockReceiver = new BlockReceiver(block, in,
            s.getRemoteSocketAddress().toString(),
            s.getLocalSocketAddress().toString(),
            datanode);
        // Open network conn to backup machine, if appropriate
        if (targets.length > 0) {
          InetSocketAddress mirrorTarget = null;
          // Connect to backup machine
          mirrorNode = targets[0].getName();
          mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
          mirrorSock = datanode.newSocket();
          try {
            int timeoutValue = datanode.socketTimeout +
                (HdfsConstants.READ_TIMEOUT_EXTENSION * numTargets);
            int writeTimeout = datanode.socketWriteTimeout +
                (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
            NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
            mirrorSock.setSoTimeout(timeoutValue);
            mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
            mirrorOut = new DataOutputStream(
                new BufferedOutputStream(
                    NetUtils.getOutputStream(mirrorSock, writeTimeout),
                    SMALL_BUFFER_SIZE));
            mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
            // Write header: Copied from DFSClient.java!
            mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
            mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
            mirrorOut.writeLong( block.getBlockId() );
            mirrorOut.writeLong( block.getGenerationStamp() );
            mirrorOut.writeInt( pipelineSize );
            mirrorOut.writeBoolean( isRecovery );
            Text.writeString( mirrorOut, client );
            mirrorOut.writeBoolean(hasSrcDataNode);
            if (hasSrcDataNode) { // pass src node information
              srcDataNode.write(mirrorOut);
            }
            mirrorOut.writeInt( targets.length - 1 );
            for (int i = 1; i < targets.length; i++) {
              targets[i].write( mirrorOut );
            }
            accessToken.write(mirrorOut);
            blockReceiver.writeChecksumHeader(mirrorOut);
            mirrorOut.flush();
            // read connect ack (only for clients, not for replication req)
            if (client.length() != 0) {
              mirrorInStatus = mirrorIn.readShort();
              firstBadLink = Text.readString(mirrorIn);
              if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
                LOG.info("Datanode " + targets.length +
                         " got response for connect ack " +
                         " from downstream datanode with firstbadlink as " +
                         firstBadLink);
              }
            }
          } catch (IOException e) {
            if (client.length() != 0) {
              replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR);
              Text.writeString(replyOut, mirrorNode);
              replyOut.flush();
            }
            IOUtils.closeStream(mirrorOut);
            mirrorOut = null;
            IOUtils.closeStream(mirrorIn);
            mirrorIn = null;
            IOUtils.closeSocket(mirrorSock);
            mirrorSock = null;
            if (client.length() > 0) {
              throw e;
            } else {
              LOG.info(datanode.dnRegistration + ":Exception transfering block " +
                       block + " to mirror " + mirrorNode +
                       ". continuing without the mirror.\n" +
                       StringUtils.stringifyException(e));
            }
          }
        }
        // send connect ack back to source (only for clients)
        if (client.length() != 0) {
          if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
            LOG.info("Datanode " + targets.length +
                     " forwarding connect ack to upstream firstbadlink is " +
                     firstBadLink);
          }
          replyOut.writeShort(mirrorInStatus);
          Text.writeString(replyOut, firstBadLink);
          replyOut.flush();
        }
        // receive the block and mirror to the next target
        String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
        blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
                                   mirrorAddr, null, targets.length);
        // if this write is for a replication request (and not
        // from a client), then confirm block. For client-writes,
        // the block is finalized in the PacketResponder.
        if (client.length() == 0) {
          datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT);
          LOG.info("Received block " + block +
                   " src: " + remoteAddress +
                   " dest: " + localAddress +
                   " of size " + block.getNumBytes());
        }
        if (datanode.blockScanner != null) {
          datanode.blockScanner.addBlock(block);
        }
      } catch (IOException ioe) {
        LOG.info("writeBlock " + block + " received exception " + ioe);
        throw ioe;
      } finally {
        // close all opened streams
        IOUtils.closeStream(mirrorOut);
        IOUtils.closeStream(mirrorIn);
        IOUtils.closeStream(replyOut);
        IOUtils.closeSocket(mirrorSock);
        IOUtils.closeStream(blockReceiver);
      }
    }
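The connect-ack handling in writeBlock is easier to follow as a recursion: each DataNode tries to reach its first target, passes the remaining targets on, and relays the downstream answer upstream unchanged. The sketch below models only that control flow for client writes. ConnectAckDemo, Ack and the set of unreachable nodes are hypothetical, and the status values are placeholders rather than the real DataTransferProtocol constants.

```java
import java.util.List;
import java.util.Set;

/** Toy model of connect-ack propagation along the pipeline; not DataNode code. */
final class ConnectAckDemo {
    static final short SUCCESS = 0; // placeholder status values
    static final short ERROR = 1;

    static final class Ack {
        final short status;
        final String firstBadLink; // empty when the whole pipeline was set up

        Ack(short status, String firstBadLink) {
            this.status = status;
            this.firstBadLink = firstBadLink;
        }
    }

    /**
     * Called on a node that has already been reached. targets lists the nodes
     * still to be connected downstream of it.
     */
    static Ack setupPipeline(List<String> targets, Set<String> unreachable) {
        if (targets.isEmpty()) {
            return new Ack(SUCCESS, ""); // last node: nothing left to mirror to
        }
        String mirrorNode = targets.get(0);
        if (unreachable.contains(mirrorNode)) {
            // Connecting to the mirror failed: report it as the first bad link.
            return new Ack(ERROR, mirrorNode);
        }
        // The mirror repeats the same step with the remaining targets;
        // its ack is forwarded upstream as is.
        return setupPipeline(targets.subList(1, targets.size()), unreachable);
    }
}
```

If the first DataNode is handed the targets dn2 and dn3 and dn3 cannot be reached, the client ends up with an error ack whose firstBadLink is dn3, which tells it which node to leave out on a retry.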
Many other related classes are involved in this process and still need further analysis.