Author | 吴邪
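This article walks through how HDFS reads data. Compared with the write path, the read path is considerably simpler. With this article, the analysis of the HDFS core code comes to a close. The series covered five core parts: NameNode initialization, DataNode initialization, metadata management, the HDFS write path, and the HDFS read path. HDFS is a code base on the order of a million lines with a great deal of content, so this series only picks its key, core features to analyze.
The HDFS read flow
Figure 1: Read flow diagram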
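- The HDFS client calls the open() method of DistributedFileSystem (an implementation of FileSystem).
- DistributedFileSystem talks to the NameNode over the NameNode RPC interface and calls getBlockLocations() to request where the file's blocks are stored.
- DistributedFileSystem returns an FSDataInputStream to the client. The stream carries the block metadata. The client calls read() on it, and the request goes to the nearest DataNode that holds the target block.
- Data is streamed from the DataNode to the client, so the client can call read() repeatedly. Each block is read from a nearby DataNode; if that DataNode turns out to be faulty, the client tries the next DataNode that holds the block, and so on until the last block has been read.
- After the file has been read, the client calls close() to close the stream and its connection to the DataNode.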
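The steps above can be pictured with a toy, in-memory model. This is only a sketch under loud assumptions: ToyReadFlow, store() and read() are hypothetical names invented for illustration, plain maps stand in for the NameNode metadata and the DataNodes, and there is no RPC, checksum, or failover. It is not Hadoop code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the read flow; none of these names exist in Hadoop.
class ToyReadFlow {
    // "NameNode" metadata: file path -> ordered block ids
    private final Map<String, List<String>> blocksOfFile = new HashMap<>();
    // "NameNode" metadata: block id -> DataNode names, nearest first
    private final Map<String, List<String>> replicas = new HashMap<>();
    // "DataNodes": node name -> (block id -> block bytes)
    private final Map<String, Map<String, byte[]>> dataNodes = new HashMap<>();

    // Registers one replica of one block of a file.
    void store(String path, String blockId, String node, byte[] data) {
        List<String> blocks = blocksOfFile.computeIfAbsent(path, k -> new ArrayList<>());
        if (!blocks.contains(blockId)) {
            blocks.add(blockId);
        }
        replicas.computeIfAbsent(blockId, k -> new ArrayList<>()).add(node);
        dataNodes.computeIfAbsent(node, k -> new HashMap<>()).put(blockId, data);
    }

    // Step 2: look up the block list. Steps 3 and 4: read block by block
    // from the first (nearest) replica. Step 5: return once the last block is read.
    byte[] read(String path) {
        List<String> blocks = blocksOfFile.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("Cannot open " + path);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String blockId : blocks) {
            String nearest = replicas.get(blockId).get(0);
            byte[] data = dataNodes.get(nearest).get(blockId);
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        ToyReadFlow fs = new ToyReadFlow();
        fs.store("/a.txt", "blk_1", "dn1", "hello ".getBytes(StandardCharsets.UTF_8));
        fs.store("/a.txt", "blk_2", "dn2", "hdfs".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(fs.read("/a.txt"), StandardCharsets.UTF_8)); // prints "hello hdfs"
    }
}
```

The point of the model is the division of labor: metadata lookups go to one place, and the bytes come from another.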
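As the flow shows, reading from HDFS follows broadly the same pattern as writing, only simpler. Next we turn to the source code.
Reading data
We follow the steps of the flow diagram, which keeps the whole process clear.
Start with the open() method of the FileSystem class, which takes the path of the data and returns an FSDataInputStream.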
/**
* Opens an FSDataInputStream at the indicated Path.
* @param f the file to open
*/
public FSDataInputStream open(Path f) throws IOException {
return open(f, getConf().getInt("io.file.buffer.size", 4096));
}
/**
* Opens an FSDataInputStream at the indicated Path.
* @param f the file name to open
* @param bufferSize the size of the buffer to be used.
*/
public abstract FSDataInputStream open(Path f, int bufferSize)
throws IOException;
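As the code shows, open(...) receives the file path plus the file buffer size read from the configuration (io.file.buffer.size), which defaults to 4096 bytes (4 KB). Stepping into the two-argument open shows that it is abstract, so the next step is to find the open(...) method of the implementing class. From the flow diagram we know that the FileSystem implementation here is DistributedFileSystem, so we go straight to its open method.
Figure 2: FileSystem implementation classes and methods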
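The delegation from the one-argument open to the abstract two-argument open can be illustrated in isolation. The following is a minimal sketch, not the Hadoop classes: ToyFileSystem and ToyDistributedFileSystem are hypothetical, a plain map stands in for the configuration object, and the returned string stands in for a stream. Only the key io.file.buffer.size and the 4096 default come from the code above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an abstract file system base class.
abstract class ToyFileSystem {
    static final String BUFFER_SIZE_KEY = "io.file.buffer.size";
    static final int DEFAULT_BUFFER_SIZE = 4096; // bytes, i.e. 4 KB

    // A plain map plays the role of the configuration in this sketch.
    protected final Map<String, Integer> conf = new HashMap<>();

    // Convenience overload: fills in the buffer size, then delegates.
    String open(String path) {
        return open(path, conf.getOrDefault(BUFFER_SIZE_KEY, DEFAULT_BUFFER_SIZE));
    }

    // Each concrete file system decides how a file is actually opened.
    abstract String open(String path, int bufferSize);
}

// Hypothetical concrete implementation.
class ToyDistributedFileSystem extends ToyFileSystem {
    @Override
    String open(String path, int bufferSize) {
        return "stream(" + path + ", buffer=" + bufferSize + ")";
    }
}
```

Calling open("/a") on a ToyDistributedFileSystem with an empty configuration yields "stream(/a, buffer=4096)"; after the key is set to 8192, the same call picks up the new value.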
@Override
public FSDataInputStream open(Path f, final int bufferSize)
throws IOException {
//Increment the counter of read operations
statistics.incrementReadOps(1);
//Turn a relative path into an absolute one; not important here
Path absF = fixRelativePart(f);
return new FileSystemLinkResolver<FSDataInputStream>() {
@Override
public FSDataInputStream doCall(final Path p)
throws IOException, UnresolvedLinkException {
//Key code, pay attention here; the pattern is much the same as on the write path
final DFSInputStream dfsis =
dfs.open(getPathName(p), bufferSize, verifyChecksum);
return dfs.createWrappedInputStream(dfsis);
}
@Override
public FSDataInputStream next(final FileSystem fs, final Path p)
throws IOException {
return fs.open(p, bufferSize);
}
}.resolve(this, absF);
}
Focus on the dfs.open(xxx) call above. Before it is invoked, the file path is checked to confirm that the file belongs to the current file system.
/**
* Create an input stream that obtains a nodelist from the namenode, and then
* reads from all the right places. Creates inner subclass of InputStream that
* does the right out-of-band work.
*/
public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
throws IOException, UnresolvedLinkException {
//Check that this DFSClient is still running; not central to the flow
checkOpen();
// Get block info from namenode
TraceScope scope = getPathTraceScope("newDFSInputStream", src);
try {
return new DFSInputStream(this, src, verifyChecksum);
} finally {
scope.close();
}
}
The method ends by returning a new DFSInputStream(xxx), and the constructor in turn calls openInfo().
DFSInputStream(DFSClient dfsClient, String src, boolean verifyChecksum
) throws IOException, UnresolvedLinkException {
this.dfsClient = dfsClient;
this.verifyChecksum = verifyChecksum;
this.src = src;
synchronized (infoLock) {
this.cachingStrategy = dfsClient.getDefaultReadCachingStrategy();
}
openInfo();
}
/**
* Grab the open-file info from namenode
* Fetches from the namenode the block info of the file being opened
*/
void openInfo() throws IOException, UnresolvedLinkException {
synchronized(infoLock) {
//Key point: corresponds to step 2 of the flow diagram, fetching the block info from the namenode
lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
int retriesForLastBlockLength = dfsClient.getConf().retryTimesForGetLastBlockLength;
//To make the read more likely to succeed, a while loop retries fetchLocatedBlocksAndGetLastBlockLength()
while (retriesForLastBlockLength > 0) {
// Getting last block length as -1 is a special case. When cluster
// restarts, DNs may not report immediately. At this time partial block
// locations will not be available with NN for getting the length. Lets
// retry for 3 times to get the length.
if (lastBlockBeingWrittenLength == -1) {
DFSClient.LOG.warn("Last block locations not available. "
+ "Datanodes might not have reported blocks completely."
+ " Will retry for " + retriesForLastBlockLength + " times");
waitFor(dfsClient.getConf().retryIntervalForGetLastBlockLength);
lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
} else {
break;
}
retriesForLastBlockLength--;
}
if (retriesForLastBlockLength == 0) {
throw new IOException("Could not obtain the last block locations.");
}
}
}
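The retry idea in openInfo() can be isolated in a few lines. This is a sketch under stated assumptions: LastBlockLengthRetry and fetchWithRetry are hypothetical names, a LongSupplier stands in for fetchLocatedBlocksAndGetLastBlockLength(), and -1 is the "not available yet" value as in the code above. Unlike the original, the sketch decides whether to fail by looking at the last value returned rather than at the remaining retry count, which is a simplification and not a copy of the Hadoop logic.

```java
import java.util.function.LongSupplier;

// Hypothetical helper illustrating "retry while the length is still -1".
class LastBlockLengthRetry {
    static long fetchWithRetry(LongSupplier fetch, int retries, long intervalMs) {
        long len = fetch.getAsLong();
        while (len == -1 && retries > 0) {
            try {
                // Give the DataNodes time to report their blocks.
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to retry", e);
            }
            len = fetch.getAsLong();
            retries--;
        }
        if (len == -1) {
            throw new IllegalStateException("Could not obtain the last block locations.");
        }
        return len;
    }
}
```

A supplier that returns -1 twice and then 128 succeeds with a budget of three retries; a supplier that always returns -1 ends in the exception.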
This corresponds to the getBlockLocations call in step 2 of the flow diagram; the details are in fetchLocatedBlocksAndGetLastBlockLength().
private long fetchLocatedBlocksAndGetLastBlockLength() throws IOException {
//Call DFSClient's getLocatedBlocks method to get the block locations for the file path
final LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0);
if (DFSClient.LOG.isDebugEnabled()) {
DFSClient.LOG.debug("newInfo = " + newInfo);
}
if (newInfo == null) {
throw new IOException("Cannot open filename " + src);
}
if (locatedBlocks != null) {
Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
while (oldIter.hasNext() && newIter.hasNext()) {
if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
throw new IOException("Blocklist for " + src + " has changed!");
}
}
}
locatedBlocks = newInfo;
long lastBlockBeingWrittenLength = 0;
if (!locatedBlocks.isLastBlockComplete()) {
final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
if (last != null) {
if (last.getLocations().length == 0) {
if (last.getBlockSize() == 0) {
// if the length is zero, then no data has been written to
// datanode. So no need to wait for the locations.
return 0;
}
return -1;
}
final long len = readBlockLength(last);
last.getBlock().setNumBytes(len);
lastBlockBeingWrittenLength = len;
}
}
fileEncryptionInfo = locatedBlocks.getFileEncryptionInfo();
return lastBlockBeingWrittenLength;
}
Focus on DFSClient's getLocatedBlocks() method, which fetches the block locations from the NameNode through callGetBlockLocations().
public LocatedBlocks getLocatedBlocks(String src, long start) throws IOException {
return getLocatedBlocks(src, start, dfsClientConf.prefetchSize);
}
/*
* This is just a wrapper around callGetBlockLocations, but non-static so that
* we can stub it out for tests.
*/
@VisibleForTesting
public LocatedBlocks getLocatedBlocks(String src, long start, long length) throws IOException {
TraceScope scope = getPathTraceScope("getBlockLocations", src);
try {
return callGetBlockLocations(namenode, src, start, length);
} finally {
scope.close();
}
}
/**
* @see ClientProtocol#getBlockLocations(String, long, long)
*/
static LocatedBlocks callGetBlockLocations(ClientProtocol namenode, String src, long start, long length)
throws IOException {
try {
//Remote call to NameNodeRpcServer over RPC
return namenode.getBlockLocations(src, start, length);
} catch (RemoteException re) {
throw re.unwrapRemoteException(AccessControlException.class, FileNotFoundException.class,
UnresolvedPathException.class);
}
}
@Idempotent
public LocatedBlocks getBlockLocations(String src,
long offset,
long length)
throws AccessControlException, FileNotFoundException,
UnresolvedLinkException, IOException;
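The return value is a LocatedBlocks object, which contains a List<LocatedBlock> blocks. Each LocatedBlock wraps the block itself, the block's offset within the file, and the locations of the DataNodes that hold it. Under the hood, this is an RPC call to the getBlockLocations() method of NameNodeRpcServer.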
@Override // ClientProtocol
public LocatedBlocks getBlockLocations(String src,
long offset,
long length)
throws IOException {
//Check that the NameNode has started
checkNNStartup();
//Increment the metric that counts getBlockLocations calls
metrics.incrGetBlockLocations();
return namesystem.getBlockLocations(getClientMachine(),
src, offset, length);
}
/**
* Get block locations within the specified range.
* @see ClientProtocol#getBlockLocations(String, long, long)
*/
LocatedBlocks getBlockLocations(String clientMachine, String src,
long offset, long length) throws IOException {
checkOperation(OperationCategory.READ);
//Holder for the block location result
GetBlockLocationsResult res = null;
readLock();
try {
checkOperation(OperationCategory.READ);
//Get the block locations
res = getBlockLocations(src, offset, length, true, true);
} catch (AccessControlException e) {
logAuditEvent(false, "open", src);
throw e;
} finally {
readUnlock();
}
logAuditEvent(true, "open", src);
if (res.updateAccessTime()) {
writeLock();
final long now = now();
try {
checkOperation(OperationCategory.WRITE);
INode inode = res.iip.getLastINode();
boolean updateAccessTime = now > inode.getAccessTime() +
getAccessTimePrecision();
if (!isInSafeMode() && updateAccessTime) {
boolean changed = FSDirAttrOp.setTimes(dir,
inode, -1, now, false, res.iip.getLatestSnapshotId());
if (changed) {
getEditLog().logTimes(src, -1, now);
}
}
} catch (Throwable e) {
LOG.warn("Failed to update the access time of " + src, e);
} finally {
writeUnlock();
}
}
//Take the LocatedBlocks out of the result
LocatedBlocks blocks = res.blocks;
if (blocks != null) {
blockManager.getDatanodeManager().sortLocatedBlocks(
clientMachine, blocks.getLocatedBlocks());
// lastBlock is not part of getLocatedBlocks(), might need to sort it too
//Get the location info of the last block
LocatedBlock lastBlock = blocks.getLastLocatedBlock();
if (lastBlock != null) {
ArrayList<LocatedBlock> lastBlockList = Lists.newArrayList(lastBlock);
blockManager.getDatanodeManager().sortLocatedBlocks(
clientMachine, lastBlockList);
}
}
//Return the LocatedBlocks object, which wraps the locations of the target file's blocks
return blocks;
}
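The sortLocatedBlocks(...) calls above order each block's replicas relative to the client machine, which is what later lets the client pick a nearby DataNode. The following sketch shows the general idea only. ToyReplicaSorter is hypothetical, the "/rack/host" location strings are an assumed format, and the distances 0, 2 and 4 (same node, same rack, different rack) are an assumption of this toy model; everything else the real method takes into account is ignored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sorter: orders replica locations by a toy network distance.
class ToyReplicaSorter {
    // Locations are written as "/rack/host" in this sketch (assumed format).
    static int distance(String client, String replica) {
        if (client.equals(replica)) {
            return 0; // same node
        }
        String clientRack = client.substring(0, client.lastIndexOf('/'));
        String replicaRack = replica.substring(0, replica.lastIndexOf('/'));
        return clientRack.equals(replicaRack) ? 2 : 4; // same rack, else another rack
    }

    // Returns a new list with the closest replicas first; the input is not modified.
    static List<String> sortByDistance(String client, List<String> replicas) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(r -> distance(client, r)));
        return sorted;
    }
}
```

For a client at /r1/h1 and replicas on /r2/h9, /r1/h2 and /r1/h1, the sorted order is /r1/h1, /r1/h2, /r2/h9.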
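That completes the analysis of step 2, fetching the block locations. The returned DFSInputStream is then passed to createWrappedInputStream(...) to be wrapped once more. Next, using the LocatedBlocks information returned by the NameNode, the client calls FSDataInputStream's read() method.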
/**
* Read bytes from the given position in the stream to the given buffer.
*
* @param position position in the input stream to seek
* @param buffer buffer into which data is read
* @param offset offset into the buffer in which data is written
* @param length maximum number of bytes to read
* @return total number of bytes read into the buffer, or <code>-1</code>
* if there is no more data because the end of the stream has been
* reached
*/
@Override
public int read(long position, byte[] buffer, int offset, int length)
throws IOException {
return ((PositionedReadable)in).read(position, buffer, offset, length);
}
FSDataInputStream delegates to the read(xxx) method of the DFSInputStream it wraps.
/**
* Read bytes starting from the specified position.
*
* @param position start read from this position
* @param buffer read buffer
* @param offset offset into buffer
* @param length number of bytes to read
*
* @return actual number of bytes read
*/
@Override
public int read(long position, byte[] buffer, int offset, int length)
throws IOException {
TraceScope scope =
dfsClient.getPathTraceScope("DFSInputStream#byteArrayPread", src);
try {
return pread(position, buffer, offset, length);
} finally {
scope.close();
}
}
private int pread(long position, byte[] buffer, int offset, int length)
throws IOException {
// sanity checks: make sure the client is still running
dfsClient.checkOpen();
if (closed.get()) {
throw new IOException("Stream closed");
}
failures = 0;
//Get the file length
long filelen = getFileLength();
if ((position < 0) || (position >= filelen)) {
return -1;
}
int realLen = length;
if ((position + length) > filelen) {
realLen = (int)(filelen - position);
}
// determine the block and byte range within the block
// corresponding to position and realLen
//Get the list of blocks covering the range from position to position + realLen
List<LocatedBlock> blockRange = getBlockRange(position, realLen);
int remaining = realLen;
Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap
= new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
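//Iterate over the block list and read the required data block by block; the requested data does not necessarily sit in a single block and is often spread across several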
for (LocatedBlock blk : blockRange) {
long targetStart = position - blk.getStartOffset();
long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
try {
if (dfsClient.isHedgedReadsEnabled()) {
hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
- 1, buffer, offset, corruptedBlockMap);
} else {
fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
buffer, offset, corruptedBlockMap);
}
} finally {
// Check and report if any block replicas are corrupted.
// BlockMissingException may be caught if all block replicas are
// corrupted.
reportCheckSumFailure(corruptedBlockMap, blk.getLocations().length);
}
remaining -= bytesToRead;
position += bytesToRead;
offset += bytesToRead;
}
assert remaining == 0 : "Wrong number of bytes read.";
if (dfsClient.stats != null) {
dfsClient.stats.incrementBytesRead(realLen);
}
return realLen;
}
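The arithmetic in the pread loop (targetStart, bytesToRead, and the inclusive end offset passed to fetchBlockByteRange) is easy to check by hand with a small planner. This is a sketch under assumptions: ToyPreadPlanner and plan() are hypothetical, every block is assumed to have the same fixed size, and the block index is computed by division instead of being looked up in a LocatedBlock list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: splits one positional read into per-block byte ranges.
class ToyPreadPlanner {
    // Each entry is {blockIndex, startInBlock, endInBlockInclusive}.
    static List<long[]> plan(long position, int length, long fileLen, long blockSize) {
        List<long[]> ranges = new ArrayList<>();
        if (position < 0 || position >= fileLen) {
            return ranges; // nothing to read, mirrors the "return -1" case
        }
        // Clamp the request to the end of the file, like realLen.
        long remaining = Math.min(length, fileLen - position);
        while (remaining > 0) {
            long blockIndex = position / blockSize;
            long targetStart = position - blockIndex * blockSize;
            long bytesToRead = Math.min(remaining, blockSize - targetStart);
            ranges.add(new long[] {blockIndex, targetStart, targetStart + bytesToRead - 1});
            remaining -= bytesToRead;
            position += bytesToRead;
        }
        return ranges;
    }
}
```

With 128-byte blocks and a 1000-byte file, reading 100 bytes from position 100 produces two ranges: bytes 100 to 127 of block 0 and bytes 0 to 71 of block 1.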
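Next, getBlockRange(xxx): it returns the blocks that cover the given range, taking them from the cached locatedBlocks first and asking the NameNode only when they are not cached.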
/**
* Get blocks in the specified range.
* Fetch them from the namenode if not cached. This function
* will not get a read request beyond the EOF.
* @param offset starting offset in file
* @param length length of data
* @return consequent segment of located blocks
* @throws IOException
*/
private List<LocatedBlock> getBlockRange(long offset,
long length) throws IOException {
// getFileLength(): returns total file length
// locatedBlocks.getFileLength(): returns length of completed blocks
//Normally offset is smaller than the file length
if (offset >= getFileLength()) {
throw new IOException("Offset: " + offset +
" exceeds file length: " + getFileLength());
}
//As mentioned earlier, a block is in one of two states: complete (fully written) or still under construction
synchronized(infoLock) {
final List<LocatedBlock> blocks;
//Get the total length of the completed blocks
final long lengthOfCompleteBlk = locatedBlocks.getFileLength();
final boolean readOffsetWithinCompleteBlk = offset < lengthOfCompleteBlk;
final boolean readLengthPastCompleteBlk = offset + length > lengthOfCompleteBlk;
if (readOffsetWithinCompleteBlk) {
//get the blocks of finalized (completed) block range,
blocks = getFinalizedBlockRange(offset,
Math.min(length, lengthOfCompleteBlk - offset));
} else {
blocks = new ArrayList<LocatedBlock>(1);
}
// get the blocks from incomplete block range
if (readLengthPastCompleteBlk) {
blocks.add(locatedBlocks.getLastLocatedBlock());
}
return blocks;
}
}
/**
* Get blocks in the specified range.
* Includes only the complete blocks.
* Fetch them from the namenode if not cached.
*/
private List<LocatedBlock> getFinalizedBlockRange(
long offset, long length) throws IOException {
synchronized(infoLock) {
assert (locatedBlocks != null) : "locatedBlocks is null";
List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();
// search cached blocks first
//First look in the cached locatedBlocks for the index of the block that contains offset
int blockIdx = locatedBlocks.findBlock(offset);
if (blockIdx < 0) { // block is not cached
blockIdx = LocatedBlocks.getInsertIndex(blockIdx);
}
long remaining = length;
long curOff = offset;
while(remaining > 0) {
LocatedBlock blk = null;
if(blockIdx < locatedBlocks.locatedBlockCount())
//Look up the block by blockIdx
blk = locatedBlocks.get(blockIdx);
//Not cached: fetch the blocks from the NameNode and add them to the cache
if (blk == null || curOff < blk.getStartOffset()) {
LocatedBlocks newBlocks;
newBlocks = dfsClient.getLocatedBlocks(src, curOff, remaining);
locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());
continue;
}
assert curOff >= blk.getStartOffset() : "Block not found";
blockRange.add(blk);
long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;
remaining -= bytesRead;
curOff += bytesRead;
//Move on to the next block
blockIdx++;
}
return blockRange;
}
}
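The pair findBlock(...) and getInsertIndex(...) follows the familiar binary-search convention: a non-negative result is the index of the block containing the offset, and a negative result encodes where a missing block would have to be inserted. The sketch below illustrates that convention with plain arrays. ToyBlockIndex is hypothetical and the encoding -(insertionPoint + 1) is an assumption borrowed from java.util.Collections.binarySearch, not taken from the code shown above.

```java
// Hypothetical index over cached blocks, sorted by start offset.
class ToyBlockIndex {
    // starts[i] is the file offset where cached block i begins, sizes[i] its length.
    // Returns the index of the block containing offset, or -(insertionPoint + 1).
    static int findBlock(long[] starts, long[] sizes, long offset) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offset < starts[mid]) {
                hi = mid - 1;
            } else if (offset >= starts[mid] + sizes[mid]) {
                lo = mid + 1;
            } else {
                return mid; // offset falls inside cached block mid
            }
        }
        return -(lo + 1); // not cached
    }

    // Converts a findBlock result into the position to read from or insert at.
    static int getInsertIndex(int result) {
        return result >= 0 ? result : -(result + 1);
    }
}
```

With cached blocks starting at 0, 128 and 384 (each 128 bytes long), offset 130 is found at index 1, while offset 300 falls in the uncached gap and maps to insert index 2.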
Now look at fetchBlockByteRange(...), which pread(xxx) calls.
private void fetchBlockByteRange(LocatedBlock block, long start, long end,
byte[] buf, int offset,
Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
throws IOException {
//Get the current LocatedBlock for this start offset
block = getBlockAt(block.getStartOffset());
//The familiar while loop, to maximize the chance of getting the data
while (true) {
//Choose a nearby DataNode to read from
DNAddrPair addressPair = chooseDataNode(block, null);
try {
//Read from the chosen DataNode, starting at the given offset within the block; the return ends the loop once the read completes
actualGetFromOneDataNode(addressPair, block, start, end, buf, offset,
corruptedBlockMap);
return;
} catch (IOException e) {
// Ignore. Already processed inside the function.
// Loop through to try the next node.
}
}
}
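The loop above pairs "choose a DataNode" with "if the read fails, remember the node as dead and try the next one". A minimal sketch of that pattern follows. ToyFailoverReader and fetch() are hypothetical, a Function stands in for the actual network read, and the sketch gives up after a single pass over the replicas, which is a simplification of how the real client decides to stop.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical reader that skips nodes already known to be bad.
class ToyFailoverReader {
    final Set<String> deadNodes = new HashSet<>();

    // replicas is assumed to be ordered nearest first.
    byte[] fetch(List<String> replicas, Function<String, byte[]> readFrom) {
        for (String node : replicas) {
            if (deadNodes.contains(node)) {
                continue; // never retry a node that already failed
            }
            try {
                return readFrom.apply(node);
            } catch (RuntimeException e) {
                deadNodes.add(node); // remember the failure, move to the next replica
            }
        }
        throw new IllegalStateException("No live replica left for this block");
    }
}
```

Because deadNodes lives on the reader, a node that failed for one block is also skipped when later blocks are read through the same reader.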
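Inside actualGetFromOneDataNode(...), once the data has been fetched, the finally block calls reader.close() to release the BlockReader and end the connection.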
private void actualGetFromOneDataNode(final DNAddrPair datanode,
LocatedBlock block, final long start, final long end, byte[] buf,
int offset, Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
throws IOException {
DFSClientFaultInjector.get().startFetchFromDatanode();
int refetchToken = 1; // only need to get a new access token once
int refetchEncryptionKey = 1; // only need to get a new encryption key once
while (true) {
// cached block locations may have been updated by chooseDataNode()
// or fetchBlockAt(). Always get the latest list of locations at the
// start of the loop.
CachingStrategy curCachingStrategy;
boolean allowShortCircuitLocalReads;
block = getBlockAt(block.getStartOffset());
synchronized(infoLock) {
curCachingStrategy = cachingStrategy;
allowShortCircuitLocalReads = !shortCircuitForbidden();
}
DatanodeInfo chosenNode = datanode.info;
InetSocketAddress targetAddr = datanode.addr;
StorageType storageType = datanode.storageType;
//Declare the BlockReader
BlockReader reader = null;
try {
DFSClientFaultInjector.get().fetchFromDatanodeException();
Token<BlockTokenIdentifier> blockToken = block.getBlockToken();
int len = (int) (end - start + 1);
//The reader reads the data from the DataNode, opening a socket connection to it
reader = new BlockReaderFactory(dfsClient.getConf()).
setInetSocketAddress(targetAddr).
setRemotePeerFactory(dfsClient).
setDatanodeInfo(chosenNode).
setStorageType(storageType).
setFileName(src).
setBlock(block.getBlock()).
setBlockToken(blockToken).
setStartOffset(start).
setVerifyChecksum(verifyChecksum).
setClientName(dfsClient.clientName).
setLength(len).
setCachingStrategy(curCachingStrategy).
setAllowShortCircuitLocalReads(allowShortCircuitLocalReads).
setClientCacheContext(dfsClient.getClientContext()).
setUserGroupInformation(dfsClient.ugi).
setConfiguration(dfsClient.getConfiguration()).
build();
//Read the data
int nread = reader.readAll(buf, offset, len);
updateReadStatistics(readStatistics, nread, reader);
if (nread != len) {
throw new IOException("truncated return from reader.read(): " +
"excpected " + len + ", got " + nread);
}
DFSClientFaultInjector.get().readFromDatanodeDelay();
return;
} catch (ChecksumException e) {
String msg = "fetchBlockByteRange(). Got a checksum exception for "
+ src + " at " + block.getBlock() + ":" + e.getPos() + " from "
+ chosenNode;
DFSClient.LOG.warn(msg);
// we want to remember what we have tried
addIntoCorruptedBlockMap(block.getBlock(), chosenNode, corruptedBlockMap);
//If the read fails, mark this DataNode as a bad node
addToDeadNodes(chosenNode);
throw new IOException(msg);
} catch (IOException e) {
if (e instanceof InvalidEncryptionKeyException && refetchEncryptionKey > 0) {
DFSClient.LOG.info("Will fetch a new encryption key and retry, "
+ "encryption key was invalid when connecting to " + targetAddr
+ " : " + e);
// The encryption key used is invalid.
refetchEncryptionKey--;
dfsClient.clearDataEncryptionKey();
continue;
} else if (refetchToken > 0 && tokenRefetchNeeded(e, targetAddr)) {
refetchToken--;
try {
fetchBlockAt(block.getStartOffset());
} catch (IOException fbae) {
// ignore IOE, since we can retry it later in a loop
}
continue;
} else {
String msg = "Failed to connect to " + targetAddr + " for file "
+ src + " for block " + block.getBlock() + ":" + e;
DFSClient.LOG.warn("Connection failure: " + msg, e);
addToDeadNodes(chosenNode);
throw new IOException(msg);
}
} finally {
if (reader != null) {
reader.close();
}
}
}
}
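The counters refetchToken and refetchEncryptionKey above implement a one-shot retry budget: a recoverable credential problem is repaired once and the read is retried, and a second failure of the same kind is passed on to the caller. The sketch below shows that pattern on its own. ToyTokenRetry, TokenExpiredException and readWithTokenRefresh are hypothetical names, and a Supplier and a Runnable stand in for the block read and the credential refresh.

```java
import java.util.function.Supplier;

// Hypothetical illustration of "refresh the credential at most once, then give up".
class ToyTokenRetry {
    static class TokenExpiredException extends RuntimeException {}

    static <T> T readWithTokenRefresh(Supplier<T> read, Runnable refreshToken) {
        int refetchToken = 1; // only one refresh is allowed
        while (true) {
            try {
                return read.get();
            } catch (TokenExpiredException e) {
                if (refetchToken <= 0) {
                    throw e; // budget used up, let the caller handle it
                }
                refetchToken--;
                refreshToken.run();
            }
        }
    }
}
```

A read that fails once with TokenExpiredException succeeds on the second attempt after exactly one refresh; a read that keeps failing is attempted twice and then the exception propagates.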
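This article analyzed the HDFS read path. The HDFS client calls methods on the two core classes DistributedFileSystem and FSDataInputStream, invokes the corresponding NameNode implementation over RPC to obtain the metadata of the target blocks, and then uses that metadata to read from the DataNodes that hold them. When the last block has been read, the client closes its data stream to the DataNode and the read is complete. That concludes the analysis of the core read-path source code. It can be read together with the previous article, 《Hadoop核心源码剖析(写数据)》 (Hadoop core source code analysis: writing data), which may yield more insight. The finer details are left for you to explore yourself.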
Summary
This article also concludes the Hadoop HDFS core source code analysis series; the main core modules of HDFS have largely been covered.