Author | 吴邪


In this article we walk through how HDFS reads data. Compared with the write flow, the read flow is quite a bit simpler, and with this piece our dissection of HDFS core code comes to a close. The series has covered five core areas: NameNode initialization, DataNode initialization, metadata management, the HDFS write flow, and the HDFS read flow. HDFS is an architecture with on the order of a million lines of code, far more than can be covered exhaustively, so the series picks out only its key, core functionality.

 

The HDFS Read Flow


Figure 1: Read data flow diagram

 

  1. The HDFS client calls the open() method of DistributedFileSystem (the implementation class of FileSystem).
  2. DistributedFileSystem communicates with the NameNode over the NameNode RPC interface, calling getBlockLocations() to request the storage locations of the file's blocks.
  3. DistributedFileSystem returns an FSDataInputStream to the client, carrying the metadata of the target blocks; the client calls the FSDataInputStream's read() method, which contacts the DataNode nearest to the target block.
  4. The client reads from the DataNode as a data stream, so read() can be called repeatedly; each needed block is read from the nearest DataNode, and if a DataNode misbehaves during the read, the client falls back to the next DataNode holding the data, until the last block has been read.
  5. Once the file has been read in full, close() is called to shut down the connections to the DataNodes and the NameNode. (A minimal sketch of these steps follows below.)
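
Before diving into the source, here is a minimal client-side sketch of the five steps above; the configuration and the path /tmp/demo.txt are illustrative assumptions, not taken from the HDFS source.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Step 1: open() on FileSystem (DistributedFileSystem for hdfs:// URIs)
    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(new Path("/tmp/demo.txt"))) {
      // Steps 3 and 4: read() streams bytes from the nearest DataNode of each block
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    } // Step 5: close() runs via try-with-resources
  }
}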

 

As the flow chart shows, reading data follows broadly the same overall shape as writing it, only simpler. Let's get into the source code.

 

Reading the Data

As before, we follow the flow chart step by step so the whole process stays clear.

Start with the open() method of the FileSystem class, which takes the data path and returns an FSDataInputStream.

/**
 * Opens an FSDataInputStream at the indicated Path.
 * @param f the file to open
 */
public FSDataInputStream open(Path f) throws IOException {
  return open(f, getConf().getInt("io.file.buffer.size", 4096));
}
/**
 * Opens an FSDataInputStream at the indicated Path.
 * @param f the file name to open
 * @param bufferSize the size of the buffer to be used.
 */
public abstract FSDataInputStream open(Path f, int bufferSize)
  throws IOException;

Clearly, open(...) receives the file path plus a read buffer size, which defaults to 4096 bytes (4 KB) via the io.file.buffer.size setting. Stepping into the two-argument open() we find it is abstract, so the next step is to find the implementing class's open(...). From the flow chart we know the FileSystem implementation here is DistributedFileSystem, so heading straight for DistributedFileSystem's open() method is the right move.
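
As a quick aside, here is a hedged sketch of tuning that buffer, either globally through io.file.buffer.size or per stream via the two-argument open() overload; the values and paths are illustrative assumptions, and the imports are the same as in the first sketch.

Configuration conf = new Configuration();
conf.setInt("io.file.buffer.size", 131072);  // raise the 4 KB default to 128 KB
try (FileSystem fs = FileSystem.get(conf);
     FSDataInputStream in1 = fs.open(new Path("/tmp/a.txt"));          // picks up 128 KB from conf
     FSDataInputStream in2 = fs.open(new Path("/tmp/b.txt"), 65536)) { // explicit 64 KB buffer
  // ... read as usual
}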


Figure 2: Implementation classes and methods of FileSystem

@Override
public FSDataInputStream open(Path f, final int bufferSize)
    throws IOException {
  // increment and record the count of read operations
  statistics.incrementReadOps(1);
  // resolve a relative path into an absolute one; not essential to the flow
  Path absF = fixRelativePart(f);
  return new FileSystemLinkResolver<FSDataInputStream>() {
    @Override
    public FSDataInputStream doCall(final Path p)
        throws IOException, UnresolvedLinkException {
      // the key call, worth close attention: same pattern as on the write path
      final DFSInputStream dfsis =
        dfs.open(getPathName(p), bufferSize, verifyChecksum);
      return dfs.createWrappedInputStream(dfsis);
    }
    @Override
    public FSDataInputStream next(final FileSystem fs, final Path p)
        throws IOException {
      return fs.open(p, bufferSize);
    }
  }.resolve(this, absF);
}

Focus on the dfs.open(...) call above; before it is invoked, the path is checked to confirm the file belongs to the current file system.

/**
 * Create an input stream that obtains a nodelist from the namenode, and then
 * reads from all the right places. Creates inner subclass of InputStream that
 * does the right out-of-band work.
 */
public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
      throws IOException, UnresolvedLinkException {
    // make sure the client is still open; an incidental check here
   checkOpen();
   // Get block info from namenode
   TraceScope scope = getPathTraceScope("newDFSInputStream", src);
   try {
      return new DFSInputStream(this, src, verifyChecksum);
   } finally {
      scope.close();
   }
}

The method finally returns via the DFSInputStream(...) constructor, which in turn calls openInfo().

DFSInputStream(DFSClient dfsClient, String src, boolean verifyChecksum
               ) throws IOException, UnresolvedLinkException {
  this.dfsClient = dfsClient;
  this.verifyChecksum = verifyChecksum;
  this.src = src;
  synchronized (infoLock) {
    this.cachingStrategy = dfsClient.getDefaultReadCachingStrategy();
  }
  openInfo();
}
/**
 * Grab the open-file info from namenode
 * i.e., fetch the block info of the file being opened from the NameNode
 */
void openInfo() throws IOException, UnresolvedLinkException {
  synchronized(infoLock) {
  // key line: step 2 of the flow chart, fetching block info from the NameNode
    lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
    int retriesForLastBlockLength = dfsClient.getConf().retryTimesForGetLastBlockLength;
    // to improve the odds of success, a while loop retries fetchLocatedBlocksAndGetLastBlockLength()
    while (retriesForLastBlockLength > 0) {
      // Getting last block length as -1 is a special case. When cluster
      // restarts, DNs may not report immediately. At this time partial block
      // locations will not be available with NN for getting the length. Lets
      // retry for 3 times to get the length.
      if (lastBlockBeingWrittenLength == -1) {
        DFSClient.LOG.warn("Last block locations not available. "
            + "Datanodes might not have reported blocks completely."
            + " Will retry for " + retriesForLastBlockLength + " times");
        waitFor(dfsClient.getConf().retryIntervalForGetLastBlockLength);
        lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
      } else {
        break;
      }
      retriesForLastBlockLength--;
    }
    if (retriesForLastBlockLength == 0) {
      throw new IOException("Could not obtain the last block locations.");
    }
  }
}

The getBlockLocations() call of step 2 in the flow chart is reached through fetchLocatedBlocksAndGetLastBlockLength(); the details are in that method.

private long fetchLocatedBlocksAndGetLastBlockLength() throws IOException {
// call DFSClient's getLocatedBlocks() to fetch the block locations for this file path
  final LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0);
  if (DFSClient.LOG.isDebugEnabled()) {
    DFSClient.LOG.debug("newInfo = " + newInfo);
  }
  if (newInfo == null) {
    throw new IOException("Cannot open filename " + src);
  }
  if (locatedBlocks != null) {
    Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
    Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
    while (oldIter.hasNext() && newIter.hasNext()) {
      if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
        throw new IOException("Blocklist for " + src + " has changed!");
      }
    }
  }
  locatedBlocks = newInfo;
  long lastBlockBeingWrittenLength = 0;
  if (!locatedBlocks.isLastBlockComplete()) {
    final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
    if (last != null) {
      if (last.getLocations().length == 0) {
        if (last.getBlockSize() == 0) {
          // if the length is zero, then no data has been written to
          // datanode. So no need to wait for the locations.
          return 0;
        }
        return -1;
      }
      final long len = readBlockLength(last);
      last.getBlock().setNumBytes(len);
      lastBlockBeingWrittenLength = len; 
    }
  }


  fileEncryptionInfo = locatedBlocks.getFileEncryptionInfo();


  return lastBlockBeingWrittenLength;
}

Now focus on DFSClient's getLocatedBlocks() method, which obtains the block locations from the NameNode; it bottoms out in an RPC to the NameNode's getBlockLocations().

public LocatedBlocks getLocatedBlocks(String src, long start) throws IOException {
   return getLocatedBlocks(src, start, dfsClientConf.prefetchSize);
}


/*
 * This is just a wrapper around callGetBlockLocations, but non-static so that
 * we can stub it out for tests.
 */
@VisibleForTesting
public LocatedBlocks getLocatedBlocks(String src, long start, long length) throws IOException {
   TraceScope scope = getPathTraceScope("getBlockLocations", src);
   try {
      return callGetBlockLocations(namenode, src, start, length);
   } finally {
      scope.close();
   }
}


/**
 * @see ClientProtocol#getBlockLocations(String, long, long)
 */
static LocatedBlocks callGetBlockLocations(ClientProtocol namenode, String src, long start, long length)
      throws IOException {
   try {
   // remote RPC call into NameNodeRpcServer
      return namenode.getBlockLocations(src, start, length);
   } catch (RemoteException re) {
      throw re.unwrapRemoteException(AccessControlException.class, FileNotFoundException.class,
            UnresolvedPathException.class);
   }
}


@Idempotent
public LocatedBlocks getBlockLocations(String src,
                                       long offset,
                                       long length) 
    throws AccessControlException, FileNotFoundException,
    UnresolvedLinkException, IOException;

The return value is a LocatedBlocks object wrapping a List<LocatedBlock>; each entry carries the block itself, the block's offset within the file, and the locations of the DataNodes holding its replicas.
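
To make that structure concrete, here is a small sketch that walks a LocatedBlocks instance and prints what it carries; dumpBlockLocations is a hypothetical helper for illustration, not part of the HDFS source.

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

// e.g. blocks = dfsClient.getLocatedBlocks(src, 0)
static void dumpBlockLocations(LocatedBlocks blocks) {
  for (LocatedBlock lb : blocks.getLocatedBlocks()) {
    System.out.println("block=" + lb.getBlock().getBlockName()
        + " fileOffset=" + lb.getStartOffset()
        + " size=" + lb.getBlockSize());
    // replicas of this block, sorted nearest-first for the calling client
    for (DatanodeInfo dn : lb.getLocations()) {
      System.out.println("  replica at " + dn.getXferAddr());
    }
  }
}

Back in the flow: underneath, the client-side call is an RPC into the getBlockLocations() implementation of NameNodeRpcServer.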

@Override // ClientProtocol
public LocatedBlocks getBlockLocations(String src, 
                                        long offset, 
                                        long length) 
    throws IOException {
  // check that the NameNode has started
  checkNNStartup();
  // count this getBlockLocations call in the metrics
  metrics.incrGetBlockLocations();
  return namesystem.getBlockLocations(getClientMachine(), 
                                      src, offset, length);
}


/**
 * Get block locations within the specified range.
 * @see ClientProtocol#getBlockLocations(String, long, long)
 */
LocatedBlocks getBlockLocations(String clientMachine, String src,
    long offset, long length) throws IOException {
  checkOperation(OperationCategory.READ);
  // holder for the block-location lookup result
  GetBlockLocationsResult res = null;
  readLock();
  try {
    checkOperation(OperationCategory.READ);
    // fetch the block locations
    res = getBlockLocations(src, offset, length, true, true);
  } catch (AccessControlException e) {
    logAuditEvent(false, "open", src);
    throw e;
  } finally {
    readUnlock();
  }
  logAuditEvent(true, "open", src);
  if (res.updateAccessTime()) {
    writeLock();
    final long now = now();
    try {
      checkOperation(OperationCategory.WRITE);
      INode inode = res.iip.getLastINode();
      boolean updateAccessTime = now > inode.getAccessTime() +
          getAccessTimePrecision();
      if (!isInSafeMode() && updateAccessTime) {
        boolean changed = FSDirAttrOp.setTimes(dir,
            inode, -1, now, false, res.iip.getLatestSnapshotId());
        if (changed) {
          getEditLog().logTimes(src, -1, now);
        }
      }
    } catch (Throwable e) {
      LOG.warn("Failed to update the access time of " + src, e);
    } finally {
      writeUnlock();
    }
  }
  // hand the retrieved block info over as a LocatedBlocks reference
  LocatedBlocks blocks = res.blocks;
  if (blocks != null) {
    blockManager.getDatanodeManager().sortLocatedBlocks(
        clientMachine, blocks.getLocatedBlocks());
    // lastBlock is not part of getLocatedBlocks(), might need to sort it too
    // also fetch the location info of the last block
    LocatedBlock lastBlock = blocks.getLastLocatedBlock();
    if (lastBlock != null) {
      ArrayList<LocatedBlock> lastBlockList = Lists.newArrayList(lastBlock);
      blockManager.getDatanodeManager().sortLocatedBlocks(
          clientMachine, lastBlockList);
    }
  }
  // return the LocatedBlocks object wrapping the locations of every block in the target file
  return blocks;
}

That wraps up the analysis of step 2, fetching the block locations. As on the write path, the DFSInputStream that comes back is wrapped once more by createWrappedInputStream(...). Next, using the LocatedBlocks the NameNode returned, the client invokes FSDataInputStream's read() method.
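
For reference, a minimal sketch of such a positioned read from the client side (the path and sizes are illustrative; imports as in the first sketch). Unlike the sequential read() used earlier, a positioned read does not move the stream's current offset, so it can safely be issued from multiple threads.

try (FileSystem fs = FileSystem.get(new Configuration());
     FSDataInputStream in = fs.open(new Path("/tmp/demo.txt"))) {
  byte[] buf = new byte[1024];
  // read up to 1024 bytes starting at byte offset 4096 of the file
  int n = in.read(4096L, buf, 0, buf.length);
}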


/**
 * Read bytes from the given position in the stream to the given buffer.
 *
 * @param position  position in the input stream to seek
 * @param buffer    buffer into which data is read
 * @param offset    offset into the buffer in which data is written
 * @param length    maximum number of bytes to read
 * @return total number of bytes read into the buffer, or <code>-1</code>
 *         if there is no more data because the end of the stream has been
 *         reached
 */
@Override
public int read(long position, byte[] buffer, int offset, int length)
  throws IOException {
  return ((PositionedReadable)in).read(position, buffer, offset, length);
}

FSDataInputStream delegates read(...) to the DFSInputStream it wraps.

/**
 * Read bytes starting from the specified position.
 * 
 * @param position start read from this position
 * @param buffer read buffer
 * @param offset offset into buffer
 * @param length number of bytes to read
 * 
 * @return actual number of bytes read
 */
@Override
public int read(long position, byte[] buffer, int offset, int length)
    throws IOException {
  TraceScope scope =
      dfsClient.getPathTraceScope("DFSInputStream#byteArrayPread", src);
  try {
    return pread(position, buffer, offset, length);
  } finally {
    scope.close();
  }
}




private int pread(long position, byte[] buffer, int offset, int length)
    throws IOException {
  // sanity checks: make sure the client is still running
  dfsClient.checkOpen();
  if (closed.get()) {
    throw new IOException("Stream closed");
  }
  failures = 0;
  // get the total file length
  long filelen = getFileLength();
  if ((position < 0) || (position >= filelen)) {
    return -1;
  }
  int realLen = length;
  if ((position + length) > filelen) {
    realLen = (int)(filelen - position);
  }
  
  // determine the block and byte range within the block
  // corresponding to position and realLen
  // get the list of blocks covering position through position + realLen
  List<LocatedBlock> blockRange = getBlockRange(position, realLen);
  int remaining = realLen;
  Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap 
    = new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
   // iterate over the block list and read the needed data; the requested range rarely fits in a single block and usually spans several
  for (LocatedBlock blk : blockRange) {
    long targetStart = position - blk.getStartOffset();
    long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
    try {
      if (dfsClient.isHedgedReadsEnabled()) {
        hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
            - 1, buffer, offset, corruptedBlockMap);
      } else {
        fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
            buffer, offset, corruptedBlockMap);
      }
    } finally {
      // Check and report if any block replicas are corrupted.
      // BlockMissingException may be caught if all block replicas are
      // corrupted.
      reportCheckSumFailure(corruptedBlockMap, blk.getLocations().length);
    }
    remaining -= bytesToRead;
    position += bytesToRead;
    offset += bytesToRead;
  }
  assert remaining == 0 : "Wrong number of bytes read.";
  if (dfsClient.stats != null) {
    dfsClient.stats.incrementBytesRead(realLen);
  }
  return realLen;
}
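
To make the per-block arithmetic in that loop concrete (the numbers are illustrative): with a 128 MB block size, a pread at position = 300 MB for realLen = 100 MB starts in the third block, whose getStartOffset() is 256 MB, so targetStart = 300 - 256 = 44 MB and bytesToRead = min(100, 128 - 44) = 84 MB come from that block; after position and offset advance, the remaining 16 MB are read from the start of the fourth block, where targetStart = 0.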

Let's analyze the getBlockRange(xxx) method: it resolves the requested range into a list of blocks, serving from the client-side cache first and fetching from the NameNode only on a miss.

/**
 * Get blocks in the specified range.
 * Fetch them from the namenode if not cached. This function
 * will not get a read request beyond the EOF.
 * @param offset starting offset in file
 * @param length length of data
 * @return consequent segment of located blocks
 * @throws IOException
 */
private List<LocatedBlock> getBlockRange(long offset,
    long length)  throws IOException {
  // getFileLength(): returns total file length
  // locatedBlocks.getFileLength(): returns length of completed blocks
  // the offset must be less than the file length
  if (offset >= getFileLength()) {
    throw new IOException("Offset: " + offset +
      " exceeds file length: " + getFileLength());
  }
  // as mentioned before, a block is in one of two states: complete (fully written) or incomplete (still under construction)
  synchronized(infoLock) {
    final List<LocatedBlock> blocks;
    // the length covered by the completed blocks
    final long lengthOfCompleteBlk = locatedBlocks.getFileLength();
    final boolean readOffsetWithinCompleteBlk = offset < lengthOfCompleteBlk;
    final boolean readLengthPastCompleteBlk = offset + length > lengthOfCompleteBlk;


    if (readOffsetWithinCompleteBlk) {
      //get the blocks of finalized (completed) block range,
      blocks = getFinalizedBlockRange(offset,
        Math.min(length, lengthOfCompleteBlk - offset));
    } else {
      blocks = new ArrayList<LocatedBlock>(1);
    }


    // get the blocks from incomplete block range
    if (readLengthPastCompleteBlk) {
       blocks.add(locatedBlocks.getLastLocatedBlock());
    }


    return blocks;
  }
}


/**
 * Get blocks in the specified range.
 * Includes only the complete blocks.
 * Fetch them from the namenode if not cached.
 */
private List<LocatedBlock> getFinalizedBlockRange(
    long offset, long length) throws IOException {
  synchronized(infoLock) {
    assert (locatedBlocks != null) : "locatedBlocks is null";
    List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();
    // search cached blocks first
    // first search the cached locatedBlocks for the index of the block containing offset
    int blockIdx = locatedBlocks.findBlock(offset);
    if (blockIdx < 0) { // block is not cached
      blockIdx = LocatedBlocks.getInsertIndex(blockIdx);
    }
    long remaining = length;
    long curOff = offset;
    while(remaining > 0) {
      LocatedBlock blk = null;
      if(blockIdx < locatedBlocks.locatedBlockCount())
        // look the block up by blockIdx
        blk = locatedBlocks.get(blockIdx);
      // cache miss (blk == null) or a gap before this block: fetch from the NameNode and add it to the cache
      if (blk == null || curOff < blk.getStartOffset()) {
        LocatedBlocks newBlocks;
        newBlocks = dfsClient.getLocatedBlocks(src, curOff, remaining);
        locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());
        continue;
      }
      assert curOff >= blk.getStartOffset() : "Block not found";
      blockRange.add(blk);
      long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;
      remaining -= bytesRead;
      curOff += bytesRead;
      // move on to the next block
      blockIdx++;
    }
    return blockRange;
  }
}
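
As an illustrative walk-through of that caching logic: suppose the cached locatedBlocks covers only the first three blocks of the file and the caller asks for an offset inside the fifth. findBlock() returns a negative value, getInsertIndex() turns it into the insertion point, dfsClient.getLocatedBlocks(src, curOff, remaining) fetches the missing range from the NameNode, insertRange() splices it into the cache, and the continue statement re-runs the iteration, which now finds the block locally.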

Next, let's look at the fetchBlockByteRange(...) method called from pread(xxx).

private void fetchBlockByteRange(LocatedBlock block, long start, long end,
    byte[] buf, int offset,
    Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
    throws IOException {
    // look up the LocatedBlock object by its start offset
  block = getBlockAt(block.getStartOffset());
  // the familiar while loop, to maximize the chance of fetching the data successfully
  while (true) {
  // choose the nearest DataNode to read from
    DNAddrPair addressPair = chooseDataNode(block, null);
    try {
   // fetch from the chosen DataNode starting at the block's offset; on success, return exits the loop
      actualGetFromOneDataNode(addressPair, block, start, end, buf, offset,
          corruptedBlockMap);
      return;
    } catch (IOException e) {
      // Ignore. Already processed inside the function.
      // Loop through to try the next node.
    }
  }
}

After actualGetFromOneDataNode(...) finishes fetching the data, close() runs (in the finally block) to end the connection.

private void actualGetFromOneDataNode(final DNAddrPair datanode,
    LocatedBlock block, final long start, final long end, byte[] buf,
    int offset, Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
    throws IOException {
  DFSClientFaultInjector.get().startFetchFromDatanode();
  int refetchToken = 1; // only need to get a new access token once
  int refetchEncryptionKey = 1; // only need to get a new encryption key once


  while (true) {
    // cached block locations may have been updated by chooseDataNode()
    // or fetchBlockAt(). Always get the latest list of locations at the
    // start of the loop.
    CachingStrategy curCachingStrategy;
    boolean allowShortCircuitLocalReads;
    block = getBlockAt(block.getStartOffset());
    synchronized(infoLock) {
      curCachingStrategy = cachingStrategy;
      allowShortCircuitLocalReads = !shortCircuitForbidden();
    }
    DatanodeInfo chosenNode = datanode.info;
    InetSocketAddress targetAddr = datanode.addr;
    StorageType storageType = datanode.storageType;
    // the BlockReader, initialized via the builder below
    BlockReader reader = null;


    try {
      DFSClientFaultInjector.get().fetchFromDatanodeException();
      Token<BlockTokenIdentifier> blockToken = block.getBlockToken();
      int len = (int) (end - start + 1);
      // the reader pulls data from the DataNode, setting up a socket connection to it
      reader = new BlockReaderFactory(dfsClient.getConf()).
          setInetSocketAddress(targetAddr).
          setRemotePeerFactory(dfsClient).
          setDatanodeInfo(chosenNode).
          setStorageType(storageType).
          setFileName(src).
          setBlock(block.getBlock()).
          setBlockToken(blockToken).
          setStartOffset(start).
          setVerifyChecksum(verifyChecksum).
          setClientName(dfsClient.clientName).
          setLength(len).
          setCachingStrategy(curCachingStrategy).
          setAllowShortCircuitLocalReads(allowShortCircuitLocalReads).
          setClientCacheContext(dfsClient.getClientContext()).
          setUserGroupInformation(dfsClient.ugi).
          setConfiguration(dfsClient.getConfiguration()).
          build();
       // read the data
      int nread = reader.readAll(buf, offset, len);
      updateReadStatistics(readStatistics, nread, reader);


      if (nread != len) {
        throw new IOException("truncated return from reader.read(): " +
                              "excpected " + len + ", got " + nread);
      }
      DFSClientFaultInjector.get().readFromDatanodeDelay();
      return;
    } catch (ChecksumException e) {
      String msg = "fetchBlockByteRange(). Got a checksum exception for "
          + src + " at " + block.getBlock() + ":" + e.getPos() + " from "
          + chosenNode;
      DFSClient.LOG.warn(msg);
      // we want to remember what we have tried
      
      addIntoCorruptedBlockMap(block.getBlock(), chosenNode, corruptedBlockMap);
      // on read failure, mark this DataNode as a dead node
      addToDeadNodes(chosenNode);
      throw new IOException(msg);
    } catch (IOException e) {
      if (e instanceof InvalidEncryptionKeyException && refetchEncryptionKey > 0) {
        DFSClient.LOG.info("Will fetch a new encryption key and retry, " 
            + "encryption key was invalid when connecting to " + targetAddr
            + " : " + e);
        // The encryption key used is invalid.
        refetchEncryptionKey--;
        dfsClient.clearDataEncryptionKey();
        continue;
      } else if (refetchToken > 0 && tokenRefetchNeeded(e, targetAddr)) {
        refetchToken--;
        try {
          fetchBlockAt(block.getStartOffset());
        } catch (IOException fbae) {
          // ignore IOE, since we can retry it later in a loop
        }
        continue;
      } else {
        String msg = "Failed to connect to " + targetAddr + " for file "
            + src + " for block " + block.getBlock() + ":" + e;
        DFSClient.LOG.warn("Connection failure: " + msg, e);
        addToDeadNodes(chosenNode);
        throw new IOException(msg);
      }
    } finally {
      if (reader != null) {
        reader.close();
      }
    }
  }
}

This article analyzed the HDFS read flow: the HDFS client calls into the two core classes DistributedFileSystem and FSDataInputStream, reaches the NameNode's implementation methods over the NameNode RPC interface to obtain the metadata of the target blocks, then uses that metadata to read from the corresponding DataNodes until the final block is done, at which point the data streams to the DataNodes and the NameNode connection are closed and the read completes. That concludes the dissection of HDFS's core read-path source. Read it together with the previous article, 《Hadoop核心源码剖析(写数据)》, and you may well get more out of both; the finer details are worth digging into on your own.

Summary

This post also concludes the Hadoop HDFS core source code series; essentially all of HDFS's major core functional modules have now been covered.