Let's take directory creation as an example and observe how Hadoop manages file and directory metadata, following the call path from a Java client all the way into the NameNode.


When we use Java to call the Hadoop API and operate on HDFS, we usually start from a FileSystem object:

// create the Configuration object
Configuration conf = new Configuration();
// create the FileSystem object
FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
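
For completeness, here is a minimal, self-contained version of such a client that creates a directory. The cluster URI and target path below are placeholders, not values from this walkthrough:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://nn-host:8020" and "/tmp/demo" are placeholder values
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn-host:8020"), conf);
    try {
      boolean created = fs.mkdirs(new Path("/tmp/demo"));
      System.out.println("mkdirs returned: " + created);
    } finally {
      fs.close();
    }
  }
}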

FileSystem itself is an abstract class. Among its concrete implementations are LocalFileSystem and DistributedFileSystem; the former is a local file system, so the one we need to analyze is DistributedFileSystem. Let's walk through the whole flow of creating a directory, starting from DistributedFileSystem's mkdir:

public boolean mkdir(Path f, FsPermission permission) throws IOException {
  return mkdirsInternal(f, permission, false);
}


Step into the mkdirsInternal method:

private boolean mkdirsInternal(Path f, final FsPermission permission,
      final boolean createParent) throws IOException {
    statistics.incrementWriteOps(1);
    Path absF = fixRelativePart(f);
    return new FileSystemLinkResolver<Boolean>() {
      @Override
      public Boolean doCall(final Path p)
          throws IOException, UnresolvedLinkException {
        // the actual directory-creation call
        return dfs.mkdirs(getPathName(p), permission, createParent);
      }

      @Override
      public Boolean next(final FileSystem fs, final Path p)
          throws IOException {
        // FileSystem doesn't have a non-recursive mkdir() method
        // Best we can do is error out
        if (!createParent) {
          throw new IOException("FileSystem does not support non-recursive"
              + "mkdir");
        }
        return fs.mkdirs(p, permission);
      }
    }.resolve(this, absF);
  }

So it ends up calling dfs.mkdirs(getPathName(p), permission, createParent). Here dfs is a DFSClient object; its class comment tells us enough about what it does:

/********************************************************
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 *
 ********************************************************/
@InterfaceAudience.Private
public class DFSClient implements java.io.Closeable, RemotePeerFactory,
    DataEncryptionKeyFactory {

DFSClient is the low-level implementation we use to talk to the NN and DNs: it is the RPC client for the RPC servers running on the NameNode and DataNodes. When we connect to a Hadoop cluster, the FileSystem we hold (concretely a DistributedFileSystem) holds a DFSClient, and all communication with the NameNode or DataNodes goes through this class. At this point we can draw a picture to represent it:

(figure: the client's FileSystem/DistributedFileSystem wraps a DFSClient, which talks over RPC to the NameNode and DataNodes)

Continuing into DFSClient's mkdirs method:

public boolean mkdirs(String src, FsPermission permission,
    boolean createParent) throws IOException {
  if (permission == null) {
    permission = FsPermission.getDefault();
  }
  FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
  return primitiveMkdir(src, masked, createParent);
}

primitiveMkdir:

public boolean primitiveMkdir(String src, FsPermission absPermission, 
  boolean createParent)
  throws IOException {
  checkOpen();
  if (absPermission == null) {
    absPermission = 
      FsPermission.getDefault().applyUMask(dfsClientConf.uMask);
  } 

  if(LOG.isDebugEnabled()) {
    LOG.debug(src + ": masked=" + absPermission);
  }
  try {
    // namenode is a ClientProtocol object (an RPC client stub); this call invokes the corresponding method on the NameNodeRpcServer side
    return namenode.mkdirs(src, absPermission, createParent);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   InvalidPathException.class,
                                   FileAlreadyExistsException.class,
                                   FileNotFoundException.class,
                                   ParentNotDirectoryException.class,
                                   SafeModeException.class,
                                   NSQuotaExceededException.class,
                                   DSQuotaExceededException.class,
                                   UnresolvedPathException.class,
                                   SnapshotAccessControlException.class);
  }
}

Through this method of the ClientProtocol interface, the RPC layer sends a network request that invokes the corresponding method on the NameNode's RPC server, NameNodeRpcServer.
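
The shape of this client/server contract can be sketched as follows (a simplified illustration; the real ClientProtocol declares many more operations, and MkdirContractSketch is a made-up name):

import java.io.IOException;

import org.apache.hadoop.fs.permission.FsPermission;

// The protocol interface declares the remote operations. On the client side,
// DFSClient's "namenode" field is a proxy implementing the interface; on the
// server side, NameNodeRpcServer provides the actual implementation.
public interface MkdirContractSketch {
  boolean mkdirs(String src, FsPermission masked, boolean createParent)
      throws IOException;
}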

Now to the server side. The RPC server running on the NameNode keeps listening for client requests; when it receives the mkdir RPC sent by the client, it dispatches to the corresponding method. Let's see what happens when NameNodeRpcServer's mkdirs method is invoked:

public boolean mkdirs(String src, FsPermission masked, boolean createParent)
    throws IOException {
  checkNNStartup();
  if(stateChangeLog.isDebugEnabled()) {
    stateChangeLog.debug("*DIR* NameNode.mkdirs: " + src);
  }
  if (!checkPathLength(src)) {
    throw new IOException("mkdirs: Pathname too long.  Limit " 
                          + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels.");
  }
  return namesystem.mkdirs(src,
      new PermissionStatus(getRemoteUser().getShortUserName(),
          null, masked), createParent);
}

As you can see, the RPC server holds a namesystem and delegates to its mkdirs method. That namesystem is the FSNamesystem class we analyzed during NameNode startup; it maintains all of the NameNode's metadata. Step into namesystem.mkdirs:

boolean mkdirs(String src, PermissionStatus permissions,
    boolean createParent) throws IOException, UnresolvedLinkException {
  boolean ret = false;
  try {
    // create the directory
    ret = mkdirsInt(src, permissions, createParent);
  } catch (AccessControlException e) {
    logAuditEvent(false, "mkdirs", src);
    throw e;
  }
  return ret;
}

Step into mkdirsInt:

private boolean mkdirsInt(final String srcArg, PermissionStatus permissions,
    boolean createParent) throws IOException, UnresolvedLinkException {
  String src = srcArg;
  if(NameNode.stateChangeLog.isDebugEnabled()) {
    NameNode.stateChangeLog.debug("DIR* NameSystem.mkdirs: " + src);
  }
  if (!DFSUtil.isValidName(src)) {
    throw new InvalidPathException(src);
  }
  FSPermissionChecker pc = getPermissionChecker();
  checkOperation(OperationCategory.WRITE);
  // split the path on '/' into a 2-D byte array of components
  byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
  HdfsFileStatus resultingStat = null;
  boolean status = false;
  // take the namesystem write lock
  writeLock();
  try {
    // re-check that this NameNode can serve write operations right now
    checkOperation(OperationCategory.WRITE);
    // make sure the NN is not in safe mode
    checkNameNodeSafeMode("Cannot create directory " + src);
    src = resolvePath(src, pathComponents);
    // entry point of the actual directory creation
    status = mkdirsInternal(pc, src, permissions, createParent);
    if (status) {
      resultingStat = getAuditFileInfo(src, false);
    }
  } finally {
    // release the lock
    writeUnlock();
  }
  // sync the edit log to persistent storage
  getEditLog().logSync();
  if (status) {
    logAuditEvent(true, "mkdirs", srcArg, null, resultingStat);
  }
  return status;
}

There are two important lines here:

1. status = mkdirsInternal(pc, src, permissions, createParent);

2. getEditLog().logSync();

Let's analyze the first line first, continuing the directory-creation flow:

private boolean mkdirsInternal(FSPermissionChecker pc, String src, PermissionStatus permissions, boolean createParent)
    throws IOException, UnresolvedLinkException {
  // assert that the current thread holds the write lock
  assert hasWriteLock();
  // permission check
  if (isPermissionEnabled) {
    checkTraverse(pc, src);
  }
  // dir is the FSDirectory holding the in-memory directory tree;
  // isDirMutable returns whether src already exists as a mutable directory
  if (dir.isDirMutable(src)) {
    // all the users of mkdirs() are used to expect 'true' even if a new directory is not created.
    return true;
  }
  if (isPermissionEnabled) {
    checkAncestorAccess(pc, src, FsAction.WRITE);
  }
  if (!createParent) {
    verifyParentDir(src);
  }

  // validate that we have enough inodes. This is, at best, a 
  // heuristic because the mkdirs() operation might need to 
  // create multiple inodes.
  // check that the inode limit is not exceeded (a heuristic check at best)
  checkFsObjectLimit();
  // mkdirsRecursively: actually create the directories
  if (!mkdirsRecursively(src, permissions, false, now())) {
    throw new IOException("Failed to create directory: " + src);
  }
  return true;
}

The most important part here is the mkdirsRecursively method:

//  Create a directory; missing parent directories are created as well
  private boolean mkdirsRecursively(String src, PermissionStatus permissions,
                 boolean inheritPermission, long now)
          throws FileAlreadyExistsException, QuotaExceededException,
                 UnresolvedLinkException, SnapshotAccessControlException,
                 AclException {
    // strip the trailing '/' from the path, if any
    src = FSDirectory.normalizePath(src);
    // split the path into a 2-D byte array of components
    byte[][] components = INode.getPathComponents(src);
    final int lastInodeIndex = components.length - 1;
    // take FSDirectory's write lock
    dir.writeLock();
    try {
      // resolve the components against the existing tree: e.g. when creating /a/a/c/v/c,
      // if /a/a/c already exists, only the trailing directories need to be created
      INodesInPath iip = dir.getExistingPathINodes(components);
      if (iip.isSnapshot()) {
        throw new SnapshotAccessControlException(
                "Modification on RO snapshot is disallowed");
      }
      // the inodes resolved so far; null entries mark components that do not exist yet
      INode[] inodes = iip.getINodes();

      // find the index of the first null in inodes[]
      StringBuilder pathbuilder = new StringBuilder();
      int i = 1;
      // this loop appends the already-existing components to pathbuilder
      for(; i < inodes.length && inodes[i] != null; i++) {
        pathbuilder.append(Path.SEPARATOR).
            append(DFSUtil.bytes2String(components[i]));
        if (!inodes[i].isDirectory()) {
          throw new FileAlreadyExistsException(
                  "Parent path is not a directory: "
                  + pathbuilder + " "+inodes[i].getLocalName());
        }
      }

      // default to creating parent dirs with the given perms
      PermissionStatus parentPermissions = permissions;

      // if not inheriting and it's the last inode, there's no use in
      // computing perms that won't be used
      if (inheritPermission || (i < lastInodeIndex)) {
        // if inheriting (ie. creating a file or symlink), use the parent dir,
        // else the supplied permissions
        // NOTE: the permissions of the auto-created directories violate posix
        FsPermission parentFsPerm = inheritPermission
                ? inodes[i-1].getFsPermission() : permissions.getPermission();

        // ensure that the permissions allow user write+execute
        if (!parentFsPerm.getUserAction().implies(FsAction.WRITE_EXECUTE)) {
          parentFsPerm = new FsPermission(
                  parentFsPerm.getUserAction().or(FsAction.WRITE_EXECUTE),
                  parentFsPerm.getGroupAction(),
                  parentFsPerm.getOtherAction()
          );
        }

        if (!parentPermissions.getPermission().equals(parentFsPerm)) {
          parentPermissions = new PermissionStatus(
                  parentPermissions.getUserName(),
                  parentPermissions.getGroupName(),
                  parentFsPerm
          );
          // when inheriting, use same perms for entire path
          if (inheritPermission) permissions = parentPermissions;
        }
      }
      // this loop walks the remaining (missing) components and creates them
      // create directories beginning from the first null index
      for(; i < inodes.length; i++) {
        pathbuilder.append(Path.SEPARATOR).
            append(DFSUtil.bytes2String(components[i]));
        // the actual directory-creation logic
        dir.unprotectedMkdir(allocateNewInodeId(), iip, i, components[i],
                (i < lastInodeIndex) ? parentPermissions : permissions, null,
                now);
        if (inodes[i] == null) {
          return false;
        }
        // Directory creation also count towards FilesCreated
        // to match count of FilesDeleted metric.
        NameNode.getNameNodeMetrics().incrFilesCreated();

        final String cur = pathbuilder.toString();
        // record the mkdir operation in the edit log
        // (buffered here; synced to disk later by logSync)
        getEditLog().logMkDir(cur, inodes[i]);
        if(NameNode.stateChangeLog.isDebugEnabled()) {
          NameNode.stateChangeLog.debug(
                  "mkdirs: created directory " + cur);
        }
      }
    } finally {
      dir.writeUnlock();
    }
    return true;
  }

Here dir.unprotectedMkdir is invoked; dir is an FSDirectory object, and FSDirectory models the entire file directory tree maintained by the NameNode:

void unprotectedMkdir(long inodeId, INodesInPath inodesInPath,
      int pos, byte[] name, PermissionStatus permission,
      List<AclEntry> aclEntries, long timestamp)
      throws QuotaExceededException, AclException {
    assert hasWriteLock();
    // build an INodeDirectory object representing the new directory
    final INodeDirectory dir = new INodeDirectory(inodeId, name, permission,
        timestamp);
    // attach the INodeDirectory to the directory tree
    if (addChild(inodesInPath, pos, dir, true)) {
      if (aclEntries != null) {
        AclStorage.updateINodeAcl(dir, aclEntries, Snapshot.CURRENT_STATE_ID);
      }
      inodesInPath.setINode(pos, dir);
    }
  }

So the missing nodes are created and attached to the directory tree. Summing up with a picture, the red nodes are the ones newly attached to the tree:

(figure: the directory tree after mkdir, with the newly attached INodes highlighted in red)

After the directory has been created, there is one more call: getEditLog().logMkDir(cur, inodes[i]). getEditLog returns an FSEditLog object, which maintains the log of namespace changes. Every metadata update gets a transaction id; one mkdir, for example, corresponds to one transaction id. In the edits directory (e.g. /editslog) you can see edits log segments and fsimage files in the following format: edits_00000001-00000005 holds the operations with transaction ids 1 through 5, edits_00000006-00000009 holds transaction ids 6 through 9, and so on, segment by segment. edits_inprogress_0000000455 is the segment currently being written; the number in its name is the first transaction id of that segment. The fsimage file name records the transaction id up to which edits have been merged into the image.

(figure: listing of edits_* segment files and fsimage files in the edits directory)
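
As an illustration only (hypothetical transaction ids, following the naming convention described above), such a directory might list:

edits_0000000000000000001-0000000000000000005
edits_0000000000000000006-0000000000000000009
...
edits_inprogress_0000000000000000455
fsimage_0000000000000000450
fsimage_0000000000000000450.md5
seen_txid
VERSION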

FSEditLog's logMkDir method is then called to record a directory-creation entry in the edit log:

public void logMkDir(String path, INode newNode) {
  PermissionStatus permissions = newNode.getPermissionStatus();
  // builder pattern
  MkdirOp op = MkdirOp.getInstance(cache.get())
    .reset()
    .setInodeId(newNode.getId())
    .setPath(path)
    .setTimestamp(newNode.getModificationTime())
    .setPermissionStatus(permissions);

  AclFeature f = newNode.getAclFeature();
  if (f != null) {
    op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
  }

  XAttrFeature x = newNode.getXAttrFeature();
  if (x != null) {
    op.setXAttrs(x.getXAttrs());
  }
  // log the MkdirOp
  logEdit(op);
}

Step into logEdit:

void logEdit(final FSEditLogOp op) {
    // main edit-log write path; FSEditLog is globally unique, so concurrent writes to the edit log are serialized here
    synchronized (this) {
      assert isOpenForWrite() :
        "bad state: " + state;
      
      // wait if an automatic sync is scheduled
      waitIfAutoSyncScheduled();
      // allocate a unique transaction id
      long start = beginTransaction();
      op.setTransactionId(txid);

      try {
        // write the op out to the edit log streams:
        // the edit log is written to local disk files and to the JournalNodes;
        // the standby NN tails the edit log from the JournalNodes
        editLogStream.write(op);
      } catch (IOException ex) {
        // All journals failed, it is handled in logSync.
      }
      // end the current transaction
      endTransaction(start);
      
      // check if it is time to schedule an automatic sync
      if (!shouldForceSync()) {
        return;
      }
      isAutoSyncScheduled = true;
    }
    
    // sync buffered edit log entries to persistent store
    // sync the edit log to disk: ops first go into an in-memory buffer,
    // then the buffered edits are flushed to persistent storage in one go
    logSync();
  }

editLogStream.write(op); does two things: it writes to the local disk file stream and to the JournalNode stream (both buffer in memory first). In the configured JournalNode directories you can later see all the edit logs, because every write to the local edits is mirrored to the JournalNodes. First, look at long start = beginTransaction();

private long beginTransaction() {
  // the current thread must hold the FSEditLog lock
  assert Thread.holdsLock(this);
  // get a new transactionId
  // increment txid; every new modification bumps this global counter
  txid++;

  //
  // record the transactionId when new data is written to the edits log:
  // myTransactionId is a ThreadLocal<TransactionId>; each thread keeps its own
  // TransactionId, set here to the current global txid, so other threads can
  // never modify this thread's copy
  TransactionId id = myTransactionId.get();
  id.txid = txid;
  return now();
}

This mainly assigns a transaction id to the current thread. Note the ThreadLocal usage: each thread keeps its own TransactionId and reads it back from its own thread, so a modification by another thread can never touch this thread's thread-local value. Since txid is only incremented inside a synchronized block, only one thread at a time increments it, and copying txid into the thread-local TransactionId is likewise thread-safe. With beginTransaction() done and the transaction id generated for the thread, back in logEdit, op.setTransactionId(txid); stores the freshly incremented txid into the FSEditLogOp.
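
The per-thread transaction-id pattern can be sketched like this (a simplified model; TxnIdSketch, begin and myLastTxid are illustrative names, not the Hadoop ones):

public class TxnIdSketch {
  static class TxnId {
    long txid = -1;
  }

  // each thread sees its own TxnId instance
  private static final ThreadLocal<TxnId> myTxnId =
      ThreadLocal.withInitial(TxnId::new);

  private static long globalTxid = 0;

  // mirrors beginTransaction(): bump the global counter under the lock
  // and remember the new value in the calling thread's own TxnId
  static synchronized long begin() {
    globalTxid++;
    myTxnId.get().txid = globalTxid;
    return globalTxid;
  }

  // mirrors what logSync() does: each thread reads back only its own txid,
  // unaffected by increments performed for other threads
  static long myLastTxid() {
    return myTxnId.get().txid;
  }
}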

Now for editLogStream.write(op);, which writes the record into the in-memory buffers headed for the local files and the JournalNode cluster. editLogStream is of type EditLogOutputStream, an abstract class (public abstract class EditLogOutputStream implements Closeable), and its write method is abstract as well: abstract public void write(FSEditLogOp op) throws IOException;. So the concrete write logic lives in a subclass, and we need to find which subclass is actually used:

editLogStream = journalSet.startLogSegment(segmentTxId,
    NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);

In this snippet, journalSet is a JournalSet, representing a collection of journals; its startLogSegment method returns a JournalSetOutputStream.

public EditLogOutputStream startLogSegment(final long txId,
    final int layoutVersion) throws IOException {
  mapJournalsAndReportErrors(new JournalClosure() {
    @Override
    public void apply(JournalAndStream jas) throws IOException {
      jas.startLogSegment(txId, layoutVersion);
    }
  }, "starting log segment " + txId);
  return new JournalSetOutputStream();
}

This method opens a new edits log segment. So where is it called from? A search shows that FSImage calls it:

/**
 * Save the contents of the FS image to a new image file in each of the
 * current storage directories.
 */
public synchronized void saveNamespace(FSNamesystem source, NameNodeFile nnf,
    Canceler canceler) throws IOException {
  assert editLog != null : "editLog must be initialized";
  LOG.info("Save namespace ...");
  storage.attemptRestoreRemovedStorage();

  boolean editLogWasOpen = editLog.isSegmentOpen();
  
  if (editLogWasOpen) {
    editLog.endCurrentLogSegment(true);
  }
  long imageTxId = getLastAppliedOrWrittenTxId();
  try {
    saveFSImageInAllDirs(source, nnf, imageTxId, canceler);
    storage.writeAll();
  } finally {
    if (editLogWasOpen) {
      editLog.startLogSegment(imageTxId + 1, true);
      // Take this opportunity to note the current transaction.
      // Even if the namespace save was cancelled, this marker
      // is only used to determine what transaction ID is required
      // for startup. So, it doesn't hurt to update it unnecessarily.
      storage.writeTransactionIdFileToStorage(imageTxId + 1);
    }
  }
}

As mentioned in the earlier article, when the NameNode starts up, one of the steps merges fsimage and edits into a new fsimage and opens a new edits segment. In FSNamesystem's loadFSImage method:

if (needToSave) {
  // write the merged fsimage to each of the configured storage directories
  fsImage.saveNamespace(this);
} else {
  updateStorageVersionForRollingUpgrade(fsImage.getLayoutVersion(),
      startOpt);
  // No need to save, so mark the phase done.
  StartupProgress prog = NameNode.getStartupProgress();
  prog.beginPhase(Phase.SAVING_CHECKPOINT);
  prog.endPhase(Phase.SAVING_CHECKPOINT);
}

The fsImage.saveNamespace(this) call here is the saveNamespace method shown above; note the call in its finally block:


editLog.startLogSegment(imageTxId + 1, true); calls startLogSegment and triggers the initialization of the edit log stream:

synchronized void startLogSegment(final long segmentTxId,
    boolean writeHeaderTxn) throws IOException {
  LOG.info("Starting log segment at " + segmentTxId);
  Preconditions.checkArgument(segmentTxId > 0,
      "Bad txid: %s", segmentTxId);
  Preconditions.checkState(state == State.BETWEEN_LOG_SEGMENTS,
      "Bad state: %s", state);
  Preconditions.checkState(segmentTxId > curSegmentTxId,
      "Cannot start writing to log segment " + segmentTxId +
      " when previous log segment started at " + curSegmentTxId);
  Preconditions.checkArgument(segmentTxId == txid + 1,
      "Cannot start log segment at txid %s when next expected " +
      "txid is %s", segmentTxId, txid + 1);
  
  numTransactions = totalTimeTransactions = numTransactionsBatchedInSync = 0;

  // TODO no need to link this back to storage anymore!
  // See HDFS-2174.
  storage.attemptRestoreRemovedStorage();
  
  try {
    // returns a JournalSetOutputStream instance, a subclass of EditLogOutputStream
    editLogStream = journalSet.startLogSegment(segmentTxId,
        NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);
  } catch (IOException ex) {
    throw new IOException("Unable to start log segment " +
        segmentTxId + ": too few journals successfully started.", ex);
  }
  
  curSegmentTxId = segmentTxId;
  state = State.IN_SEGMENT;

  if (writeHeaderTxn) {
    logEdit(LogSegmentOp.getInstance(cache.get(),
        FSEditLogOpCodes.OP_START_LOG_SEGMENT));
    logSync();
  }
}

So during startup, after fsimage and edits are merged, the NameNode initializes editLogStream. Look at this line again:

editLogStream = journalSet.startLogSegment(segmentTxId,
    NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);

It calls journalSet's startLogSegment method. Now look at how journalSet itself is initialized:

private synchronized void initJournals(List<URI> dirs) {
  int minimumRedundantJournals = conf.getInt(
      DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_KEY,
      DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_DEFAULT);

  synchronized(journalSetLock) {
    // create the JournalSet
    journalSet = new JournalSet(minimumRedundantJournals);

    for (URI u : dirs) {
      boolean required = FSNamesystem.getRequiredNamespaceEditsDirs(conf)
          .contains(u);
      // LOCAL_URI_SCHEME is "file";
      // this branch handles URIs on the local file system
      if (u.getScheme().equals(NNStorage.LOCAL_URI_SCHEME)) {
        StorageDirectory sd = storage.getStorageDirectory(u);
        if (sd != null) {
          // create a FileJournalManager, dedicated to writing the edit log to local disk
          journalSet.add(new FileJournalManager(conf, sd, storage),
              required, sharedEditsDirs.contains(u));
        }
      } else {
        // for non-local URIs, createJournal(u) builds the matching JournalManager;
        // for qjournal:// URIs this is the QuorumJournalManager, which manages the JournalNodes
        journalSet.add(createJournal(u), required,
            sharedEditsDirs.contains(u));
      }
    }
  }

  if (journalSet.isEmpty()) {
    LOG.error("No edits directories configured!");
  } 
}

So journalSet maintains the list of local directories and JournalNodes, driven by the configured edits directories (dfs.namenode.edits.dir and, for HA, dfs.namenode.shared.edits.dir).
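
As a concrete illustration of how the configured URIs drive this branch (host names and paths here are made up; the configuration keys themselves are the standard ones):

import org.apache.hadoop.conf.Configuration;

public class EditsDirsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // a file:// URI ends up as a FileJournalManager (local disk)
    conf.set("dfs.namenode.edits.dir", "file:///data/nn/edits");
    // a qjournal:// URI ends up as a QuorumJournalManager (JournalNode cluster)
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
  }
}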

The JournalSetOutputStream class defines write, create, flush and the other stream operations. Its write method:

public void write(final FSEditLogOp op)
    throws IOException {
  mapJournalsAndReportErrors(new JournalClosure() {
    @Override
    public void apply(JournalAndStream jas) throws IOException {
      if (jas.isActive()) {
        jas.getCurrentStream().write(op);
      }
    }
  }, "write op");
}

Here an anonymous closure class is passed to mapJournalsAndReportErrors; its purpose is to carry the behavior that differs per call site in its apply method. The JournalClosure interface:

private interface JournalClosure {
  /**
   * The operation on JournalAndStream.
   * @param jas Object on which operations are performed.
   * @throws IOException
   */
  public void apply(JournalAndStream jas) throws IOException;
}

Now mapJournalsAndReportErrors itself: the behavior implemented by the anonymous inner class above is passed in, and inside for (JournalAndStream jas : journals) the method calls the apply we defined on every JournalAndStream in journals (a List<JournalAndStream>):

private void mapJournalsAndReportErrors(
    JournalClosure closure, String status) throws IOException{

  List<JournalAndStream> badJAS = Lists.newLinkedList();
  // journals is a List<JournalAndStream>
  for (JournalAndStream jas : journals) {
    try {
      // for every JournalAndStream in the list, apply is invoked,
      // i.e. the jas.getCurrentStream().write(op) defined above
      closure.apply(jas);
    } catch (Throwable t) {
      if (jas.isRequired()) {
        final String msg = "Error: " + status + " failed for required journal ("
          + jas + ")";
        LOG.fatal(msg, t);
        // If we fail on *any* of the required journals, then we must not
        // continue on any of the other journals. Abort them to ensure that
        // retry behavior doesn't allow them to keep going in any way.
        abortAllJournals();
        // the current policy is to shutdown the NN on errors to shared edits
        // dir. There are many code paths to shared edits failures - syncs,
        // roll of edits etc. All of them go through this common function 
        // where the isRequired() check is made. Applying exit policy here 
        // to catch all code paths.
        terminate(1, msg);
      } else {
        LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
        badJAS.add(jas);          
      }
    }
  }
  disableAndReportErrorOnJournals(badJAS);
  if (!NameNodeResourcePolicy.areResourcesAvailable(journals,
      minimumRedundantJournals)) {
    String message = status + " failed for too many journals";
    LOG.error("Error: " + message);
    throw new IOException(message);
  }
}

The meaning here: journals is a copy-on-write list maintained by the JournalSet class, List<JournalAndStream> journals = new CopyOnWriteArrayList<JournalSet.JournalAndStream>();. It holds all the streams that must be written, and when mapJournalsAndReportErrors runs, the data is written to each of these streams in turn, so the local file streams and the JournalNode streams are all written together.

The EditLogOutputStream mentioned earlier can actually wrap multiple streams. At initialization there is a JournalSet containing a FileJournalManager (responsible for writing to local disk) and a QuorumJournalManager (responsible for writing to the JournalNodes); an EditLogOutputStream is built on top of this JournalSet, and underneath it wraps several streams. When write() is called, it iterates over all of these streams and writes to each in turn, and each stream first writes into an in-memory buffer. After the buffers have been written, a separate method flushes the buffered data to disk, or over the network to the JournalNodes.
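
This "one stream fanning out to many streams" idea can be sketched as follows (illustrative types only, not the Hadoop classes):

import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// stand-in for EditLogOutputStream: write buffers in memory, flush persists
interface OpStream {
  void write(byte[] op) throws IOException;
  void flush() throws IOException;
}

// stand-in for JournalSetOutputStream: one write fans out to every target
class FanOutStream implements OpStream {
  private final List<OpStream> targets = new CopyOnWriteArrayList<>();

  void add(OpStream s) {
    targets.add(s);
  }

  @Override
  public void write(byte[] op) throws IOException {
    for (OpStream s : targets) {
      s.write(op); // each target first buffers the op in memory
    }
  }

  @Override
  public void flush() throws IOException {
    for (OpStream s : targets) {
      s.flush(); // later, buffers are pushed to disk or to the JournalNodes
    }
  }
}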

In the code above, the stream held by jas may be a local edit log file stream or a JournalNode stream: jas.getCurrentStream().write(op);

If it is the local edit log file stream, i.e. EditLogFileOutputStream:

public void write(FSEditLogOp op) throws IOException {
  doubleBuf.writeOp(op);
}

then the call goes through the double-buffering mechanism of the EditsDoubleBuffer class:

// EditsDoubleBuffer.writeOp: append to the current buffer
public void writeOp(FSEditLogOp op) throws IOException {
  bufCurrent.writeOp(op);
}

// TxnBuffer.writeOp: track the first txid in this buffer, then serialize the op
public void writeOp(FSEditLogOp op) throws IOException {
  if (firstTxId == HdfsConstants.INVALID_TXID) {
    firstTxId = op.txid;
  } else {
    assert op.txid > firstTxId;
  }
  writer.writeOp(op);
  numTxns++;
}

A word about the EditsDoubleBuffer: one buffer receives new writes while the other is being flushed to disk. Each time the flush-side buffer has been written out, the two buffers are swapped. This lets edits keep streaming into the in-memory buffer while earlier edits are being written to the network and to disk. Once editLogStream.write(op); has completed, logSync(); is executed:
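
The double-buffer idea can be sketched like this (a simplified model, not the real EditsDoubleBuffer):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class DoubleBufferSketch {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream(); // receives writes
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();   // being flushed

  // writers append to bufCurrent (in FSEditLog this happens under the lock)
  synchronized void write(byte[] op) {
    bufCurrent.write(op, 0, op.length);
  }

  // swap: what was being written becomes ready to flush, and vice versa
  synchronized void setReadyToFlush() {
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

  // flush bufReady while writers keep appending to bufCurrent
  void flushTo(OutputStream out) throws IOException {
    bufReady.writeTo(out);
    bufReady.reset();
  }
}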

/**
 * Sync all modifications done by this thread.
 *
 * The internal concurrency design of this class is as follows:
 *   - Log items are written synchronized into an in-memory buffer,
 *     and each assigned a transaction ID.
 *   - When a thread (client) would like to sync all of its edits, logSync()
 *     uses a ThreadLocal transaction ID to determine what edit number must
 *     be synced to.
 *   - The isSyncRunning volatile boolean tracks whether a sync is currently
 *     under progress.
 *
 * The data is double-buffered within each edit log implementation so that
 * in-memory writing can occur in parallel with the on-disk writing.
 *
 * Each sync occurs in three steps:
 *   1. synchronized, it swaps the double buffer and sets the isSyncRunning
 *      flag.
 *   2. unsynchronized, it flushes the data to storage
 *   3. synchronized, it resets the flag and notifies anyone waiting on the
 *      sync.
 *
 * The lack of synchronization on step 2 allows other threads to continue
 * to write into the memory buffer while the sync is in progress.
 * Because this step is unsynchronized, actions that need to avoid
 * concurrency with sync() should be synchronized and also call
 * waitForSyncToFinish() before assuming they are running alone.
 */
public void logSync() {
  long syncStart = 0;

  // Fetch the transactionId of this thread. 
  long mytxid = myTransactionId.get().txid;
  
  boolean sync = false;
  try {
    EditLogOutputStream logStream = null;
    synchronized (this) {
      try {
        printStatistics(false);

        // if somebody is already syncing, then wait
        // every thread that wrote edits tries to sync the buffered data to disk;
        // only one thread at a time may do the buffer-to-disk work, and
        // isSyncRunning marks a sync in progress.
        // mytxid > synctxid means this thread's edits are not yet synced, so wait;
        // conversely, if mytxid <= synctxid, a later thread has already synced this
        // thread's buffered edits (see if (mytxid <= synctxid) below) and there is
        // nothing left to do
        while (mytxid > synctxid && isSyncRunning) {
          try {
            wait(1000);
          } catch (InterruptedException ie) {
          }
        }

        //
        // If this transaction was already flushed, then nothing to do
        // a thread with a larger txid has already flushed this thread's buffered edits
        if (mytxid <= synctxid) {
          // count this transaction as batched into another thread's sync
          numTransactionsBatchedInSync++;
          if (metrics != null) {
            // Metrics is non-null only when used inside name node
            metrics.incrTransactionsBatchedInSync();
          }
          // nothing more to do, just return
          return;
        }
        // reaching here means mytxid > synctxid and isSyncRunning == false
        // now, this thread will do the sync
        syncStart = txid;
        isSyncRunning = true;
        sync = true;

        // swap buffers
        try {
          if (journalSet.isEmpty()) {
            throw new IOException("No journals available to flush");
          }
          // swap the double buffer
          editLogStream.setReadyToFlush();
        } catch (IOException e) {
          final String msg =
              "Could not sync enough journals to persistent storage " +
              "due to " + e.getMessage() + ". " +
              "Unsynced transactions: " + (txid - synctxid);
          LOG.fatal(msg, new Exception());
          synchronized(journalSetLock) {
            IOUtils.cleanup(LOG, journalSet);
          }
          terminate(1, msg);
        }
      } finally {
        // Prevent RuntimeException from blocking other log edit write 
        doneWithAutoSyncScheduling();
      }
      //editLogStream may become null,
      //so store a local variable for flush.
      logStream = editLogStream;
    }
    
    // do the sync
    long start = now();
    try {
      if (logStream != null) {
        // do the flush; only the thread that set isSyncRunning gets here, and it runs outside the lock
        logStream.flush();
      }
    } catch (IOException ex) {
      synchronized (this) {
        final String msg =
            "Could not sync enough journals to persistent storage. "
            + "Unsynced transactions: " + (txid - synctxid);
        LOG.fatal(msg, new Exception());
        synchronized(journalSetLock) {
          IOUtils.cleanup(LOG, journalSet);
        }
        terminate(1, msg);
      }
    }
    long elapsed = now() - start;

    if (metrics != null) { // Metrics non-null only when used inside name node
      metrics.addSync(elapsed);
    }
    
  } finally {
    // Prevent RuntimeException from blocking other log edit sync 
    synchronized (this) {
      if (sync) {
        // record the txid up to which edits are now synced, clear the flag,
        // and notify waiting threads
        synctxid = syncStart;
        isSyncRunning = false;
      }
      this.notifyAll();
   }
  }
}

Here, when a thread with a smaller txid arrives it returns immediately, because a thread with a larger txid will already have flushed all data up to and including it. As for editLogStream.setReadyToFlush();, the double-buffer swap:

public void setReadyToFlush() {
  assert isFlushed() : "previous data not flushed yet";
  TxnBuffer tmp = bufReady;
  bufReady = bufCurrent;
  bufCurrent = tmp;
}

That was the logic for writing the edits log to disk; now for the JournalNode side, implemented in QuorumJournalManager.

First, the class has a loggers field:

/**
 * Wraps a set of AsyncLoggers; on flush, each AsyncLogger sends the edits to one JournalNode.
 * A quorum algorithm is applied on top: as long as a majority of the JournalNodes write successfully, the write succeeds.
 */
private final AsyncLoggerSet loggers;

The QuorumOutputStream is obtained from the following method:

public EditLogOutputStream startLogSegment(long txId, int layoutVersion)
    throws IOException {
  Preconditions.checkState(isActiveWriter,
      "must recover segments before starting a new one");
  QuorumCall<AsyncLogger, Void> q = loggers.startLogSegment(txId,
      layoutVersion);
  loggers.waitForWriteQuorum(q, startSegmentTimeoutMs,
      "startLogSegment(" + txId + ")");
  // the stream that writes to the JournalNodes, also fed through a double buffer
  return new QuorumOutputStream(loggers, txId,
      outputBufferCapacity, writeTxnsTimeoutMs);
}

The code that actually pushes the edits to the JournalNodes is the flushAndSync method below:

@Override
protected void flushAndSync(boolean durable) throws IOException {
  int numReadyBytes = buf.countReadyBytes();
  if (numReadyBytes > 0) {
    int numReadyTxns = buf.countReadyTxns();
    long firstTxToFlush = buf.getFirstReadyTxId();

    assert numReadyTxns > 0;

    // Copy from our double-buffer into a new byte array. This is for
    // two reasons:
    // 1) The IPC code has no way of specifying to send only a slice of
    //    a larger array.
    // 2) because the calls to the underlying nodes are asynchronous, we
    //    need a defensive copy to avoid accidentally mutating the buffer
    //    before it is sent.
    DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
    buf.flushTo(bufToSend);
    assert bufToSend.getLength() == numReadyBytes;
    byte[] data = bufToSend.getData();
    assert data.length == bufToSend.getLength();
    // each AsyncLogger sends the RPC asynchronously on its own thread pool;
    // the returned QuorumCall collects the results
    QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
        segmentTxId, firstTxToFlush,
        numReadyTxns, data);
    // wait until a majority of the JournalNodes have acknowledged; a majority counts as success
    loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
    
    // Since we successfully wrote this batch, let the loggers know. Any future
    // RPCs will thus let the loggers know of the most recent transaction, even
    // if a logger has fallen behind.
    loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
  }
}

The buffered data is first copied into a new byte array, both because the IPC layer cannot send a slice of a larger array and because the calls are asynchronous, so a defensive copy prevents the data from being mutated while it is being sent.

public QuorumCall<AsyncLogger, Void> sendEdits(
    long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
  Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
  for (AsyncLogger logger : loggers) {
    // each AsyncLogger corresponds to one JournalNode and sends the data to it
    ListenableFuture<Void> future = 
      logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
    calls.put(logger, future);
  }
  return QuorumCall.create(calls);
}
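
The majority-quorum wait used by waitForWriteQuorum can be sketched as follows (a simplified polling model; the real QuorumCall also enforces timeouts and blocks on notifications rather than polling):

import java.io.IOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class QuorumWaitSketch {
  // wait until a majority of the async calls succeed, or fail once a
  // majority can no longer be reached
  static void waitForMajority(List<CompletableFuture<Void>> calls)
      throws IOException, InterruptedException {
    int n = calls.size();
    int needed = n / 2 + 1; // majority
    while (true) {
      int succeeded = 0;
      int failed = 0;
      for (CompletableFuture<Void> f : calls) {
        if (f.isDone()) {
          if (f.isCompletedExceptionally()) {
            failed++;
          } else {
            succeeded++;
          }
        }
      }
      if (succeeded >= needed) {
        return; // quorum reached
      }
      if (failed > n - needed) {
        throw new IOException("too many journals failed: " + failed);
      }
      Thread.sleep(10); // simple polling for the sake of the sketch
    }
  }
}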

QuorumOutputStream is the class used to send data to the JournalNodes. It holds an AsyncLoggerSet, which in turn holds a set of AsyncLoggers, one per JournalNode. Edits are sent asynchronously through each AsyncLogger, and once a majority of the AsyncLoggers in the AsyncLoggerSet have succeeded, the write to the JournalNodes is considered successful. AsyncLogger is an interface; its implementation is the IPCLoggerChannel class, so let's look at that class's sendEdits method:

public ListenableFuture<Void> sendEdits(
    final long segmentTxId, final long firstTxnId,
    final int numTxns, final byte[] data) {
  try {
    reserveQueueSpace(data.length);
  } catch (LoggerTooFarBehindException e) {
    return Futures.immediateFailedFuture(e);
  }
  
  // When this batch is acked, we use its submission time in order
  // to calculate how far we are lagging.
  final long submitNanos = System.nanoTime();
  
  ListenableFuture<Void> ret = null;
  try {
    // submit to this logger's single-threaded executor
    ret = singleThreadExecutor.submit(new Callable<Void>() {
      @Override
      public Void call() throws IOException {
        throwIfOutOfSync();

        long rpcSendTimeNanos = System.nanoTime();
        try {
          // send the data: getProxy() returns a QJournalProtocol RPC stub,
          // and data holds the bytes copied from the buffer
          getProxy().journal(createReqInfo(),
              segmentTxId, firstTxnId, numTxns, data);
        } catch (IOException e) {
          QuorumJournalManager.LOG.warn(
              "Remote journal " + IPCLoggerChannel.this + " failed to " +
              "write txns " + firstTxnId + "-" + (firstTxnId + numTxns - 1) +
              ". Will try to write to this JN again after the next " +
              "log roll.", e); 
          synchronized (IPCLoggerChannel.this) {
            outOfSync = true;
          }
          throw e;
        } finally {
          long now = System.nanoTime();
          long rpcTime = TimeUnit.MICROSECONDS.convert(
              now - rpcSendTimeNanos, TimeUnit.NANOSECONDS);
          long endToEndTime = TimeUnit.MICROSECONDS.convert(
              now - submitNanos, TimeUnit.NANOSECONDS);
          metrics.addWriteEndToEndLatency(endToEndTime);
          metrics.addWriteRpcLatency(rpcTime);
          if (rpcTime / 1000 > WARN_JOURNAL_MILLIS_THRESHOLD) {
            QuorumJournalManager.LOG.warn(
                "Took " + (rpcTime / 1000) + "ms to send a batch of " +
                numTxns + " edits (" + data.length + " bytes) to " +
                "remote journal " + IPCLoggerChannel.this);
          }
        }
        synchronized (IPCLoggerChannel.this) {
          highestAckedTxId = firstTxnId + numTxns - 1;
          lastAckNanos = submitNanos;
        }
        return null;
      }
    });
  } finally {
    if (ret == null) {
      // it didn't successfully get submitted,
      // so adjust the queue size back down.
      unreserveQueueSpace(data.length);
    } else {
      // It was submitted to the queue, so adjust the length
      // once the call completes, regardless of whether it
      // succeeds or fails.
      Futures.addCallback(ret, new FutureCallback<Void>() {
        @Override
        public void onFailure(Throwable t) {
          unreserveQueueSpace(data.length);
        }

        @Override
        public void onSuccess(Void t) {
          unreserveQueueSpace(data.length);
        }
      });
    }
  }
  return ret;
}

The most important call here is getProxy().journal(createReqInfo(), segmentTxId, firstTxnId, numTxns, data);. getProxy() returns a QJournalProtocol interface, which is exactly how this logger inside the NameNode talks to a JournalNode; the data is shipped to the JournalNode over this RPC interface.

When the NameNode sends data to the JournalNodes, the NameNode acts as the RPC client and the JournalNode as the server. So how does the JournalNode handle the request? In the JournalNodeRpcServer class, matching the method name invoked by the client, we find the journal method:

@Override
public void journal(RequestInfo reqInfo,
    long segmentTxId, long firstTxnId,
    int numTxns, byte[] records) throws IOException {
  jn.getOrCreateJournal(reqInfo.getJournalId())
     .journal(reqInfo, segmentTxId, firstTxnId, numTxns, records);
}

jn.getOrCreateJournal(reqInfo.getJournalId()) returns a Journal object, whose journal method is then invoked:

synchronized void journal(RequestInfo reqInfo,
    long segmentTxId, long firstTxnId,
    int numTxns, byte[] records) throws IOException {
  checkFormatted();
  checkWriteRequest(reqInfo);

  checkSync(curSegment != null,
      "Can't write, no segment open");
  
  if (curSegmentTxId != segmentTxId) {
    // Sanity check: it is possible that the writer will fail IPCs
    // on both the finalize() and then the start() of the next segment.
    // This could cause us to continue writing to an old segment
    // instead of rolling to a new one, which breaks one of the
    // invariants in the design. If it happens, abort the segment
    // and throw an exception.
    JournalOutOfSyncException e = new JournalOutOfSyncException(
        "Writer out of sync: it thinks it is writing segment " + segmentTxId
        + " but current segment is " + curSegmentTxId);
    abortCurSegment();
    throw e;
  }
    
  checkSync(nextTxId == firstTxnId,
      "Can't write txid " + firstTxnId + " expecting nextTxId=" + nextTxId);
  
  long lastTxnId = firstTxnId + numTxns - 1;
  if (LOG.isTraceEnabled()) {
    LOG.trace("Writing txid " + firstTxnId + "-" + lastTxnId);
  }

  // If the edit has already been marked as committed, we know
  // it has been fsynced on a quorum of other nodes, and we are
  // "catching up" with the rest. Hence we do not need to fsync.
  boolean isLagging = lastTxnId <= committedTxnId.get();
  boolean shouldFsync = !isLagging;
  
  curSegment.writeRaw(records, 0, records.length);
  curSegment.setReadyToFlush();
  Stopwatch sw = new Stopwatch();
  sw.start();
  curSegment.flush(shouldFsync);
  sw.stop();
  
  metrics.addSync(sw.elapsedTime(TimeUnit.MICROSECONDS));
  if (sw.elapsedTime(TimeUnit.MILLISECONDS) > WARN_SYNC_MILLIS_THRESHOLD) {
    LOG.warn("Sync of transaction range " + firstTxnId + "-" + lastTxnId +
             " took " + sw.elapsedTime(TimeUnit.MILLISECONDS) + "ms");
  }

  if (isLagging) {
    // This batch of edits has already been committed on a quorum of other
    // nodes. So, we are in "catch up" mode. This gets its own metric.
    metrics.batchesWrittenWhileLagging.incr(1);
  }
  
  metrics.batchesWritten.incr(1);
  metrics.bytesWritten.incr(records.length);
  metrics.txnsWritten.incr(numTxns);
  
  highestWrittenTxId = lastTxnId;
  nextTxId = lastTxnId + 1;
}

On the JournalNode side, the data received from the NameNode is written and flushed to disk.

Finally, a summary. When a client issues a request, the client-side DFSClient calls the remote NameNode's RPC server. The RPC server holds an FSNamesystem, which maintains two classes: FSEditLog and FSDirectory. FSEditLog is the operation log; FSDirectory is the in-memory file directory tree. When the RPC server receives a request, it first attaches a node to the FSDirectory tree and then writes the operation to FSEditLog, which writes it to two places: the local disk files and the remote JournalNodes.