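The source code referenced here is hadoop-3.3.0. This article explains the main flow only; corrections are welcome.
1 Overview of Hadoop leases
This article continues from the previous one. While the INodeFile is being created, the NameNode also adds a lease for that file. This happens in FSDirWriteFileOp.startFile: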
// leaseManager is an instance of LeaseManager
fsn.leaseManager.addLease(
    newNode.getFileUnderConstructionFeature().getClientName(),
    newNode.getId());
Later, in DFSClient#create, once the DFSOutputStream object has been created, the client starts the file lease or renews it:
// Start or renew the file lease. This makes an RPC call to NameNodeRpcServer,
// and the renewal is ultimately carried out by LeaseManager
beginFileLease(result.getFileId(), result);
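Before going further, let's look at what a Hadoop lease is.
A lease is a contract in which the NameNode grants the lease holder (LeaseHolder, normally a client) permission on a file (write access) for a limited period of time.
HDFS files are write-once-read-many, and HDFS does not support concurrent writes by multiple clients. To enforce this, HDFS provides a lease mechanism that guarantees mutually exclusive access to a file.
In HDFS, a client that wants to write a file must first obtain a lease from the lease manager (LeaseManager). Once the lease is granted, the client becomes the lease holder and has exclusive write access to the file; while the lease is valid, no other client can open the file for writing. The NameNode's lease manager stores the mapping from files to leases and from leases to lease holders, and it periodically checks whether any lease it maintains has expired. Expired leases are reclaimed forcibly, so the lease holder must renew its lease regularly to keep its exclusive lock on the file. When the client has finished writing and closes the file, the lease must be released in the lease manager.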
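The acquire, renew and release cycle described above can be sketched with a toy lease table. This is only an illustration and not HDFS code: the class ToyLeaseTable, its methods and the single expiry limit are all assumptions made for the example (real HDFS distinguishes a soft and a hard limit and runs block recovery before handing a file over).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-file write leases. Hypothetical; not part of Hadoop. */
class ToyLeaseTable {
  private static final class Entry {
    final String holder;
    long lastRenewedMs;

    Entry(String holder, long nowMs) {
      this.holder = holder;
      this.lastRenewedMs = nowMs;
    }
  }

  private final long limitMs;
  private final Map<String, Entry> byPath = new HashMap<>();

  ToyLeaseTable(long limitMs) {
    this.limitMs = limitMs;
  }

  /** Try to open path for writing; returns true if holder now owns the lease. */
  synchronized boolean acquire(String path, String holder, long nowMs) {
    Entry e = byPath.get(path);
    boolean heldByOther = e != null && !e.holder.equals(holder);
    if (heldByOther && nowMs - e.lastRenewedMs <= limitMs) {
      return false; // another client holds a lease that has not expired
    }
    byPath.put(path, new Entry(holder, nowMs));
    return true;
  }

  /** Refresh the lease timestamp; only the current holder may renew. */
  synchronized boolean renew(String path, String holder, long nowMs) {
    Entry e = byPath.get(path);
    if (e == null || !e.holder.equals(holder)) {
      return false;
    }
    e.lastRenewedMs = nowMs;
    return true;
  }

  /** Release the lease when the file is closed. */
  synchronized boolean release(String path, String holder) {
    Entry e = byPath.get(path);
    if (e == null || !e.holder.equals(holder)) {
      return false;
    }
    byPath.remove(path);
    return true;
  }
}
```

Times are passed in explicitly so the behaviour is easy to follow: a second client is refused while the first one keeps renewing, and succeeds once the lease is released or left to expire.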
2 LeaseManager
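In Hadoop, all leases are managed by LeaseManager.
It stores every lease in HDFS and provides methods to add, remove, update and look up leases. It also runs a Monitor thread that periodically checks whether leases have timed out. For a file whose lease has not been renewed for a long time (longer than the hard limit), LeaseManager triggers lease recovery and then closes the file.
LeaseManager keeps all leases on the NameNode in three fields: leases, sortedLeases and leasesById. The lmthread field holds the lease check thread, the softLimit field holds the soft limit (60 seconds by default, not configurable), and the hardLimit field holds the hard limit (20 minutes by default, configurable).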
2.1 Fields and constructor
The fields mentioned above are declared as follows:
private final FSNamesystem fsnamesystem;
// Soft limit, 60 s by default
private long softLimit = HdfsConstants.LEASE_SOFTLIMIT_PERIOD;
// Hard limit, initialized in the constructor (20 min by default)
private long hardLimit;
static final int INODE_FILTER_WORKER_COUNT_MAX = 4;
static final int INODE_FILTER_WORKER_TASK_MIN = 512;
// Time at which the internal lease holder was last updated
private long lastHolderUpdateTime;
private String internalLeaseHolder;
// Used for handling lock-leases
// Mapping: leaseHolder -> Lease
// Sorted map: the key is the lease holder, the value is the lease
private final SortedMap<String, Lease> leases = new TreeMap<>();
// Set of: Lease, ordered by last update time;
// leases with the same update time are ordered by holder name
private final NavigableSet<Lease> sortedLeases = new TreeSet<>(
    new Comparator<Lease>() {
      @Override
      public int compare(Lease o1, Lease o2) {
        if (o1.getLastUpdate() != o2.getLastUpdate()) {
          return Long.signum(o1.getLastUpdate() - o2.getLastUpdate());
        } else {
          return o1.holder.compareTo(o2.holder);
        }
      }
    });
// INodeID -> Lease
private final TreeMap<Long, Lease> leasesById = new TreeMap<>();
// Lease check thread
private Daemon lmthread;
private volatile boolean shouldRunMonitor;
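To see what the ordering of sortedLeases means in practice, here is a small standalone sketch. OrderedLease and LeaseOrderingDemo are hypothetical stand-ins, not Hadoop classes, and the comparator is written with Comparator.comparingLong instead of the anonymous class above; the resulting order should be the same (oldest update first, ties broken by holder name).

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical stand-in for LeaseManager.Lease, for illustration only. */
class OrderedLease {
  final String holder;
  final long lastUpdate;

  OrderedLease(String holder, long lastUpdate) {
    this.holder = holder;
    this.lastUpdate = lastUpdate;
  }
}

class LeaseOrderingDemo {
  // Oldest update first; equal timestamps fall back to the holder name.
  static final Comparator<OrderedLease> BY_AGE_THEN_HOLDER =
      Comparator.<OrderedLease>comparingLong(l -> l.lastUpdate)
          .thenComparing(l -> l.holder);

  public static void main(String[] args) {
    TreeSet<OrderedLease> sorted = new TreeSet<>(BY_AGE_THEN_HOLDER);
    sorted.add(new OrderedLease("client-b", 100L));
    sorted.add(new OrderedLease("client-a", 100L)); // same timestamp as client-b
    sorted.add(new OrderedLease("client-c", 50L));  // oldest update
    // Prints: 50 client-c, then 100 client-a, then 100 client-b
    for (OrderedLease l : sorted) {
      System.out.println(l.lastUpdate + " " + l.holder);
    }
  }
}
```

Because the holder name is part of the ordering, two leases with the same timestamp are still distinct elements of the set.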
Constructor. It is first called from the FSNamesystem object:
LeaseManager(FSNamesystem fsnamesystem) {
  Configuration conf = new Configuration();
  this.fsnamesystem = fsnamesystem;
  this.hardLimit = conf.getLong(DFSConfigKeys.DFS_LEASE_HARDLIMIT_KEY,
      DFSConfigKeys.DFS_LEASE_HARDLIMIT_DEFAULT) * 1000;
  // Update the internal lease holder and its timestamp
  updateInternalLeaseHolder();
}
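2.2 Adding a lease
When a client creates a file or opens one for append, LeaseManager.addLease() is called to add a lease on that file for the client. addLease() takes two parameters: holder identifies the lease holder, and inodeId is the id of the file. The implementation is simple. It first looks up the holder's lease with getLease(), creating a new one if none exists and renewing it otherwise. It then records the file in leasesById and in the lease's own set of files.
The method is synchronized.
/**
 * Adds (or re-adds) the lease for the specified file.
 */
synchronized Lease addLease(String holder, long inodeId) {
  // Look up the holder's lease
  Lease lease = getLease(holder);
  // If there is no lease yet, create one and add it to sortedLeases and leases;
  // if the lease already exists, renew it
  if (lease == null) {
    lease = new Lease(holder);
    leases.put(holder, lease);
    sortedLeases.add(lease);
  } else {
    renewLease(lease);
  }
  leasesById.put(inodeId, lease);
  // Add this file to the set of files covered by the lease
  lease.files.add(inodeId);
  return lease;
}
Besides file creation, addLease() is also called while the fsimage and the editlog are loaded: when inodes from the fsimage are added to the directory tree, and when OP_ADD operations from the editlog are replayed.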
2.3 Checking leases
Lease checks are mainly implemented by FSNamesystem.checkLease():
INodeFile checkLease(INodesInPath iip, String holder, long fileId)
    throws LeaseExpiredException, FileNotFoundException {
  String src = iip.getPath();
  INode inode = iip.getLastINode();
  assert hasReadLock();
  // The HDFS file does not exist: throw
  if (inode == null) {
    throw new FileNotFoundException("File does not exist: "
        + leaseExceptionString(src, fileId, holder));
  }
  // The INode is not a regular file (for example, a directory): throw
  if (!inode.isFile()) {
    throw new LeaseExpiredException("INode is not a regular file: "
        + leaseExceptionString(src, fileId, holder));
  }
  final INodeFile file = inode.asFile();
  // The file is not under construction: throw
  if (!file.isUnderConstruction()) {
    throw new LeaseExpiredException("File is not open for writing: "
        + leaseExceptionString(src, fileId, holder));
  }
  // No further modification is allowed on a deleted file.
  // A file is considered deleted, if it is not in the inodeMap or is marked
  // as deleted in the snapshot feature.
  if (isFileDeleted(file)) {
    throw new FileNotFoundException("File is deleted: "
        + leaseExceptionString(src, fileId, holder));
  }
  // Get the lease holder recorded on the file
  final String owner = file.getFileUnderConstructionFeature().getClientName();
  // If the caller is not the holder recorded on the file, throw
  if (holder != null && !owner.equals(holder)) {
    throw new LeaseExpiredException("Client (=" + holder
        + ") is not the lease owner (=" + owner + ": "
        + leaseExceptionString(src, fileId, holder));
  }
  return file;
}
LeaseManager also contains an inner thread class, Monitor, which checks leases periodically. The interval is 2 s by default and can be changed with the dfs.namenode.lease-recheck-interval-ms parameter. The thread is started from FSNamesystem#startActiveServices, which calls LeaseManager#startMonitor. The thread's run method calls LeaseManager#checkLeases() to perform the check:
/** Check the leases beginning from the oldest.
 *  @return true is sync is needed.
 */
@VisibleForTesting
synchronized boolean checkLeases() {
  boolean needSync = false;
  assert fsnamesystem.hasWriteLock();
  long start = monotonicNow();
  // Walk the leases held by LeaseManager, oldest first.
  // sortedLeases is a sorted set (TreeSet), so the lease with the oldest
  // update time is always the first element and only that one is tested.
  // If it has not passed the hard limit, no later lease has either,
  // and the loop ends.
  while(!sortedLeases.isEmpty() &&
      sortedLeases.first().expiredHardLimit()
      && !isMaxLockHoldToReleaseLease(start)) {
    // Get the oldest lease
    Lease leaseToCheck = sortedLeases.first();
    LOG.info("{} has expired hard limit", leaseToCheck);
    // Inode ids whose path turns out to be invalid; removed from the lease below
    final List<Long> removing = new ArrayList<>();
    // need to create a copy of the oldest lease files, because
    // internalReleaseLease() removes files corresponding to empty files,
    // i.e. it needs to modify the collection being iterated over
    // causing ConcurrentModificationException
    Collection<Long> files = leaseToCheck.getFiles();
    Long[] leaseINodeIds = files.toArray(new Long[files.size()]);
    // Get the FSDirectory (the in-memory namespace tree)
    FSDirectory fsd = fsnamesystem.getFSDirectory();
    String p = null;
    // Get the internal lease holder
    String newHolder = getInternalLeaseHolder();
    // Iterate over all files covered by the lease
    for(Long id : leaseINodeIds) {
      try {
        // Resolve the path of the INodeFile
        INodesInPath iip = INodesInPath.fromINode(fsd.getInode(id));
        p = iip.getPath();
        // Sanity check to make sure the path is correct
        if (!p.startsWith("/")) {
          throw new IOException("Invalid path in the lease " + p);
        }
        final INodeFile lastINode = iip.getLastINode().asFile();
        // Has the file already been deleted?
        if (fsnamesystem.isFileDeleted(lastINode)) {
          // INode referred by the lease could have been deleted.
          removeLease(lastINode.getId());
          continue;
        }
        boolean completed = false;
        try {
          // Call FSNamesystem.internalReleaseLease() to recover the lease
          // on this file, see section 2.6
          completed = fsnamesystem.internalReleaseLease(
              leaseToCheck, p, iip, newHolder);
        } catch (IOException e) {
          LOG.warn("Cannot release the path {} in the lease {}. It will be "
              + "retried.", p, leaseToCheck, e);
          continue;
        }
        if (LOG.isDebugEnabled()) {
          if (completed) {
            LOG.debug("Lease recovery for inode {} is complete. File closed"
                + ".", id);
          } else {
            LOG.debug("Started block recovery {} lease {}", p, leaseToCheck);
          }
        }
        // If a lease recovery happened, we need to sync later.
        if (!needSync && !completed) {
          needSync = true;
        }
      } catch (IOException e) {
        LOG.warn("Removing lease with an invalid path: {},{}", p,
            leaseToCheck, e);
        removing.add(id);
      }
      if (isMaxLockHoldToReleaseLease(start)) {
        LOG.debug("Breaking out of checkLeases after {} ms.",
            fsnamesystem.getMaxLockHoldToReleaseLeaseMs());
        break;
      }
    }
    // Remove the files with invalid paths from the lease
    for(Long id : removing) {
      removeLease(leaseToCheck, id);
    }
  }
  return needSync;
}
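The core idea of checkLeases(), scanning from the oldest lease and stopping at the first one that has not expired, can be isolated in a small sketch. ExpiryScanner and its members are hypothetical names. Two simplifications are assumed: expired entries are simply removed (the real code starts recovery and reassigns the lease instead), and the max-lock-hold time budget is replaced by a per-pass count limit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the oldest-first expiry scan. Hypothetical; not Hadoop code. */
class ExpiryScanner {
  static final class Entry {
    final String holder;
    final long lastUpdateMs;

    Entry(String holder, long lastUpdateMs) {
      this.holder = holder;
      this.lastUpdateMs = lastUpdateMs;
    }
  }

  private final long hardLimitMs;
  private final TreeSet<Entry> sorted = new TreeSet<>(
      Comparator.<Entry>comparingLong(e -> e.lastUpdateMs)
          .thenComparing(e -> e.holder));

  ExpiryScanner(long hardLimitMs) {
    this.hardLimitMs = hardLimitMs;
  }

  void add(String holder, long lastUpdateMs) {
    sorted.add(new Entry(holder, lastUpdateMs));
  }

  /**
   * Removes expired entries starting from the oldest. Stops at the first
   * entry that is still valid, or after maxPerPass entries (a stand-in for
   * the lock-hold budget in the real method).
   */
  List<String> collectExpired(long nowMs, int maxPerPass) {
    List<String> expired = new ArrayList<>();
    while (!sorted.isEmpty()
        && nowMs - sorted.first().lastUpdateMs > hardLimitMs
        && expired.size() < maxPerPass) {
      expired.add(sorted.pollFirst().holder);
    }
    return expired;
  }

  int size() {
    return sorted.size();
  }
}
```

Because the set is ordered by update time, the scan touches only the expired prefix and never walks entries that are still valid.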
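2.4 Renewing a lease
When a client opens a file for writing or for append, LeaseManager stores the client's lease on that file. The client starts a LeaseRenewer that renews the lease periodically so that it does not expire.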
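The client-side timing can be sketched as follows. This is not the real LeaseRenewer: RenewalSchedule and RenewalScheduleDemo are hypothetical, and renewing at half the soft limit is an assumption made for the example so that a single missed renewal does not let the lease pass the soft limit.

```java
/** Hypothetical sketch of a client-side renewal schedule. */
class RenewalSchedule {
  private final long renewalIntervalMs;
  private long lastRenewedMs;

  RenewalSchedule(long softLimitMs, long nowMs) {
    // Assumption: renew once half of the soft limit has elapsed.
    this.renewalIntervalMs = softLimitMs / 2;
    this.lastRenewedMs = nowMs;
  }

  /** Returns true, and records the renewal, when a renew call is due now. */
  boolean tick(long nowMs) {
    if (nowMs - lastRenewedMs >= renewalIntervalMs) {
      lastRenewedMs = nowMs;
      return true;
    }
    return false;
  }
}

class RenewalScheduleDemo {
  /** Simulates one tick per second and counts the renewals that would be sent. */
  static int countRenewals(long softLimitMs, long durationMs) {
    RenewalSchedule schedule = new RenewalSchedule(softLimitMs, 0L);
    int renewals = 0;
    for (long now = 1000L; now <= durationMs; now += 1000L) {
      if (schedule.tick(now)) {
        renewals++;
      }
    }
    return renewals;
  }

  public static void main(String[] args) {
    // 60 s soft limit, 2 minutes of simulated time
    System.out.println(countRenewals(60_000L, 120_000L)); // prints 4
  }
}
```

In a real client, each tick that returns true would correspond to one renewLease RPC to the NameNode.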
Lease renewal is handled by FSNamesystem.renewLease(), which ends up calling LeaseManager.renewLease(). renewLease() first removes the lease from sortedLeases, then updates the lease's last update time, and finally adds it back to sortedLeases. The reason is that sortedLeases is ordered by last update time, so the position of a lease in the set has to change every time the lease is renewed.
/**
 * Renew the lease(s) held by the given client
 * (FSNamesystem#renewLease)
 */
void renewLease(String holder) throws IOException {
  checkOperation(OperationCategory.WRITE);
  readLock();
  try {
    checkOperation(OperationCategory.WRITE);
    checkNameNodeSafeMode("Cannot renew lease for " + holder);
    // Delegate to the lease manager
    leaseManager.renewLease(holder);
  } finally {
    readUnlock("renewLease");
  }
}
/**
 * Renew the lease(s) held by the given client
 */
synchronized void renewLease(String holder) {
  renewLease(getLease(holder));
}
synchronized void renewLease(Lease lease) {
  // Remove the lease, update its timestamp, then add it back to sortedLeases
  if (lease != null) {
    sortedLeases.remove(lease);
    lease.renew();
    sortedLeases.add(lease);
  }
}
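The remove, update, add sequence matters because a TreeSet does not notice when the sort key of an element changes. A minimal sketch, using hypothetical classes MutableLease and RenewOrderDemo rather than the Hadoop ones:

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical lease with a mutable timestamp, for illustration only. */
class MutableLease {
  final String holder;
  long lastUpdate;

  MutableLease(String holder, long lastUpdate) {
    this.holder = holder;
    this.lastUpdate = lastUpdate;
  }
}

class RenewOrderDemo {
  static TreeSet<MutableLease> newSortedSet() {
    return new TreeSet<>(
        Comparator.<MutableLease>comparingLong(l -> l.lastUpdate)
            .thenComparing(l -> l.holder));
  }

  /** Correct renewal: take the lease out, change its sort key, put it back. */
  static void renew(TreeSet<MutableLease> sorted, MutableLease lease, long now) {
    sorted.remove(lease);
    lease.lastUpdate = now;
    sorted.add(lease);
  }

  public static void main(String[] args) {
    TreeSet<MutableLease> sorted = newSortedSet();
    MutableLease a = new MutableLease("a", 1L);
    MutableLease b = new MutableLease("b", 2L);
    sorted.add(a);
    sorted.add(b);

    renew(sorted, a, 10L);
    System.out.println(sorted.first().holder); // prints b: it is now the oldest

    // Wrong: changing the key in place leaves the tree in its old shape.
    b.lastUpdate = 20L;
    System.out.println(sorted.first().holder); // still prints b, although a (10) is older
  }
}
```

After the in-place change the set reports the wrong "oldest" lease, which is exactly what an expiry scan that trusts first() cannot tolerate.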
2.5 Removing a lease
Leases in LeaseManager are removed in two situations.
2.5.1 Closing a file
When the NameNode closes an HDFS file that is under construction, it calls FSNamesystem.finalizeINodeFileUnderConstruction() to move the INode from the under-construction state to the completed state. Since the client has finished writing the file, the file's lease must also be removed from LeaseManager, which is done here by calling removeLease().
void finalizeINodeFileUnderConstruction(String src, INodeFile pendingFile,
    int latestSnapshot, boolean allowCommittedBlock) throws IOException {
  assert hasWriteLock();
  FileUnderConstructionFeature uc = pendingFile.getFileUnderConstructionFeature();
  if (uc == null) {
    throw new IOException("Cannot finalize file " + src
        + " because it is not under construction");
  }
  pendingFile.recordModification(latestSnapshot);
  // The file is no longer pending.
  // Create permanent INode, update blocks. No need to replace the inode here
  // since we just remove the uc feature from pendingFile
  pendingFile.toCompleteFile(now(),
      allowCommittedBlock? numCommittedAllowed: 0,
      blockManager.getMinReplication());
  // Remove the lease held on this file
  leaseManager.removeLease(uc.getClientName(), pendingFile);
  // close file and persist block allocations for this file
  closeFile(src, pendingFile);
  blockManager.checkRedundancy(pendingFile);
}
2.5.2 Deleting from the directory tree
During a delete on the directory tree, if a client removes an HDFS file that is still open from the file system directory tree, FSNamesystem.removeLeasesAndINodes() is called to remove the file's lease from LeaseManager.
/**
 * Remove leases and inodes related to a given path
 * @param removedUCFiles INodes whose leases need to be released
 * @param removedINodes Containing the list of inodes to be removed from
 *                      inodesMap
 * @param acquireINodeMapLock Whether to acquire the lock for inode removal
 */
void removeLeasesAndINodes(List<Long> removedUCFiles,
    List<INode> removedINodes,
    final boolean acquireINodeMapLock) {
  assert hasWriteLock();
  for(long i : removedUCFiles) {
    leaseManager.removeLease(i);
  }
  // remove inodes from inodesMap
  if (removedINodes != null) {
    if (acquireINodeMapLock) {
      dir.writeLock();
    }
    try {
      dir.removeFromInodeMap(removedINodes);
    } finally {
      if (acquireINodeMapLock) {
        dir.writeUnlock();
      }
    }
    removedINodes.clear();
  }
}
2.5.3 LeaseManager#removeLease
There are several overloads. The logic is simple: each one removes the file's lease from the data structures held by the LeaseManager object. Once a lease no longer covers any file, the lease itself is removed from leases and sortedLeases.
synchronized void removeLease(long inodeId) {
  final Lease lease = leasesById.get(inodeId);
  if (lease != null) {
    removeLease(lease, inodeId);
  }
}
/**
 * Remove the specified lease and src.
 */
private synchronized void removeLease(Lease lease, long inodeId) {
  leasesById.remove(inodeId);
  if (!lease.removeFile(inodeId)) {
    LOG.debug("inode {} not found in lease.files (={})", inodeId, lease);
  }
  if (!lease.hasFiles()) {
    leases.remove(lease.holder);
    if (!sortedLeases.remove(lease)) {
      LOG.error("{} not found in sortedLeases", lease);
    }
  }
}
/**
 * Remove the lease for the specified holder and src
 */
synchronized void removeLease(String holder, INodeFile src) {
  Lease lease = getLease(holder);
  if (lease != null) {
    removeLease(lease, src.getId());
  } else {
    LOG.warn("Removing non-existent lease! holder={} src={}", holder, src
        .getFullPathName());
  }
}
synchronized void removeAllLeases() {
  sortedLeases.clear();
  leasesById.clear();
  leases.clear();
}
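The bookkeeping behind addLease() and removeLease() can be tried out with a reduced model. MiniLeaseManager below is a hypothetical sketch, not the Hadoop class: it keeps only the holder and inode indexes and leaves out sortedLeases, timestamps and logging. It shows that one holder has a single lease covering several files, and that the lease disappears only when its last file is removed.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical, simplified model of the indexes kept by LeaseManager. */
class MiniLeaseManager {
  static final class Lease {
    final String holder;
    final Set<Long> files = new HashSet<>();

    Lease(String holder) {
      this.holder = holder;
    }
  }

  private final TreeMap<String, Lease> leases = new TreeMap<>();   // holder -> lease
  private final TreeMap<Long, Lease> leasesById = new TreeMap<>(); // inode id -> lease

  /** Adds the file to the holder's lease, creating the lease if needed. */
  synchronized Lease addLease(String holder, long inodeId) {
    Lease lease = leases.computeIfAbsent(holder, Lease::new);
    leasesById.put(inodeId, lease);
    lease.files.add(inodeId);
    return lease;
  }

  /** Removes one file; drops the whole lease when no file is left. */
  synchronized void removeLease(long inodeId) {
    Lease lease = leasesById.remove(inodeId);
    if (lease == null) {
      return; // no lease recorded for this inode
    }
    lease.files.remove(inodeId);
    if (lease.files.isEmpty()) {
      leases.remove(lease.holder);
    }
  }

  synchronized int countLeases() {
    return leases.size();
  }

  synchronized int countFiles() {
    return leasesById.size();
  }
}
```

For example, after adding inodes 1 and 2 for one client, removing inode 1 leaves the lease in place, and removing inode 2 removes the lease as well.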
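2.6 Lease recovery
Lease recovery for an HDFS file is implemented by FSNamesystem.internalReleaseLease(). The method recovers the lease of a file that is still open and closes the file. It returns true if the file was closed successfully, and false if it only started lease recovery. Lease recovery applies to open files that are under construction, so internalReleaseLease() inspects the state of the file's blocks and throws an exception for states it cannot handle. In the checkLeases() code shown in 2.3, an IOException thrown by internalReleaseLease() is logged and the file is retried on a later pass; only entries whose inode path is invalid are removed with removeLease().
While the file is under construction, there are three cases in which it can be closed directly, returning true.
- All blocks of the file are in the COMPLETE state, which means the client failed before it could close the file and release the lease. In this case internalReleaseLease() calls finalizeINodeFileUnderConstruction() directly to close the file and remove the lease.
- The last block of the file is in the COMMITTED state, and both the last block and the penultimate block (if there is one) meet the minimum replication requirement. In this case finalizeINodeFileUnderConstruction() can be called directly to close the file and remove the lease. If minimum replication is not met yet, an AlreadyBeingCreatedException is thrown so that the release is tried again later.
- The last block of the file is under construction, but its length is 0 and no DataNode has reported receiving it. This most likely means the client failed before writing any data into the pipeline. The empty last block is removed, and then finalizeINodeFileUnderConstruction() is called to close the file and remove the lease.
When the last block is in the UNDER_RECOVERY or UNDER_CONSTRUCTION state and the empty-block case above does not apply, the method generates a new generation stamp to use as the recoveryId, calls initializeBlockRecovery() to start block recovery (unless a recovery attempt for the block is already in progress), and reassigns the file's lease to the NameNode's internal lease holder, whose name starts with "HDFS_NameNode".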
/**
 * Move a file that is being written to be immutable.
 * @param src The filename
 * @param lease The lease for the client creating the file
 * @param recoveryLeaseHolder reassign lease to this holder if the last block
 *        needs recovery; keep current holder if null.
 * @throws AlreadyBeingCreatedException if file is waiting to achieve minimal
 *         replication;<br>
 *         RecoveryInProgressException if lease recovery is in progress.<br>
 *         IOException in case of an error.
 * @return true if file has been successfully finalized and closed or
 *         false if block recovery has been initiated. Since the lease owner
 *         has been changed and logged, caller should call logSync().
 */
boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
    String recoveryLeaseHolder) throws IOException {
  LOG.info("Recovering " + lease + ", src=" + src);
  assert !isInSafeMode();
  assert hasWriteLock();
  final INodeFile pendingFile = iip.getLastINode().asFile();
  int nrBlocks = pendingFile.numBlocks();
  BlockInfo[] blocks = pendingFile.getBlocks();
  int nrCompleteBlocks;
  BlockInfo curBlock = null;
  for(nrCompleteBlocks = 0; nrCompleteBlocks < nrBlocks; nrCompleteBlocks++) {
    curBlock = blocks[nrCompleteBlocks];
    if(!curBlock.isComplete())
      break;
    assert blockManager.hasMinStorage(curBlock) :
        "A COMPLETE block is not minimally replicated in " + src;
  }
  // If there are no incomplete blocks associated with this file,
  // then reap lease immediately and close the file.
  if(nrCompleteBlocks == nrBlocks) {
    finalizeINodeFileUnderConstruction(src, pendingFile,
        iip.getLatestSnapshotId(), false);
    NameNode.stateChangeLog.warn("BLOCK*" +
        " internalReleaseLease: All existing blocks are COMPLETE," +
        " lease removed, file " + src + " closed.");
    return true;  // closed!
  }
  // Only the last and the penultimate blocks may be in non COMPLETE state.
  // If the penultimate block is not COMPLETE, then it must be COMMITTED.
  if(nrCompleteBlocks < nrBlocks - 2 ||
      nrCompleteBlocks == nrBlocks - 2 &&
      curBlock != null &&
      curBlock.getBlockUCState() != BlockUCState.COMMITTED) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " but file is already closed.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }
  // The last block is not COMPLETE, and
  // that the penultimate block if exists is either COMPLETE or COMMITTED
  final BlockInfo lastBlock = pendingFile.getLastBlock();
  BlockUCState lastBlockState = lastBlock.getBlockUCState();
  BlockInfo penultimateBlock = pendingFile.getPenultimateBlock();
  // If penultimate block doesn't exist then its minReplication is met
  boolean penultimateBlockMinStorage = penultimateBlock == null ||
      blockManager.hasMinStorage(penultimateBlock);
  switch(lastBlockState) {
  case COMPLETE:
    assert false : "Already checked that the last block is incomplete";
    break;
  case COMMITTED:
    // Close file if committed blocks are minimally replicated
    if(penultimateBlockMinStorage &&
        blockManager.hasMinStorage(lastBlock)) {
      finalizeINodeFileUnderConstruction(src, pendingFile,
          iip.getLatestSnapshotId(), false);
      NameNode.stateChangeLog.warn("BLOCK*" +
          " internalReleaseLease: Committed blocks are minimally" +
          " replicated, lease removed, file" + src + " closed.");
      return true;  // closed!
    }
    // Cannot close file right now, since some blocks
    // are not yet minimally replicated.
    // This may potentially cause infinite loop in lease recovery
    // if there are no valid replicas on data-nodes.
    String message = "DIR* NameSystem.internalReleaseLease: " +
        "Failed to release lease for file " + src +
        ". Committed blocks are waiting to be minimally replicated." +
        " Try again later.";
    NameNode.stateChangeLog.warn(message);
    throw new AlreadyBeingCreatedException(message);
  case UNDER_CONSTRUCTION:
  case UNDER_RECOVERY:
    BlockUnderConstructionFeature uc =
        lastBlock.getUnderConstructionFeature();
    // determine if last block was intended to be truncated
    BlockInfo recoveryBlock = uc.getTruncateBlock();
    boolean truncateRecovery = recoveryBlock != null;
    boolean copyOnTruncate = truncateRecovery &&
        recoveryBlock.getBlockId() != lastBlock.getBlockId();
    assert !copyOnTruncate ||
        recoveryBlock.getBlockId() < lastBlock.getBlockId() &&
        recoveryBlock.getGenerationStamp() < lastBlock.getGenerationStamp() &&
        recoveryBlock.getNumBytes() > lastBlock.getNumBytes() :
        "wrong recoveryBlock";
    // setup the last block locations from the blockManager if not known
    if (uc.getNumExpectedLocations() == 0) {
      uc.setExpectedLocations(lastBlock, blockManager.getStorages(lastBlock),
          lastBlock.getBlockType());
    }
    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
      // There is no datanode reported to this block.
      // may be client have crashed before writing data to pipeline.
      // This blocks doesn't need any recovery.
      // We can remove this block and close the file.
      pendingFile.removeLastBlock(lastBlock);
      finalizeINodeFileUnderConstruction(src, pendingFile,
          iip.getLatestSnapshotId(), false);
      NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
          + "Removed empty last block and closed file " + src);
      return true;
    }
    // Start recovery of the last block for this file
    // Only do so if there is no ongoing recovery for this block,
    // or the previous recovery for this block timed out.
    if (blockManager.addBlockRecoveryAttempt(lastBlock)) {
      long blockRecoveryId = nextGenerationStamp(
          blockManager.isLegacyBlock(lastBlock));
      if(copyOnTruncate) {
        lastBlock.setGenerationStamp(blockRecoveryId);
      } else if(truncateRecovery) {
        recoveryBlock.setGenerationStamp(blockRecoveryId);
      }
      uc.initializeBlockRecovery(lastBlock, blockRecoveryId, true);
      // Cannot close file right now, since the last block requires recovery.
      // This may potentially cause infinite loop in lease recovery
      // if there are no valid replicas on data-nodes.
      NameNode.stateChangeLog.warn(
          "DIR* NameSystem.internalReleaseLease: " +
          "File " + src + " has not been closed." +
          " Lease recovery is in progress. " +
          "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
    }
    lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
    leaseManager.renewLease(lease);
    break;
  }
  return false;
}
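The branching in internalReleaseLease() can be condensed into a small decision function. The sketch below is a simplification under stated assumptions: BlockState, Outcome and RecoveryDecision are hypothetical names, only the state of the last block is considered, and the penultimate-block checks, truncate handling and the recovery-attempt guard are folded into the parameters or left out.

```java
/** Hypothetical mirror of the block states used in the switch above. */
enum BlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION, UNDER_RECOVERY }

/** Hypothetical outcomes of one recovery pass. */
enum Outcome { CLOSE_FILE, RETRY_LATER, START_BLOCK_RECOVERY }

class RecoveryDecision {
  /**
   * @param lastBlockState    state of the file's last block
   * @param minReplicated     whether the committed blocks meet minimum replication
   * @param expectedLocations number of DataNodes known to hold the last block
   * @param lastBlockBytes    length of the last block
   */
  static Outcome decide(BlockState lastBlockState, boolean minReplicated,
      int expectedLocations, long lastBlockBytes) {
    switch (lastBlockState) {
      case COMPLETE:
        // Every block is COMPLETE: the client died before closing the file.
        return Outcome.CLOSE_FILE;
      case COMMITTED:
        // Close only when replication is sufficient; otherwise try again later.
        return minReplicated ? Outcome.CLOSE_FILE : Outcome.RETRY_LATER;
      default:
        // UNDER_CONSTRUCTION or UNDER_RECOVERY
        if (expectedLocations == 0 && lastBlockBytes == 0) {
          // Empty block nobody reported: drop it and close the file.
          return Outcome.CLOSE_FILE;
        }
        return Outcome.START_BLOCK_RECOVERY;
    }
  }
}
```

CLOSE_FILE corresponds to the paths that return true, START_BLOCK_RECOVERY to the path that returns false, and RETRY_LATER to the AlreadyBeingCreatedException thrown in the COMMITTED case.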