Contents
- 1. Preface
- 2. LeaseManager.Lease
- 3. LeaseManager
- 3.1 Adding a lease: addLease()
- 3.2 Checking a lease: FSNamesystem.checkLease()
- 3.3 Renewing a lease: renewLease()
- 3.4 Removing a lease: removeLease()
- 4. Lease checking: the Monitor thread
- 5. Lease recovery: initiated by the Monitor thread
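1. Preface
A lease is a contract in which the Namenode grants a lease holder (LeaseHolder, usually a client) permission on a file (to write it) for a limited time.
HDFS files are write-once-read-many, and concurrent writes by multiple clients are not supported. HDFS enforces this with a lease mechanism that guarantees mutually exclusive write access to a file.
In HDFS, a client that wants to write a file must first obtain a lease from the lease manager (LeaseManager). Once the lease is granted, the client becomes the lease holder and has exclusive write access to that file; while the lease is valid, no other client can open the file for writing. The Namenode's lease manager records the mapping between HDFS files and leases and between leases and lease holders, and it periodically checks whether any lease it maintains has expired. Expired leases are forcibly reclaimed, so a lease holder must renew its lease periodically to keep its exclusive lock on the file. When the client has finished writing and closes the file, the lease must be released in the lease manager.
LeaseManager does the lease housekeeping for files being written.
The class also provides useful static methods for lease recovery.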
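Lease recovery algorithm
1) The Namenode retrieves the lease information
2) For each file f in the lease, consider the last block b of f
2.1) Get the datanodes that contain b
2.2) Pick one of those datanodes as the primary datanode p
2.3) p obtains a new generation stamp from the namenode
2.4) p gets the block info from each datanode
2.5) p computes the minimum block length
2.6) p updates the datanodes with the new generation stamp
2.7) p reports the update results to the namenode
2.8) The Namenode updates the BlockInfo
2.9) The Namenode removes f from the lease, and removes the lease once all files have been removed
2.10) The Namenode commits the changes to the edit log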
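Steps 2.4 to 2.6 can be pictured with a small model. The sketch below is not HDFS code: `BlockRecoverySketch`, `Replica`, `minLength`, and `sync` are hypothetical names, and it assumes that every replica is simply stamped with the new generation stamp and truncated to the shortest reported length. The real recovery also takes replica states into account.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of steps 2.4 to 2.6; not the HDFS classes. */
final class BlockRecoverySketch {

    /** A replica of the last block as reported by one datanode (hypothetical type). */
    static final class Replica {
        final String datanode;
        long generationStamp;
        long length;

        Replica(String datanode, long generationStamp, long length) {
            this.datanode = datanode;
            this.generationStamp = generationStamp;
            this.length = length;
        }
    }

    /** Step 2.5: the agreed length is the smallest length among the replicas. */
    static long minLength(List<Replica> replicas) {
        long min = Long.MAX_VALUE;
        for (Replica r : replicas) {
            min = Math.min(min, r.length);
        }
        return min;
    }

    /** Step 2.6 (assumed form): apply the new generation stamp and the agreed length. */
    static void sync(List<Replica> replicas, long newGenerationStamp) {
        if (replicas.isEmpty()) {
            return;
        }
        long agreed = minLength(replicas);
        for (Replica r : replicas) {
            r.generationStamp = newGenerationStamp;
            r.length = agreed;
        }
    }

    public static void main(String[] args) {
        List<Replica> replicas = new ArrayList<>();
        replicas.add(new Replica("dn1", 1001L, 4096L));
        replicas.add(new Replica("dn2", 1001L, 3072L));
        replicas.add(new Replica("dn3", 1001L, 4096L));
        sync(replicas, 1002L);
        for (Replica r : replicas) {
            System.out.println(r.datanode + " gs=" + r.generationStamp + " len=" + r.length);
        }
    }
}
```

Taking the minimum length means that every surviving replica ends up holding only bytes that all replicas agree on.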
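2. LeaseManager.Lease
An HDFS client can have several HDFS files open for writing at the same time. To make this manageable, the lease manager groups all files opened by one client into a single record, the LeaseManager.Lease class.
holder: the client, that is, the lease holder.
files: the inode IDs of all HDFS files this client has open (addLease() below calls lease.files.add(inodeId); older versions stored file paths in a paths field).
lastUpdate: the time the lease was last renewed.
- renew(): updates lastUpdate, the lease's last renewal time.
- expiredSoftLimit(): reports whether the lease has exceeded the soft limit (softLimit), the lease timeout that applies to a writer. The default is 60 seconds and it is not configurable.
- expiredHardLimit(): reports whether the lease has exceeded the hard limit (hardLimit), the time after which the lease is forcibly reclaimed when a file was not closed properly. The default is 60 minutes and, in the version discussed here, it is not configurable. An inner class of LeaseManager periodically checks whether leases have been renewed, and lease recovery is triggered once the hard limit is exceeded.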
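The fields and methods listed above fit in a few lines. This is a minimal sketch, not the HDFS class: `LeaseSketch` is a hypothetical name, the two limits are the defaults quoted above, and the current time is passed in explicitly (an assumption made here so that the behaviour is easy to check).

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for LeaseManager.Lease. */
final class LeaseSketch {
    static final long SOFT_LIMIT_MS = 60_000L;       // 60 seconds
    static final long HARD_LIMIT_MS = 60 * 60_000L;  // 60 minutes

    final String holder;                        // the client holding the lease
    final Set<Long> files = new HashSet<>();    // inode IDs of the files it has open
    long lastUpdate;                            // time of the last renewal, in ms

    LeaseSketch(String holder, long nowMs) {
        this.holder = holder;
        this.lastUpdate = nowMs;
    }

    /** Record a renewal at the given time. */
    void renew(long nowMs) {
        lastUpdate = nowMs;
    }

    /** True once more than the soft limit has passed since the last renewal. */
    boolean expiredSoftLimit(long nowMs) {
        return nowMs - lastUpdate > SOFT_LIMIT_MS;
    }

    /** True once more than the hard limit has passed since the last renewal. */
    boolean expiredHardLimit(long nowMs) {
        return nowMs - lastUpdate > HARD_LIMIT_MS;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch("client-1", 0L);
        lease.files.add(16386L);
        System.out.println("soft expired after 61 s: " + lease.expiredSoftLimit(61_000L));
        System.out.println("hard expired after 61 s: " + lease.expiredHardLimit(61_000L));
        lease.renew(61_000L);
        System.out.println("soft expired right after renew: " + lease.expiredSoftLimit(61_001L));
    }
}
```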
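3. LeaseManager
LeaseManager is the Namenode class that maintains all lease operations. It stores every lease in HDFS, provides methods to add, remove, update, and look up leases, and runs a Monitor thread that periodically checks for expired leases. For files whose lease has not been renewed for a long time (beyond the hard limit), LeaseManager triggers lease recovery and then closes the file.
LeaseManager stores all leases of the Namenode in three fields: leases, sortedLeases, and leasesById;
the lmthread field holds the lease-checking thread;
the softLimit field holds the soft limit (default 60 seconds, not configurable), and the hardLimit field holds the hard limit (default 60 minutes, not configurable).
The three fields that store leases
■ leases: maps each lease holder to its lease.
■ sortedLeases: holds all leases in LeaseManager ordered by last update time; leases with the same update time are ordered lexicographically by lease holder.
■ leasesById: maps INode ID -> Lease.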
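The tie-break on the holder name matters when the sorted collection is a set. Below is a minimal sketch, assuming a `java.util.TreeSet` as the sorted collection; `SortedLeasesSketch`, `Lease`, and both comparators are hypothetical names. A `TreeSet` treats two elements as the same element when the comparator returns 0, so ordering by time alone would silently drop a lease that shares its update time with another one.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch of the ordering used for sortedLeases. */
final class SortedLeasesSketch {

    static final class Lease {
        final String holder;
        final long lastUpdate;

        Lease(String holder, long lastUpdate) {
            this.holder = holder;
            this.lastUpdate = lastUpdate;
        }
    }

    /** Orders by update time only: two leases with equal times compare as equal. */
    static final Comparator<Lease> BY_TIME_ONLY =
        (a, b) -> Long.compare(a.lastUpdate, b.lastUpdate);

    /** Orders by update time, then by holder name, as described above. */
    static final Comparator<Lease> BY_TIME_THEN_HOLDER = (a, b) -> {
        int byTime = Long.compare(a.lastUpdate, b.lastUpdate);
        return byTime != 0 ? byTime : a.holder.compareTo(b.holder);
    };

    /** Adds all leases to a TreeSet with the given order and returns the resulting set. */
    static TreeSet<Lease> build(Comparator<Lease> order, Lease... leases) {
        TreeSet<Lease> set = new TreeSet<>(order);
        for (Lease l : leases) {
            set.add(l);
        }
        return set;
    }

    public static void main(String[] args) {
        Lease a = new Lease("client-a", 10L);
        Lease b = new Lease("client-b", 10L);
        System.out.println("time only: " + build(BY_TIME_ONLY, a, b).size() + " lease(s) kept");
        System.out.println("time then holder: " + build(BY_TIME_THEN_HOLDER, a, b).size() + " lease(s) kept");
    }
}
```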
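3.1 Adding a lease: addLease()
When a client creates a file or opens one for append, FSNamesystem.startFileInternal() and appendFileInternal() both call LeaseManager.addLease() to add a lease on the file for that client. addLease() takes two parameters: holder identifies the lease holder, and inodeId is the ID of the file. The implementation is simple: it looks up the holder's lease with getLease(), creates a new Lease if none exists (otherwise it renews the existing one), and then records the lease in the data structures that LeaseManager uses to store leases.
The method is synchronized.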
/**
 * Adds (or re-adds) the lease for the specified file.
 */
synchronized Lease addLease(String holder, long inodeId) {
  // Look up the existing lease by client name
  Lease lease = getLease(holder);
  if (lease == null) {
    // No lease yet for this holder: create one
    lease = new Lease(holder);
    // Register the lease in LeaseManager.leases
    leases.put(holder, lease);
    // Register the lease in LeaseManager.sortedLeases
    sortedLeases.add(lease);
  } else {
    renewLease(lease);
  }
  // Record the inodeId -> lease mapping
  leasesById.put(inodeId, lease);
  // Record the file in the lease
  lease.files.add(inodeId);
  return lease;
}
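addLease() is also called in two other situations. When the Namenode loads the fsimage and the fsimage records an HDFS file as under construction, the Namenode rebuilds that file, adds its INode to the file system directory tree, and then adds the lease to LeaseManager. When the Namenode replays the editlog and encounters an OP_ADD operation (a file creation), it creates the INode, adds it to the directory tree, and likewise adds the lease to LeaseManager.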
3.2 Checking a lease: FSNamesystem.checkLease()
INodeFile checkLease(INodesInPath iip, String holder, long fileId)
    throws LeaseExpiredException, FileNotFoundException {
  String src = iip.getPath();
  INode inode = iip.getLastINode();
  assert hasReadLock();
  // The HDFS file does not exist: throw
  if (inode == null) {
    throw new FileNotFoundException("File does not exist: "
        + leaseExceptionString(src, fileId, holder));
  }
  // The INode is not a regular file (for example, a directory): throw
  if (!inode.isFile()) {
    throw new LeaseExpiredException("INode is not a regular file: "
        + leaseExceptionString(src, fileId, holder));
  }
  final INodeFile file = inode.asFile();
  // The file is not under construction: throw
  if (!file.isUnderConstruction()) {
    throw new LeaseExpiredException("File is not open for writing: "
        + leaseExceptionString(src, fileId, holder));
  }
  // No further modification is allowed on a deleted file.
  // A file is considered deleted, if it is not in the inodeMap or is marked
  // as deleted in the snapshot feature.
  if (isFileDeleted(file)) {
    throw new FileNotFoundException("File is deleted: "
        + leaseExceptionString(src, fileId, holder));
  }
  final String owner = file.getFileUnderConstructionFeature().getClientName();
  // The caller is not the client recorded as the writer of this file: throw
  if (holder != null && !owner.equals(holder)) {
    throw new LeaseExpiredException("Client (=" + holder
        + ") is not the lease owner (=" + owner + ": "
        + leaseExceptionString(src, fileId, holder));
  }
  return file;
}
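3.3 Renewing a lease: renewLease()
When a client opens a file for write or append, LeaseManager records the client's lease on that file. The client starts a LeaseRenewer that renews the lease periodically so that it does not expire.
Renewal requests are handled by FSNamesystem.renewLease(), which ends up calling LeaseManager.renewLease(). renewLease() first removes the lease from sortedLeases, then updates the lease's last update time, and finally adds it back to sortedLeases. This is necessary because sortedLeases is ordered by last update time, so the position of the lease has to be recomputed after every renewal.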
synchronized void renewLease(Lease lease) {
  if (lease != null) {
    // Take the lease out of the sorted collection before its sort key changes
    sortedLeases.remove(lease);
    lease.renew();
    sortedLeases.add(lease);
  }
}
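The remove, update, re-add sequence can be tried out in isolation. This is a minimal sketch, assuming the sorted collection behaves like a `java.util.TreeSet`; `RenewLeaseSketch`, `Lease`, `add`, and `oldestHolder` are hypothetical names, and the renewal time is passed in rather than read from a clock. The removal has to happen while the old timestamp is still in place, because the set locates the element by comparing against that timestamp.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch of renewing a lease that lives in a time-ordered set. */
final class RenewLeaseSketch {

    static final class Lease {
        final String holder;
        long lastUpdate;

        Lease(String holder, long lastUpdate) {
            this.holder = holder;
            this.lastUpdate = lastUpdate;
        }
    }

    static final Comparator<Lease> OLDEST_FIRST = (a, b) -> {
        int byTime = Long.compare(a.lastUpdate, b.lastUpdate);
        return byTime != 0 ? byTime : a.holder.compareTo(b.holder);
    };

    final TreeSet<Lease> sortedLeases = new TreeSet<>(OLDEST_FIRST);

    void add(Lease lease) {
        sortedLeases.add(lease);
    }

    /** Same three steps as renewLease(): remove, change the sort key, add back. */
    void renew(Lease lease, long nowMs) {
        if (lease == null) {
            return;
        }
        sortedLeases.remove(lease);   // found via the old lastUpdate
        lease.lastUpdate = nowMs;     // the sort key changes here
        sortedLeases.add(lease);      // re-inserted at its new position
    }

    String oldestHolder() {
        return sortedLeases.first().holder;
    }

    public static void main(String[] args) {
        RenewLeaseSketch manager = new RenewLeaseSketch();
        Lease a = new Lease("client-a", 1L);
        Lease b = new Lease("client-b", 2L);
        manager.add(a);
        manager.add(b);
        System.out.println("oldest before renew: " + manager.oldestHolder());
        manager.renew(a, 3L);
        System.out.println("oldest after renew: " + manager.oldestHolder());
    }
}
```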
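3.4 Removing a lease: removeLease()
Leases are removed from LeaseManager in two situations.
■ When the Namenode closes an under-construction HDFS file, it calls FSNamesystem.finalizeINodeFileUnderConstruction() to convert the INode from the under-construction state to the complete state. Because the client has finished writing the file, the file's lease must also be removed from LeaseManager, which is done by calling removeLease().
■ When part of the directory tree is deleted and a client removes a file that is still open from the file system directory tree, removeLeaseWithPrefixPath() is called to remove the lease from LeaseManager.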
// Remove the lease that covers the given inode
synchronized void removeLease(long inodeId) {
  final Lease lease = leasesById.get(inodeId);
  if (lease != null) {
    removeLease(lease, inodeId);
  }
}

/**
 * Remove the specified lease and src.
 */
private synchronized void removeLease(Lease lease, long inodeId) {
  leasesById.remove(inodeId);
  if (!lease.removeFile(inodeId)) {
    LOG.debug("inode {} not found in lease.files (={})", inodeId, lease);
  }
  if (!lease.hasFiles()) {
    leases.remove(lease.holder);
    if (!sortedLeases.remove(lease)) {
      LOG.error("{} not found in sortedLeases", lease);
    }
  }
}

/**
 * Remove the lease for the specified holder and src
 */
synchronized void removeLease(String holder, INodeFile src) {
  Lease lease = getLease(holder);
  if (lease != null) {
    removeLease(lease, src.getId());
  } else {
    LOG.warn("Removing non-existent lease! holder={} src={}", holder, src
        .getFullPathName());
  }
}

synchronized void removeAllLeases() {
  sortedLeases.clear();
  leasesById.clear();
  leases.clear();
}
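4. Lease checking: the Monitor thread
Besides adding, removing, updating, and looking up leases, the lease manager periodically checks all leases. For a file whose lease has not been renewed for a long time, LeaseManager performs lease recovery on the file and then closes it.
The periodic check is performed by Monitor, an inner class of LeaseManager. Monitor implements Runnable, and its run() method calls LeaseManager.checkLeases() every 2 seconds by default.
The thread is started in FSNamesystem#startActiveServices.
Check interval parameter: dfs.namenode.lease-recheck-interval-ms, default 2000 (2 s).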
/******************************************************
 * Monitor checks for leases that have expired,
 * and disposes of them.
 ******************************************************/
class Monitor implements Runnable {
  final String name = getClass().getSimpleName();

  /** Check leases periodically. */
  @Override
  public void run() {
    for(; shouldRunMonitor && fsnamesystem.isRunning(); ) {
      boolean needSync = false;
      try {
        fsnamesystem.writeLockInterruptibly();
        try {
          // Skip the check while the namesystem is in safe mode
          if (!fsnamesystem.isInSafeMode()) {
            // Check the leases
            needSync = checkLeases();
          }
        } finally {
          fsnamesystem.writeUnlock("leaseManager");
          // lease reassignments should to be sync'ed.
          if (needSync) {
            fsnamesystem.getEditLog().logSync();
          }
        }
        Thread.sleep(fsnamesystem.getLeaseRecheckIntervalMs());
      } catch(InterruptedException ie) {
        LOG.debug("{} is interrupted", name, ie);
      } catch(Throwable e) {
        LOG.warn("Unexpected throwable: ", e);
      }
    }
  }
}
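checkLeases() walks the leases managed by LeaseManager, oldest first, and picks out those that have gone past the hard limit without being renewed. Because a lease covers every HDFS file the client has open, checkLeases() iterates over all files in the lease and calls FSNamesystem.internalReleaseLease() on each of them to perform lease recovery.
The code of checkLeases() is as follows: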
/** Check the leases beginning from the oldest.
 *  @return true if sync is needed.
 */
@VisibleForTesting
synchronized boolean checkLeases() {
  boolean needSync = false;
  assert fsnamesystem.hasWriteLock();
  long start = monotonicNow();
  // Walk the leases in LeaseManager, oldest first.
  // sortedLeases is ordered by last update time, so the oldest lease is
  // always first() and only that lease needs to be tested.
  // If it has not passed the hard limit, no later lease has either, and the loop ends.
  while(!sortedLeases.isEmpty() &&
      sortedLeases.first().expiredHardLimit()
      && !isMaxLockHoldToReleaseLease(start)) {
    // Take the oldest lease
    Lease leaseToCheck = sortedLeases.first();
    LOG.info("{} has expired hard limit", leaseToCheck);
    // Handle the expired lease
    final List<Long> removing = new ArrayList<>();
    // need to create a copy of the oldest lease files, because
    // internalReleaseLease() removes files corresponding to empty files,
    // i.e. it needs to modify the collection being iterated over
    // causing ConcurrentModificationException
    Collection<Long> files = leaseToCheck.getFiles();
    // Iterate over a copy of the files in the expired lease and recover each one
    Long[] leaseINodeIds = files.toArray(new Long[files.size()]);
    // The FSDirectory (namespace directory tree), used to resolve inode IDs
    FSDirectory fsd = fsnamesystem.getFSDirectory();
    String p = null;
    // Name of the current internal lease holder
    String newHolder = getInternalLeaseHolder();
    for(Long id : leaseINodeIds) {
      try {
        // Resolve the inode to its path
        INodesInPath iip = INodesInPath.fromINode(fsd.getInode(id));
        p = iip.getPath();
        // Sanity check to make sure the path is correct
        if (!p.startsWith("/")) {
          throw new IOException("Invalid path in the lease " + p);
        }
        final INodeFile lastINode = iip.getLastINode().asFile();
        // Has the file already been deleted?
        if (fsnamesystem.isFileDeleted(lastINode)) {
          // INode referred by the lease could have been deleted.
          removeLease(lastINode.getId());
          continue;
        }
        boolean completed = false;
        try {
          // Call FSNamesystem.internalReleaseLease() to recover the lease on this file
          completed = fsnamesystem.internalReleaseLease(
              leaseToCheck, p, iip, newHolder);
        } catch (IOException e) {
          LOG.warn("Cannot release the path {} in the lease {}. It will be "
              + "retried.", p, leaseToCheck, e);
          continue;
        }
        if (LOG.isDebugEnabled()) {
          if (completed) {
            LOG.debug("Lease recovery for inode {} is complete. File closed"
                + ".", id);
          } else {
            LOG.debug("Started block recovery {} lease {}", p, leaseToCheck);
          }
        }
        // If a lease recovery happened, we need to sync later
        // (the change has to be synced to the editlog).
        if (!needSync && !completed) {
          needSync = true;
        }
      } catch (IOException e) {
        LOG.warn("Removing lease with an invalid path: {},{}", p,
            leaseToCheck, e);
        // The path could not be resolved: queue the inode for removal from the lease
        removing.add(id);
      }
      if (isMaxLockHoldToReleaseLease(start)) {
        LOG.debug("Breaking out of checkLeases after {} ms.",
            fsnamesystem.getMaxLockHoldToReleaseLeaseMs());
        break;
      }
    }

    // Drop the files with an invalid path from the lease
    for(Long id : removing) {
      removeLease(leaseToCheck, id);
    }
  }
  return needSync;
}
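The shape of the outer loop (look only at the oldest lease, stop at the first one that has not expired, and give up once the lock has been held too long) can be sketched on its own. Everything below is hypothetical and simplified: `CheckLeasesSketch`, `releaseExpired`, and the injected clock are assumptions made for the example, and "releasing" a lease here just removes it from the set instead of running lease recovery.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the oldest-first scan with a lock-hold budget. */
final class CheckLeasesSketch {

    static final class Lease {
        final String holder;
        final long lastUpdate;

        Lease(String holder, long lastUpdate) {
            this.holder = holder;
            this.lastUpdate = lastUpdate;
        }
    }

    static final Comparator<Lease> OLDEST_FIRST = (a, b) -> {
        int byTime = Long.compare(a.lastUpdate, b.lastUpdate);
        return byTime != 0 ? byTime : a.holder.compareTo(b.holder);
    };

    /** Returns the holders whose leases were released, oldest first. */
    static List<String> releaseExpired(TreeSet<Lease> sortedLeases, long hardLimitMs,
                                       long maxLockHoldMs, LongSupplier clock) {
        List<String> released = new ArrayList<>();
        long start = clock.getAsLong();
        while (!sortedLeases.isEmpty()) {
            long now = clock.getAsLong();
            Lease oldest = sortedLeases.first();
            boolean expired = now - oldest.lastUpdate > hardLimitMs;
            boolean overBudget = now - start > maxLockHoldMs;
            if (!expired || overBudget) {
                break;  // every later lease is newer, or the time budget is used up
            }
            sortedLeases.pollFirst();
            released.add(oldest.holder);
        }
        return released;
    }

    public static void main(String[] args) {
        TreeSet<Lease> leases = new TreeSet<>(OLDEST_FIRST);
        leases.add(new Lease("client-a", 0L));
        leases.add(new Lease("client-b", 5_000L));
        leases.add(new Lease("client-c", 9_000L));
        // Fixed clock at t = 10 000 ms, hard limit of 4 000 ms for the demo.
        List<String> released = releaseExpired(leases, 4_000L, 25L, () -> 10_000L);
        System.out.println("released: " + released + ", remaining: " + leases.size());
    }
}
```

Because the set is ordered by age, the scan does work proportional to the number of expired leases rather than to the total number of leases.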
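5. Lease recovery: initiated by the Monitor thread
Lease recovery for an HDFS file is carried out by FSNamesystem.internalReleaseLease(), which recovers the lease of a file that is still open and closes the file. The method returns true if the file was closed, and false if it only started recovery. Lease recovery applies to open, under-construction files, so internalReleaseLease() inspects the state of the file's blocks and throws an exception when it finds an inconsistent state. In the version of checkLeases() shown above, an IOException thrown by internalReleaseLease() is logged and the file is retried on a later check; only files whose path cannot be resolved are removed from the lease with removeLease().
While the file is under construction, there are three cases in which it can be closed directly, returning true.
■ All blocks of the file are in the COMPLETED state; that is, the client failed before it had a chance to close the file and release the lease. internalReleaseLease() can then call finalizeINodeFileUnderConstruction() directly to close the file and remove the lease.
■ The last block of the file is in the committed state (COMMITTED) and the committed blocks are minimally replicated (blockManager.hasMinStorage()). finalizeINodeFileUnderConstruction() can then be called directly to close the file and remove the lease; if the replication check fails, an AlreadyBeingCreatedException is thrown instead.
■ The last block of the file is under construction, but its length is 0 and no Datanode has reported receiving it. Most likely the client failed before writing any data into the pipeline. The empty last block is removed, and finalizeINodeFileUnderConstruction() is then called to close the file and remove the lease.
When the last block is in the UNDER_RECOVERY or UNDER_CONSTRUCTION state and data has already been written to it, a new generation stamp is allocated as the recoveryId, initializeBlockRecovery() is called to start block recovery, and the lease holder of the file is changed to "HDFS_NameNode".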
fsnamesystem.internalReleaseLease( leaseToCheck, p, iip, newHolder);
/**
 * Move a file that is being written to be immutable.
 * @param src The filename
 * @param lease The lease for the client creating the file
 * @param recoveryLeaseHolder reassign lease to this holder if the last block
 *        needs recovery; keep current holder if null.
 * @throws AlreadyBeingCreatedException if file is waiting to achieve minimal
 *         replication;<br>
 *         RecoveryInProgressException if lease recovery is in progress.<br>
 *         IOException in case of an error.
 * @return true if file has been successfully finalized and closed or
 *         false if block recovery has been initiated. Since the lease owner
 *         has been changed and logged, caller should call logSync().
 */
boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
    String recoveryLeaseHolder) throws IOException {
  LOG.info("Recovering " + lease + ", src=" + src);
  assert !isInSafeMode();
  assert hasWriteLock();

  final INodeFile pendingFile = iip.getLastINode().asFile();
  int nrBlocks = pendingFile.numBlocks();
  BlockInfo[] blocks = pendingFile.getBlocks();

  int nrCompleteBlocks;
  BlockInfo curBlock = null;
  for(nrCompleteBlocks = 0; nrCompleteBlocks < nrBlocks; nrCompleteBlocks++) {
    curBlock = blocks[nrCompleteBlocks];
    if(!curBlock.isComplete())
      break;
    assert blockManager.hasMinStorage(curBlock) :
        "A COMPLETE block is not minimally replicated in " + src;
  }

  // If there are no incomplete blocks associated with this file,
  // then reap lease immediately and close the file.
  if(nrCompleteBlocks == nrBlocks) {
    finalizeINodeFileUnderConstruction(src, pendingFile,
        iip.getLatestSnapshotId(), false);
    NameNode.stateChangeLog.warn("BLOCK*" +
        " internalReleaseLease: All existing blocks are COMPLETE," +
        " lease removed, file " + src + " closed.");
    return true;  // closed!
  }

  // Only the last and the penultimate blocks may be in non COMPLETE state.
  // If the penultimate block is not COMPLETE, then it must be COMMITTED.
  if(nrCompleteBlocks < nrBlocks - 2 ||
     nrCompleteBlocks == nrBlocks - 2 &&
       curBlock != null &&
       curBlock.getBlockUCState() != BlockUCState.COMMITTED) {
    final String message = "DIR* NameSystem.internalReleaseLease: "
        + "attempt to release a create lock on "
        + src + " but file is already closed.";
    NameNode.stateChangeLog.warn(message);
    throw new IOException(message);
  }

  // The last block is not COMPLETE, and
  // that the penultimate block if exists is either COMPLETE or COMMITTED
  final BlockInfo lastBlock = pendingFile.getLastBlock();
  BlockUCState lastBlockState = lastBlock.getBlockUCState();
  BlockInfo penultimateBlock = pendingFile.getPenultimateBlock();

  // If penultimate block doesn't exist then its minReplication is met
  boolean penultimateBlockMinStorage = penultimateBlock == null ||
      blockManager.hasMinStorage(penultimateBlock);

  switch(lastBlockState) {
  case COMPLETE:
    assert false : "Already checked that the last block is incomplete";
    break;
  case COMMITTED:
    // Close file if committed blocks are minimally replicated
    if(penultimateBlockMinStorage &&
        blockManager.hasMinStorage(lastBlock)) {
      finalizeINodeFileUnderConstruction(src, pendingFile,
          iip.getLatestSnapshotId(), false);
      NameNode.stateChangeLog.warn("BLOCK*" +
          " internalReleaseLease: Committed blocks are minimally" +
          " replicated, lease removed, file" + src + " closed.");
      return true;  // closed!
    }
    // Cannot close file right now, since some blocks
    // are not yet minimally replicated.
    // This may potentially cause infinite loop in lease recovery
    // if there are no valid replicas on data-nodes.
    String message = "DIR* NameSystem.internalReleaseLease: " +
        "Failed to release lease for file " + src +
        ". Committed blocks are waiting to be minimally replicated." +
        " Try again later.";
    NameNode.stateChangeLog.warn(message);
    throw new AlreadyBeingCreatedException(message);
  case UNDER_CONSTRUCTION:
  case UNDER_RECOVERY:
    BlockUnderConstructionFeature uc =
        lastBlock.getUnderConstructionFeature();
    // determine if last block was intended to be truncated
    BlockInfo recoveryBlock = uc.getTruncateBlock();
    boolean truncateRecovery = recoveryBlock != null;
    boolean copyOnTruncate = truncateRecovery &&
        recoveryBlock.getBlockId() != lastBlock.getBlockId();
    assert !copyOnTruncate ||
        recoveryBlock.getBlockId() < lastBlock.getBlockId() &&
        recoveryBlock.getGenerationStamp() < lastBlock.getGenerationStamp() &&
        recoveryBlock.getNumBytes() > lastBlock.getNumBytes() :
          "wrong recoveryBlock";

    // setup the last block locations from the blockManager if not known
    if (uc.getNumExpectedLocations() == 0) {
      uc.setExpectedLocations(lastBlock, blockManager.getStorages(lastBlock),
          lastBlock.getBlockType());
    }

    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
      // There is no datanode reported to this block.
      // may be client have crashed before writing data to pipeline.
      // This blocks doesn't need any recovery.
      // We can remove this block and close the file.
      pendingFile.removeLastBlock(lastBlock);
      finalizeINodeFileUnderConstruction(src, pendingFile,
          iip.getLatestSnapshotId(), false);
      NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
          + "Removed empty last block and closed file " + src);
      return true;
    }
    // Start recovery of the last block for this file
    // Only do so if there is no ongoing recovery for this block,
    // or the previous recovery for this block timed out.
    if (blockManager.addBlockRecoveryAttempt(lastBlock)) {
      long blockRecoveryId = nextGenerationStamp(
          blockManager.isLegacyBlock(lastBlock));
      if(copyOnTruncate) {
        lastBlock.setGenerationStamp(blockRecoveryId);
      } else if(truncateRecovery) {
        recoveryBlock.setGenerationStamp(blockRecoveryId);
      }
      uc.initializeBlockRecovery(lastBlock, blockRecoveryId, true);

      // Cannot close file right now, since the last block requires recovery.
      // This may potentially cause infinite loop in lease recovery
      // if there are no valid replicas on data-nodes.
      NameNode.stateChangeLog.warn(
          "DIR* NameSystem.internalReleaseLease: " +
          "File " + src + " has not been closed." +
          " Lease recovery is in progress. " +
          "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
    }
    lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
    leaseManager.renewLease(lease);
    break;
  }
  return false;
}
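The branches of the method above boil down to a small decision table. The sketch below restates them with plain values; it is not HDFS code. `ReleaseDecisionSketch`, `Outcome`, `decide`, and its parameters are hypothetical names, `RETRY_LATER` stands for the AlreadyBeingCreatedException branch, and the truncate handling and the penultimate-block sanity check are left out.

```java
/** Hypothetical summary of the outcomes of internalReleaseLease(). */
final class ReleaseDecisionSketch {

    enum BlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION, UNDER_RECOVERY }

    enum Outcome { CLOSE_FILE, RETRY_LATER, START_BLOCK_RECOVERY }

    /**
     * @param allBlocksComplete   every block of the file is COMPLETE
     * @param lastBlock           state of the last block
     * @param minimallyReplicated the committed blocks meet minimum replication
     * @param expectedLocations   number of datanodes known to hold the last block
     * @param lastBlockBytes      length of the last block
     */
    static Outcome decide(boolean allBlocksComplete, BlockState lastBlock,
                          boolean minimallyReplicated, int expectedLocations,
                          long lastBlockBytes) {
        if (allBlocksComplete) {
            return Outcome.CLOSE_FILE;
        }
        switch (lastBlock) {
            case COMMITTED:
                return minimallyReplicated ? Outcome.CLOSE_FILE : Outcome.RETRY_LATER;
            case UNDER_CONSTRUCTION:
            case UNDER_RECOVERY:
                if (expectedLocations == 0 && lastBlockBytes == 0) {
                    // The empty last block is dropped, then the file is closed.
                    return Outcome.CLOSE_FILE;
                }
                return Outcome.START_BLOCK_RECOVERY;
            default:
                throw new IllegalStateException("last block cannot be COMPLETE here");
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(true, BlockState.COMPLETE, true, 3, 1024L));
        System.out.println(decide(false, BlockState.COMMITTED, false, 3, 1024L));
        System.out.println(decide(false, BlockState.UNDER_CONSTRUCTION, false, 0, 0L));
        System.out.println(decide(false, BlockState.UNDER_RECOVERY, false, 2, 512L));
    }
}
```

Only the last outcome makes the method return false; in that case the lease is reassigned to the internal holder and renewed, as the end of the method above shows.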
This is getting tedious, so I'll stop here…