HBase中系统故障恢复以及主从复制都基于HLog实现。默认情况下,所有写入操作(写入、更新以及删除)的数据都先以追加形式写入HLog,再写入MemStore。大多数情况下,HLog并不会被读取,但如果RegionServer在某些异常情况下发生宕机,此时已经写入MemStore中但尚未flush到磁盘的数据就会丢失,需要回放HLog补救丢失的数据。此外,HBase主从复制需要主集群将HLog日志发送给从集群,从集群在本地执行回放操作,完成集群之间的数据复制。
1、HLog文件结构
HLog文件的基本结构如图所示。
---每个RegionServer拥有一个或多个HLog。每个HLog是多个Region共享的,图中Region A、Region B和Region C共享一个HLog文件。
---HLog中,日志单元WALEntry(图中小方框)表示一次行级更新的最小追加单元,它由HLogKey和WALEdit两部分组成。
(1)、HLogKey
由table name、region name以及sequenceid等字段构成。
WALKeyImpl:
@InterfaceAudience.Private
protected void init(final byte[] encodedRegionName,
final TableName tablename,
long logSeqNum,
final long now,
List<UUID> clusterIds,
long nonceGroup,
long nonce,
MultiVersionConcurrencyControl mvcc,
NavigableMap<byte[], Integer> replicationScope,
Map<String, byte[]> extendedAttributes) {
this.sequenceId = logSeqNum;
this.writeTime = now;
this.clusterIds = clusterIds;
this.encodedRegionName = encodedRegionName;
this.tablename = tablename;
this.nonceGroup = nonceGroup;
this.nonce = nonce;
this.mvcc = mvcc;
if (logSeqNum != NO_SEQUENCE_ID) {
setSequenceId(logSeqNum);
}
this.replicationScope = replicationScope;
this.extendedAttributes = extendedAttributes;
}
(2)、WALEdit
用来表示一个事务中的更新集合
KeyValue:
private static byte [] createByteArray(final byte [] row, final int roffset,
final int rlength, final byte [] family, final int foffset, int flength,
final byte [] qualifier, final int qoffset, int qlength,
final long timestamp, final Type type,
final byte [] value, final int voffset,
int vlength, byte[] tags, int tagsOffset, int tagsLength) {
checkParameters(row, rlength, family, flength, qlength, vlength);
RawCell.checkForTagsLength(tagsLength);
// Allocate right-sized byte array.
int keyLength = (int) getKeyDataStructureSize(rlength, flength, qlength);
byte[] bytes = new byte[(int) getKeyValueDataStructureSize(rlength, flength, qlength, vlength,
tagsLength)];
// Write key, value and key row length.
int pos = 0;
pos = Bytes.putInt(bytes, pos, keyLength);
pos = Bytes.putInt(bytes, pos, vlength);
pos = Bytes.putShort(bytes, pos, (short)(rlength & 0x0000ffff));
pos = Bytes.putBytes(bytes, pos, row, roffset, rlength);
pos = Bytes.putByte(bytes, pos, (byte)(flength & 0x0000ff));
if(flength != 0) {
pos = Bytes.putBytes(bytes, pos, family, foffset, flength);
}
if(qlength != 0) {
pos = Bytes.putBytes(bytes, pos, qualifier, qoffset, qlength);
}
pos = Bytes.putLong(bytes, pos, timestamp);
pos = Bytes.putByte(bytes, pos, type.getCode());
if (value != null && value.length > 0) {
pos = Bytes.putBytes(bytes, pos, value, voffset, vlength);
}
// Add the tags after the value part
if (tagsLength > 0) {
pos = Bytes.putAsShort(bytes, pos, tagsLength);
pos = Bytes.putBytes(bytes, pos, tags, tagsOffset, tagsLength);
}
return bytes;
}
2、WAL写入流程 FSHLOG
分析WAL写入流程前,先简单看下hbase put/delete 的操作流程。核心类是HRegion.java,put/delete操作都要先找到数据所属的region,然后调用HRegion的相关方法进行操作。下面以put操作为例,简单说明:
put方法
put 方法是入口。实际执行逻辑在doBatchMutate方法中。
@Override
public void put(Put put) throws IOException {
checkReadOnly();//检查是否只读
// Do a rough check that we have resources to accept a write. The check is
// 'rough' in that between the resource check and the call to obtain a
// read lock, resources may run out. For now, the thought is that this
// will be extremely rare; we'll deal with it when it happens.
checkResources();
startRegionOperation(Operation.PUT);
try {
// All edits for the given row (across all column families) must happen atomically.
doBatchMutate(put);//一行的所有列族必须原子性的修改
} finally {
closeRegionOperation(Operation.PUT);
}
}
private void doBatchMutate(Mutation mutation) throws IOException {
// Currently this is only called for puts and deletes, so no nonces.
OperationStatus[] batchMutate = this.batchMutate(new Mutation[]{mutation});
if (batchMutate[0].getOperationStatusCode().equals(OperationStatusCode.SANITY_CHECK_FAILURE)) {
throw new FailedSanityCheckException(batchMutate[0].getExceptionMsg());
} else if (batchMutate[0].getOperationStatusCode().equals(OperationStatusCode.BAD_FAMILY)) {
throw new NoSuchColumnFamilyException(batchMutate[0].getExceptionMsg());
} else if (batchMutate[0].getOperationStatusCode().equals(OperationStatusCode.STORE_TOO_BUSY)) {
throw new RegionTooBusyException(batchMutate[0].getExceptionMsg());
}
}
/**
* Perform a batch of mutations.
*
* It supports only Put and Delete mutations and will ignore other types passed. Operations in
* a batch are stored with highest durability specified of for all operations in a batch,
* except for {@link Durability#SKIP_WAL}.
*
* <p>This function is called from {@link #batchReplay(WALSplitUtil.MutationReplay[], long)} with
* {@link ReplayBatchOperation} instance and {@link #batchMutate(Mutation[], long, long)} with
* {@link MutationBatchOperation} instance as an argument. As the processing of replay batch
* and mutation batch is very similar, lot of code is shared by providing generic methods in
* base class {@link BatchOperation}. The logic for this method and
* {@link #doMiniBatchMutate(BatchOperation)} is implemented using methods in base class which
* are overridden by derived classes to implement special behavior.
*
* @param batchOp contains the list of mutations
* @return an array of OperationStatus which internally contains the
* OperationStatusCode and the exceptionMessage if any.
* @throws IOException if an IO problem is encountered
*/
OperationStatus[] batchMutate(BatchOperation<?> batchOp) throws IOException {
boolean initialized = false;
batchOp.startRegionOperation();
try {
while (!batchOp.isDone()) {
if (!batchOp.isInReplay()) {
checkReadOnly();
}
checkResources();
if (!initialized) {
this.writeRequestsCount.add(batchOp.size());
// validate and prepare batch for write, for MutationBatchOperation it also calls CP
// prePut()/ preDelete() hooks
batchOp.checkAndPrepare();
initialized = true;
}
doMiniBatchMutate(batchOp);
requestFlushIfNeeded();
}
} finally {
if (rsServices != null && rsServices.getMetrics() != null) {
rsServices.getMetrics().updateWriteQueryMeter(this.htableDescriptor.
getTableName(), batchOp.size());
}
batchOp.closeRegionOperation();
}
return batchOp.retCodeDetails;
}
/**
* Called to do a piece of the batch that came in to {@link #batchMutate(Mutation[], long, long)}
* In here we also handle replay of edits on region recover. Also gets change in size brought
* about by applying {@code batchOp}.
*/
private void doMiniBatchMutate(BatchOperation<?> batchOp) throws IOException {
boolean success = false;
WALEdit walEdit = null;
WriteEntry writeEntry = null;
boolean locked = false;
// We try to set up a batch in the range [batchOp.nextIndexToProcess,lastIndexExclusive)
MiniBatchOperationInProgress<Mutation> miniBatchOp = null;
/** Keep track of the locks we hold so we can release them in finally clause */
List<RowLock> acquiredRowLocks = Lists.newArrayListWithCapacity(batchOp.size());
try {
// STEP 1. Try to acquire as many locks as we can and build mini-batch of operations with
// locked rows
// 添加行锁(实际是下面的读写锁)
miniBatchOp = batchOp.lockRowsAndBuildMiniBatch(acquiredRowLocks);
// We've now grabbed as many mutations off the list as we can
// Ensure we acquire at least one.
if (miniBatchOp.getReadyToWriteCount() <= 0) {
// Nothing to put/delete -- an exception in the above such as NoSuchColumnFamily?
return;
}
// 添加读写锁。
lock(this.updatesLock.readLock(), miniBatchOp.getReadyToWriteCount());
locked = true;
// STEP 2. Update mini batch of all operations in progress with LATEST_TIMESTAMP timestamp
// We should record the timestamp only after we have acquired the rowLock,
// otherwise, newer puts/deletes are not guaranteed to have a newer timestamp
// 这里的时间戳获取的是 加锁以后的时间戳,保证时间戳的有效性
long now = EnvironmentEdgeManager.currentTime();
batchOp.prepareMiniBatchOperations(miniBatchOp, now, acquiredRowLocks);
// STEP 3. Build WAL edit
List<Pair<NonceKey, WALEdit>> walEdits = batchOp.buildWALEdits(miniBatchOp);
// STEP 4. Append the WALEdits to WAL and sync.
for(Iterator<Pair<NonceKey, WALEdit>> it = walEdits.iterator(); it.hasNext();) {
Pair<NonceKey, WALEdit> nonceKeyWALEditPair = it.next();
walEdit = nonceKeyWALEditPair.getSecond();
NonceKey nonceKey = nonceKeyWALEditPair.getFirst();
if (walEdit != null && !walEdit.isEmpty()) {
writeEntry = doWALAppend(walEdit, batchOp.durability, batchOp.getClusterIds(), now,
nonceKey.getNonceGroup(), nonceKey.getNonce(), batchOp.getOrigLogSeqNum());
}
// Complete mvcc for all but last writeEntry (for replay case)
if (it.hasNext() && writeEntry != null) {
mvcc.complete(writeEntry);
writeEntry = null;
}
}
// STEP 5. Write back to memStore
// NOTE: writeEntry can be null here
// 写 memstore
writeEntry = batchOp.writeMiniBatchOperationsToMemStore(miniBatchOp, writeEntry);
// STEP 6. Complete MiniBatchOperations: If required calls postBatchMutate() CP hook and
// complete mvcc for last writeEntry
// 完成MiniBatchOperations
batchOp.completeMiniBatchOperations(miniBatchOp, writeEntry);
writeEntry = null;
success = true;
} finally {
// Call complete rather than completeAndWait because we probably had error if walKey != null
if (writeEntry != null) mvcc.complete(writeEntry);
if (locked) {
this.updatesLock.readLock().unlock();
}
releaseRowLocks(acquiredRowLocks);
final int finalLastIndexExclusive =
miniBatchOp != null ? miniBatchOp.getLastIndexExclusive() : batchOp.size();
final boolean finalSuccess = success;
batchOp.visitBatchOperations(true, finalLastIndexExclusive, (int i) -> {
batchOp.retCodeDetails[i] =
finalSuccess ? OperationStatus.SUCCESS : OperationStatus.FAILURE;
return true;
});
batchOp.doPostOpCleanupForMiniBatch(miniBatchOp, walEdit, finalSuccess);
batchOp.nextIndexToProcess = finalLastIndexExclusive;
}
}
doMiniBatchMutate方法中体现了完整的数据put的流程,可以看到,分为以下几步:
- 添加读写锁
- 数据写入的时间以获取锁后的时间为准
- 构建 WALEdit
- 将WALEdits写WAL并且sync刷盘
- 写入memStore
- 完成MiniBatchOperations
这里主要看第三步和第四步。AsyncFSWAL
和FSHlog
是写入WAL的 核心类:
在HBase的演进过程中,HLog的写入模型几经改进,写入吞吐量得到极大提升。之前的版本中,HLog写入都需要经过三个阶段:首先将数据写入本地缓存,然后将本地缓存写入文件系统,最后执行sync操作同步到磁盘。
很显然,三个阶段是可以流水线工作的,基于这样的设想,写入模型自然就想到“生产者-消费者”队列实现。然而之前版本中,生产者之间、消费者之间以及生产者与消费者之间的线程同步都是由HBase系统实现,使用了大量的锁,在写入并发量非常大的情况下会频繁出现恶性抢占锁的问题,写入性能较差。
当前版本中,HBase使用LMAX Disruptor框架实现了无锁有界队列操作。基于Disruptor的HLog写入模型如图
图中最左侧部分是Region处理HLog写入的两个前后操作:append和sync。当调用append后,WALEdit和HLogKey会被封装成FSWALEntry类,进而再封装成Ring
BufferTruck类放入Disruptor无锁有界队列中。当调用sync后,会生成一个SyncFuture,再封装成RingBufferTruck类放入同一个队列中,然后工作线程会被阻塞,等待notify()来唤醒。
图最右侧部分是消费者线程,在Disruptor框架中有且仅有一个消费者线程工作。这个框架会从Disruptor队列中依次取出RingBufferTruck对象,然后根据如下选项来操作:
·如果RingBufferTruck对象中封装的是FSWALEntry,就会执行文件append操作,将记录追加写入HDFS文件中。需要注意的是,此时数据有可能并没有实际落盘,而只是写入到文件缓存。
·如果RingBufferTruck对象是SyncFuture,会调用线程池的线程异步地批量刷盘,刷盘成功之后唤醒工作线程完成HLog的sync操作.
wal日志的写入是通过HRegion中的wal对象写入的:
/**
* @return writeEntry associated with this append
*/
private WriteEntry doWALAppend(WALEdit walEdit, Durability durability, List<UUID> clusterIds,
long now, long nonceGroup, long nonce, long origLogSeqNum) throws IOException {
Preconditions.checkArgument(walEdit != null && !walEdit.isEmpty(),
"WALEdit is null or empty!");
Preconditions.checkArgument(!walEdit.isReplay() || origLogSeqNum != SequenceId.NO_SEQUENCE_ID,
"Invalid replay sequence Id for replay WALEdit!");
// Using default cluster id, as this can only happen in the originating cluster.
// A slave cluster receives the final value (not the delta) as a Put. We use HLogKey
// here instead of WALKeyImpl directly to support legacy coprocessors.
WALKeyImpl walKey = walEdit.isReplay()?
new WALKeyImpl(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), SequenceId.NO_SEQUENCE_ID, now, clusterIds,
nonceGroup, nonce, mvcc) :
new WALKeyImpl(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), SequenceId.NO_SEQUENCE_ID, now, clusterIds,
nonceGroup, nonce, mvcc, this.getReplicationScope());
if (walEdit.isReplay()) {
walKey.setOrigLogSeqNum(origLogSeqNum);
}
//don't call the coproc hook for writes to the WAL caused by
//system lifecycle events like flushes or compactions
if (this.coprocessorHost != null && !walEdit.isMetaEdit()) {
this.coprocessorHost.preWALAppend(walKey, walEdit);
}
WriteEntry writeEntry = null;
try {
long txid = this.wal.appendData(this.getRegionInfo(), walKey, walEdit);
// Call sync on our edit.
if (txid != 0) {
sync(txid, durability);
}
writeEntry = walKey.getWriteEntry();
} catch (IOException ioe) {
if (walKey != null && walKey.getWriteEntry() != null) {
mvcc.complete(walKey.getWriteEntry());
}
throw ioe;
}
return writeEntry;
}
@Override
public long appendData(RegionInfo info, WALKeyImpl key, WALEdit edits) throws IOException {
return append(info, key, edits, true);
}
@Override
protected long append(RegionInfo hri, WALKeyImpl key, WALEdit edits, boolean inMemstore)
throws IOException {
if (markerEditOnly() && !edits.isMetaEdit()) {
throw new IOException("WAL is closing, only marker edit is allowed");
}
long txid = stampSequenceIdAndPublishToRingBuffer(hri, key, edits, inMemstore,
waitingConsumePayloads);
if (shouldScheduleConsumer()) {
consumeExecutor.execute(consumer);
}
return txid;
}
protected final long stampSequenceIdAndPublishToRingBuffer(RegionInfo hri, WALKeyImpl key,
WALEdit edits, boolean inMemstore, RingBuffer<RingBufferTruck> ringBuffer)
throws IOException {
if (this.closed) {
throw new IOException(
"Cannot append; log is closed, regionName = " + hri.getRegionNameAsString());
}
MutableLong txidHolder = new MutableLong();
MultiVersionConcurrencyControl.WriteEntry we = key.getMvcc().begin(() -> {
txidHolder.setValue(ringBuffer.next());//由ringBuffer生成序列号
});
//txid与上面序列号一致
long txid = txidHolder.longValue();
ServerCall<?> rpcCall = RpcServer.getCurrentCall().filter(c -> c instanceof ServerCall)
.filter(c -> c.getCellScanner() != null).map(c -> (ServerCall) c).orElse(null);
try (TraceScope scope = TraceUtil.createTrace(implClassName + ".append")) {
//创建FSWALEntry实例
FSWALEntry entry = new FSWALEntry(txid, key, edits, hri, inMemstore, rpcCall);
entry.stampRegionSequenceId(we);
ringBuffer.get(txid).load(entry); //写入disruptor队列
} finally {
ringBuffer.publish(txid); //写入disruptor队列
}
return txid;
}
上文说的KeyValue实际是封装到了WALEdit
中。WALEdit
中有ArrayList<Cell> cells = null
,而KeyValue就是Cell的实现类
Sync部分源码
public void sync(long txid, boolean forceSync) throws IOException {
if (highestSyncedTxid.get() >= txid) {
return;
}
try (TraceScope scope = TraceUtil.createTrace("AsyncFSWAL.sync")) {
// here we do not use ring buffer sequence as txid
long sequence = waitingConsumePayloads.next();//获取序列号
SyncFuture future;
try {
//这里sync实际写入的 syncFuture,可以理解为 是一个刷盘标记
future = getSyncFuture(txid, forceSync);
RingBufferTruck truck = waitingConsumePayloads.get(sequence);
truck.load(future);
} finally {
waitingConsumePayloads.publish(sequence);
}
if (shouldScheduleConsumer()) {
consumeExecutor.execute(consumer);
}
//写入RingBuffer后需要阻塞等待,确保刷盘成功
blockOnSync(future);
}
}
protected final void blockOnSync(SyncFuture syncFuture) throws IOException {
// Now we have published the ringbuffer, halt the current thread until we get an answer back.
try {
if (syncFuture != null) {
if (closed) {
throw new IOException("WAL has been closed");
} else {
//RingBufferEventHandler消费完sync的消息后会唤醒该线程
syncFuture.get(walSyncTimeoutNs);
}
}
} catch (TimeoutIOException tioe) {
// SyncFuture reuse by thread, if TimeoutIOException happens, ringbuffer
// still refer to it, so if this thread use it next time may get a wrong
// result.
this.cachedSyncFutures.remove();
throw tioe;
} catch (InterruptedException ie) {
LOG.warn("Interrupted", ie);
throw convertInterruptedExceptionToIOException(ie);
} catch (ExecutionException e) {
throw ensureIOException(e.getCause());
}
}
FSHLog.RingBufferEventHandler#onEvent的核心代码:
@Override
// We can set endOfBatch in the below method if at end of our this.syncFutures array
public void onEvent(final RingBufferTruck truck, final long sequence, boolean endOfBatch)
throws Exception {
try {
if (truck.type() == RingBufferTruck.Type.SYNC) {
this.syncFutures[this.syncFuturesCount.getAndIncrement()] = truck.unloadSync();
// 收集一批syncFuture任务,为了提高效率。不过批次不宜太大,否则Region Server RPC服务线程阻塞在SyncFuture.get()上的时间就越长
if (this.syncFuturesCount.get() == this.syncFutures.length) {
endOfBatch = true;
}
} else if (truck.type() == RingBufferTruck.Type.APPEND) {
FSWALEntry entry = truck.unloadAppend();
try {
if (this.exception != null) {
// Return to keep processing events coming off the ringbuffer
return;
}
append(entry);
} catch (Exception e) {
// Failed append. Record the exception.
this.exception = e;
cleanupOutstandingSyncsOnException(sequence,
this.exception instanceof DamagedWALException ? this.exception
: new DamagedWALException("On sync", this.exception));
// Return to keep processing events coming off the ringbuffer
return;
} finally {
entry.release();
}
} else {
// What is this if not an append or sync. Fail all up to this!!!
cleanupOutstandingSyncsOnException(sequence,
new IllegalStateException("Neither append nor sync"));
// Return to keep processing.
return;
}
if (this.exception == null) {
if (!endOfBatch || this.syncFuturesCount.get() <= 0) {
return;
}
//轮询从syncRunners中拿一个线程
this.syncRunnerIndex = (this.syncRunnerIndex + 1) % this.syncRunners.length;
try {
//将要执行的任务添加到syncRunner的阻塞队列
this.syncRunners[this.syncRunnerIndex].offer(sequence, this.syncFutures,
this.syncFuturesCount.get());
} catch (Exception e) {
// Should NEVER get here.
requestLogRoll(ERROR);
this.exception = new DamagedWALException("Failed offering sync", e);
}
}
// We may have picked up an exception above trying to offer sync
if (this.exception != null) {
cleanupOutstandingSyncsOnException(sequence, this.exception instanceof DamagedWALException
? this.exception : new DamagedWALException("On sync", this.exception));
}
attainSafePoint(sequence);
this.syncFuturesCount.set(0);
} catch (Throwable t) {
LOG.error("UNEXPECTED!!! syncFutures.length=" + this.syncFutures.length, t);
}
}
FSHLog.SyncRunner#run
的核心代码:
@Override
public void run() {
long currentSequence;
while (!isInterrupted()) {
int syncCount = 0;
try {
SyncFuture sf;
while (true) {
takeSyncFuture = null;
//从syncFutures队列中取任务执行
takeSyncFuture = this.syncFutures.take();
// Make local copy.
sf = takeSyncFuture;
currentSequence = this.sequence;
long syncFutureSequence = sf.getTxid();
if (syncFutureSequence > currentSequence) {
throw new IllegalStateException("currentSequence=" + currentSequence
+ ", syncFutureSequence=" + syncFutureSequence);
}
// WAL日志消费线程一次会提交多个SyncFuture。对此,SyncRunner线程只会落实执行其中最新的SyncFuture(也就是Sequence ID最大的那个)所代表的Sync操作。而忽略之前的SyncFuture。
long currentHighestSyncedSequence = highestSyncedTxid.get();
if (currentSequence < currentHighestSyncedSequence) {
syncCount += releaseSyncFuture(sf, currentHighestSyncedSequence, null);
// Done with the 'take'. Go around again and do a new 'take'.
continue;
}
break;
}
long start = System.nanoTime();
Throwable lastException = null;
try {
TraceUtil.addTimelineAnnotation("syncing writer");
long unSyncedFlushSeq = highestUnsyncedTxid;
//它的实现在 ProtobufLogWriter.sync ,调用FSDataOutputStream的hsync或hflush方法刷盘
writer.sync(sf.isForceSync());
TraceUtil.addTimelineAnnotation("writer synced");
if (unSyncedFlushSeq > currentSequence) {
currentSequence = unSyncedFlushSeq;
}
currentSequence = updateHighestSyncedSequence(currentSequence);
} catch (IOException e) {
LOG.error("Error syncing, request close of WAL", e);
lastException = e;
} catch (Exception e) {
LOG.warn("UNEXPECTED", e);
lastException = e;
} finally {
//唤醒等待的线程
syncCount += releaseSyncFuture(takeSyncFuture, currentSequence, lastException);
// Can we release other syncs?
syncCount += releaseSyncFutures(currentSequence, lastException);
if (lastException != null) {
requestLogRoll(ERROR);
} else {
checkLogRoll();
}
}
postSync(System.nanoTime() - start, syncCount);
} catch (InterruptedException e) {
// Presume legit interrupt.
Thread.currentThread().interrupt();
} catch (Throwable t) {
LOG.warn("UNEXPECTED, continuing", t);
}
}
}
}