LevelDB作为一个数据存储引擎,存储的数据大部分是在磁盘上的,而磁盘上数据的表现形式就是文件,也就是本章要介绍的SST文件,SSTable 是 Sorted String Table 的简称,SST的生成时机有两个,一是内存中的Immutable Memtable Flush到磁盘上会生成SST文件,二是在Compaction的时候相邻层级的SST文件合并生成新的SST文件,而这两者都是通过TableBuilder来生成SST文件的,本片将会介绍SST文件的生成过程以及文件结构

 

Write performance

The "fill" benchmarks create a brand new database, in either sequential, or random order. The "fillsync" benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The "overwrite" benchmark does random writes that update existing keys in the database.

fillseq      :       1.765 micros/op;   62.7 MB/s
fillsync : 268.409 micros/op; 0.4 MB/s (10000 ops)
fillrandom : 2.460 micros/op; 45.0 MB/s
overwrite : 2.380 micros/op; 46.5 MB/s


Each "op" above corresponds to a write of a single key/value pair. I.e., a random write benchmark goes at approximately 400,000 writes per second.

Each "fillsync" operation costs much less (0.3 millisecond) than a disk seek (typically 10 milliseconds). We suspect that this is because the hard disk itself is buffering the update in its memory and responding before the data has been written to the platter. This may or may not be safe based on whether or not the hard disk has enough power to save its memory in the event of a power failure.

Read performance

We list the performance of reading sequentially in both the forward and reverse direction, and also the performance of a random lookup. Note that the database created by the benchmark is quite small. Therefore the report characterizes the performance of leveldb when the working set fits in memory. The cost of reading a piece of data that is not present in the operating system buffer cache will be dominated by the one or two disk seeks needed to fetch the data from disk. Write performance will be mostly unaffected by whether or not the working set fits in memory.

readrandom  : 16.677 micros/op;  (approximately 60,000 reads per second)
readseq : 0.476 micros/op; 232.3 MB/s
readreverse : 0.724 micros/op; 152.9 MB/s


LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were done immediately after a lot of random writes. The results after compactions (which are usually triggered automatically) are better.

readrandom  : 11.602 micros/op;  (approximately 85,000 reads per second)
readseq : 0.423 micros/op; 261.8 MB/s
readreverse : 0.663 micros/op; 166.9 MB/s


Some of the high cost of reads comes from repeated decompression of blocks read from disk. If we supply enough cache to the leveldb so it can hold the uncompressed blocks in memory, the read performance improves again:

readrandom  : 9.775 micros/op;  (approximately 100,000 reads per second before compaction)
readrandom : 5.215 micros/op; (approximately 190,000 reads per second after compaction)