Dear readers, I apologize that this article is in English; I originally took these notes in English so that I could share them with my colleagues,
and I really have no time to translate them now. Thank you for your understanding!



In this article I'd like to talk about the caveats and how-to's of doing
performance tests with Berkeley DB when the data volume is huge.
For legal reasons I cannot publish the results of my tests without further
approval, so I decided not to do so.

I. Context

I need to insert 10 billion key/data pairs into a btree database. Each key
item is 768 bytes, there are no duplicate keys, and keys are inserted in increasing
order only; each data item varies between 0.5KB and 1KB. Thus each key/data pair
varies between 1.25KB and 2KB.


After the insertion, I search for a number of keys, some of which are in the db and
others not, to find the average time to insert a key/data pair and the average time
to find one. I use a very powerful 64-bit Linux machine, which has 4 processors,
each a 3.2GHz Intel Xeon, with 8GB of memory and 8TB of storage.


II. Things to Note

The especially huge data volume means a lot in this test; there are many things to note:

1. Integer overflow.
Loop variables will overflow if we insert so many key/data pairs in one loop, so
split the job into pieces, making sure each piece won't overflow a 32-bit integer.
If you use signed loop variables, they can go negative before reaching the limit
value and thus fall into an endless loop. In such a use case, we should be very
careful about every integer variable that may overflow.

Another example: if 32-bit integers are used as index keys, they will also overflow,
causing fewer key/data pairs to be inserted than assumed, because later key/data pairs
whose keys are already present in the db will overwrite previous ones.
So we must use DB_SEQUENCE to generate 64-bit sequential keys, as sketched below.
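
The following is a minimal sketch of obtaining 64-bit keys from a DB_SEQUENCE, assuming an
already-opened DB handle 'dbp'; the sequence name "key_seq" is invented for the example, and
real code would keep the sequence handle open across calls instead of reopening it per key:

    #include <string.h>
    #include <db.h>

    /* Sketch: obtain the next 64-bit sequential key from a DB_SEQUENCE.
     * Assumes 'dbp' is an already-opened DB handle; the sequence name
     * "key_seq" is made up for this example. Returns 0 on success. */
    static int next_seq_key(DB *dbp, db_seq_t *next_key)
    {
        DB_SEQUENCE *seq;
        DBT name;
        int ret;

        if ((ret = db_sequence_create(&seq, dbp, 0)) != 0)
            return ret;
        seq->initial_value(seq, 1);

        memset(&name, 0, sizeof(name));
        name.data = "key_seq";
        name.size = (u_int32_t)strlen("key_seq");

        /* db_seq_t is a 64-bit integer, so the key counter cannot overflow here. */
        if ((ret = seq->open(seq, NULL, &name, DB_CREATE)) == 0)
            ret = seq->get(seq, NULL, 1, next_key, 0);

        seq->close(seq, 0);
        return ret;
    }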

2. Random integer generator (the rand() C function) loses randomness.
RAND_MAX is only 64K, and when more than 32K random integers are generated,
the randomness fades; I observed some non-randomness in my test code.
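
If a wider range of random integers is needed, a simple 64-bit generator can be used in
place of rand(). The following is a sketch of the public-domain splitmix64 algorithm;
the seed value is arbitrary and not taken from the original test code:

    #include <stdint.h>

    /* splitmix64: a simple 64-bit pseudo-random generator, used here instead
     * of rand() so the output range is not limited to RAND_MAX. */
    static uint64_t rng_state = 0x0123456789ABCDEFULL;  /* arbitrary seed */

    static uint64_t rand64(void)
    {
        uint64_t z = (rng_state += 0x9E3779B97F4A7C15ULL);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }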

3. Huge key/data pairs.

Context: Each key/data pair can be as large as 2KB, and there can be millions of pages.
Problem:
A lot of overflow pages will be generated if the default page size is used.
Solution:
Set the page size to 64KB, so that each internal node can hold the largest possible
number of keys; this minimizes the internal space waste, the number of extra
internal-node reads during a search, and the number of overflow pages (see the sketch
after this item). Concurrency is not harmed, since each key/data pair is so big that
even a 64KB page holds relatively few pairs.

Problem:
Insufficient stack space.
Solution:
Do not allocate huge buffers (more than a few MB) on the stack; allocate them on the
heap, otherwise the stack may not be big enough.
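
As a sketch of both points, using the Berkeley DB C API: the page size must be set before
the database is opened, and the large working buffer lives on the heap. The file name
"huge.db" and the 16MB buffer size are my own assumptions for illustration:

    #include <errno.h>
    #include <stdlib.h>
    #include <db.h>

    /* Sketch: open a btree with 64KB pages and a heap-allocated working buffer. */
    static int open_big_btree(DB_ENV *env, DB **dbpp, char **bufp)
    {
        DB *dbp;
        int ret;

        if ((ret = db_create(&dbp, env, 0)) != 0)
            return ret;

        /* 64KB pages: more keys per internal node, fewer overflow pages.
         * Must be set before DB->open(). */
        if ((ret = dbp->set_pagesize(dbp, 64 * 1024)) != 0 ||
            (ret = dbp->open(dbp, NULL, "huge.db", NULL,
                             DB_BTREE, DB_CREATE, 0644)) != 0) {
            dbp->close(dbp, 0);
            return ret;
        }

        /* Large working buffer on the heap, not on the stack. */
        if ((*bufp = malloc(16 * 1024 * 1024)) == NULL) {
            dbp->close(dbp, 0);
            return ENOMEM;
        }
        *dbpp = dbp;
        return 0;
    }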

4. Berkeley DB configuration
a. Do not use logs; otherwise we would have to store another copy of the dataset,
  which is too much in this case, as we are inserting 10 billion records, each of
  which can be 0.75KB to 2KB in size (see the configuration sketch after this list).
b. We need a huge cache; otherwise the internal nodes won't fit into the cache and
  btree searches would be too slow.
c. When the cache size is set to 6GB on seneca (the test machine) and DB_PRIVATE is
  used, insertion is very slow and CPU usage is always less than 10%, because virtual
  memory (backed by disk) is used. When switched to file mapping, the program is much
  faster and CPU usage can be over 70%. (A later observation shows that CPU usage
  falls back to less than 10% after several million key/data pairs are inserted;
  only when key/data pairs are inserted in key order does the insertion stay fast.
  This is practical though, since we can always find sequentially increasing keys to
  use, and we can rely on secondary databases to look records up by their various
  real keys.)
d. We may need to start multiple processes and have them insert simultaneously. We may
  also need to split the cache into multiple files on platforms that limit the maximum
  file size.
e. Use a partitioned database, so that each database file is not so huge. If multiple
  processes insert at the same time, each process inserting into one db file,
  concurrency can be improved a lot. Here, though, I only used one process.
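
To make points a-c concrete, here is a minimal sketch of an environment configured along
these lines; the home directory argument, the choice of 4 cache region files, and the exact
flag set are my assumptions, not the original test configuration:

    #include <db.h>

    /* Sketch: a bulk-load environment with a 6GB cache split into 4 region
     * files, no transaction logging, and file-backed (not DB_PRIVATE) regions. */
    static DB_ENV *open_bulk_env(const char *home)
    {
        DB_ENV *env;
        int ret;

        if ((ret = db_env_create(&env, 0)) != 0)
            return NULL;

        /* 6GB cache, split into 4 region files (gbytes, bytes, ncache). */
        if ((ret = env->set_cachesize(env, 6, 0, 4)) != 0)
            goto err;

        /* No DB_INIT_TXN / DB_INIT_LOG: skip write-ahead logging entirely.
         * No DB_PRIVATE: regions are backed by files the OS can map. */
        if ((ret = env->open(env, home,
                             DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK, 0)) != 0)
            goto err;

        return env;
    err:
        env->close(env, 0);
        return NULL;
    }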

5. When changed to use ordered (increasing) keys, the insertion is much faster,
  as expected, because the search step becomes a very cheap operation in this case.
  With such a huge dataset, always use predictable, sequentially increasing keys,
  like sequence numbers (1, 2, 3, ...). For example, suppose we want to insert many
  Person objects, each of which has an 'ID' field, but we cannot get the Person
  objects ordered by 'ID'. Then we should use a sequence number as the key of the
  primary db containing the Person objects, and create secondary databases, each of
  which uses one of the properties of Person as its key. The 'ID' secondary db is
  much smaller because the 'data' item of a key/data pair in it is only a sequence
  number, so it is much easier and faster to find a Person object by its ID, while
  insertion stays very fast at the same time (see the sketch below).
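
A sketch of this primary/secondary arrangement with the Berkeley DB C API follows; the
Person struct layout and database handle names are invented for illustration:

    #include <string.h>
    #include <db.h>

    /* Invented Person record layout, stored as the primary 'data' item;
     * the primary key is the 64-bit sequence number. */
    struct person {
        long long id;       /* the real 'ID' field */
        char name[64];
        /* ... other fields ... */
    };

    /* Secondary-key callback: extract 'ID' from the primary data item. */
    static int get_person_id(DB *secondary, const DBT *pkey,
                             const DBT *pdata, DBT *skey)
    {
        const struct person *p = pdata->data;

        (void)secondary;
        (void)pkey;
        memset(skey, 0, sizeof(*skey));
        skey->data = (void *)&p->id;
        skey->size = sizeof(p->id);
        return 0;
    }

    /* After opening both handles, tie the 'ID' index to the primary:
     *     primary_db->associate(primary_db, NULL, id_db, get_person_id, 0);
     * Lookups by ID then go through id_db->get() or id_db->pget(). */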

6. When the search phase begins, the pages are gradually loaded into the cache, but
  since the keys are chosen randomly, the search is not fast: a lot of internal
  btree pages are constantly swapped in and out of the cache.