Berkeley DB Source Code Analysis (1) --- Code Characteristics and Cursor Implementation


dazhao_cn 2013-06-20 14:42:12 Category: Berkeley DB

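© Copyright belongs to the author. This is an original work by 51CTO blog author dazhao_cn; please contact the author for permission to repost, otherwise legal liability will be pursued.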

I. General Notes
1. A cursor is used to access a db internally. The cursor connects the lock, txn, logging, and AM subsystems.
To get a page, first create a cursor if you don't have one, then call __db_lget to lock the page, then call __memp_fget to get the page from the cache;
you then have the page's pointer. After use, call __TLPUT or __LPUT to release the lock, then call __memp_fput to release the page.
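
The lock/get/use/unlock/put sequence above can be sketched with a toy model. This is only an illustration: toy_env and every toy_* function are hypothetical stand-ins invented here, not Berkeley DB types or functions, and the real __db_lget and __memp_fget take different arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins only: none of these are Berkeley DB types or functions. */
enum { NPAGES = 4 };

typedef struct {
    int locked[NPAGES];   /* 1 while a lock is held on the page */
    int pinned[NPAGES];   /* pin count of the page in the toy cache */
    int data[NPAGES];     /* page contents, one int per page */
} toy_env;

/* Step 1: lock the page (the role the text gives to __db_lget). */
int toy_lock_get(toy_env *env, int pgno) {
    if (pgno < 0 || pgno >= NPAGES || env->locked[pgno])
        return -1;
    env->locked[pgno] = 1;
    return 0;
}

/* Step 2: pin the page in the cache and return its pointer
 * (the role the text gives to __memp_fget). */
int *toy_page_get(toy_env *env, int pgno) {
    assert(env->locked[pgno]);    /* the lock must already be held */
    env->pinned[pgno]++;
    return &env->data[pgno];
}

/* Step 3: release the lock (the role of __TLPUT / __LPUT). */
void toy_lock_put(toy_env *env, int pgno) {
    env->locked[pgno] = 0;
}

/* Step 4: unpin the page (the role of __memp_fput). */
void toy_page_put(toy_env *env, int pgno) {
    assert(env->pinned[pgno] > 0);
    env->pinned[pgno]--;
}

/* The whole round trip, in the order described in the text. */
int toy_read_page(toy_env *env, int pgno, int *out) {
    int *page;

    if (toy_lock_get(env, pgno) != 0)
        return -1;
    page = toy_page_get(env, pgno);
    *out = *page;                 /* use the page while locked and pinned */
    toy_lock_put(env, pgno);
    toy_page_put(env, pgno);
    return 0;
}
```

After toy_read_page returns, the page is neither locked nor pinned, which is the invariant the real call sequence is meant to preserve.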

2. How to lock/unlock a page and get/put a page from mpool?
See __bam_read_lock.

3. Cursor Adjustment
Cursor adjustments are logged if they are for subtransactions. This is
because it's possible for a subtransaction to adjust cursors which will
still be active after the subtransaction aborts, and so which must be
restored to their previous locations. Cursors that can be both affected
by our cursor adjustments and active after our transaction aborts can
only be found in our parent transaction -- cursors in other transactions,
including other child transactions of our parent, must have conflicting
locker IDs, and so cannot be affected by adjustments in this transaction.

When a key/data pair is deleted, there can be other cursors pointing to it, so
we don't physically delete it immediately. Instead we mark the key/data pair deleted, and mark
all cursors pointing at this k/d with C_DELETED (this is called a logical
delete). When the last cursor pointing at the 'deleted' k/d is closed, the key/data pair is physically deleted.
This is the only situation in which physical deletes happen. For example, when the last cursor merely moves away from
a k/d marked BI_DELETED, that k/d is not deleted.

When closing a cursor, if we find it has C_DELETED set, we walk all cursors of the
same database and mark them with C_DELETED. Opd cursors, if any, are checked
and marked in the same way. This is done by __bam_ca_delete. This function
also tells us how many other cursors are sitting on this key/data pair. If there
are still other cursors sitting on the k/d, __bamc_close is done. Otherwise we
physically delete the k/d, or even the opd btree/recno tree and its on-page
k/d items.
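
The logical-then-physical delete rule can be sketched as a small simulation. Everything here (toy_db, toy_item, toy_cursor, and the toy_* functions) is a hypothetical model written for this note; toy_ca_delete only imitates the role described for __bam_ca_delete and does not reproduce its real signature or behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; these are not Berkeley DB structures. */
enum { MAX_ITEMS = 4, MAX_CURSORS = 8, NO_ITEM = -1 };

typedef struct {
    int present;     /* 1 while the item physically exists on the page */
    int deleted;     /* 1 once the item has been logically deleted */
} toy_item;

typedef struct {
    int open;        /* 1 while the cursor is open */
    int item;        /* index of the item the cursor sits on, or NO_ITEM */
    int c_deleted;   /* models the C_DELETED cursor flag */
} toy_cursor;

typedef struct {
    toy_item items[MAX_ITEMS];
    toy_cursor cursors[MAX_CURSORS];
} toy_db;

/* Mark every other open cursor sitting on `item` as deleted and
 * return how many such cursors were found. */
int toy_ca_delete(toy_db *db, int self, int item) {
    int i, count = 0;

    for (i = 0; i < MAX_CURSORS; i++) {
        toy_cursor *c = &db->cursors[i];
        if (i == self || !c->open || c->item != item)
            continue;
        c->c_deleted = 1;
        count++;
    }
    return count;
}

/* Logical delete through cursor `self`: flag the item and the cursors. */
void toy_cursor_del(toy_db *db, int self) {
    toy_cursor *c = &db->cursors[self];

    assert(c->open && c->item != NO_ITEM);
    db->items[c->item].deleted = 1;
    c->c_deleted = 1;
    (void)toy_ca_delete(db, self, c->item);
}

/* Close: physically delete only if no other cursor sits on the item. */
void toy_cursor_close(toy_db *db, int self) {
    toy_cursor *c = &db->cursors[self];

    assert(c->open);
    if (c->c_deleted && c->item != NO_ITEM &&
        toy_ca_delete(db, self, c->item) == 0)
        db->items[c->item].present = 0;   /* the physical delete */
    c->open = 0;
    c->item = NO_ITEM;
    c->c_deleted = 0;
}
```

With two cursors on one item, closing the first after a delete leaves the item in place; only closing the second removes it.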

When deleting a k/d on opd pages, we don't lock the opd tree; we only lock the
page containing the on-page key/opd-root-pgno key/data pair.

Cursors from another process cannot be sitting on the
page where the key/data pair is logically deleted, because of the transactional or
handle locking. So it's sufficient to mark or adjust only cursors in the current
process when deleting/inserting a key/data pair.

Suppose a cursor C has made a db insert/delete, so that we want to adjust the cursors sitting
on the modified page. For each type of cursor adjustment operation, C calls __db_walk_cursors to
iterate over all cursors of all DB handles opened on the same database as C.db, in C.env and in the current process,
and registers a callback F with __db_walk_cursors to be called against each
cursor. There is one F for each type of cursor adjustment op. If the adjustment
op modifies a page, log ops are done, and there are undo ops to
be called when aborting a child txn. This is the only time such adjustment ops are
meaningful: we want to restore cursors if a child txn aborts. In recovery
code no cursor adjustment ops are recovered, because we don't need to restore
cursor state; we only want to restore data consistently.
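
The walk-with-callback pattern can be sketched as follows. This is a simplified assumption-laden model: sim_cursor, sim_walk_cursors, and sim_adjust_index are names invented here, the cursors live in one flat array rather than on per-handle lists, and the real __db_walk_cursors has a different signature and also handles locking and logging.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor: just a page number and an index on that page. */
typedef struct {
    int pgno;
    int indx;
} sim_cursor;

/* Callback type: returns 1 if it adjusted the cursor, 0 otherwise. */
typedef int (*sim_adjust_fn)(sim_cursor *c, int pgno, int indx, void *arg);

/* Apply `fn` to every cursor except `self` (the cursor that made the
 * change) and return how many cursors were adjusted. */
int sim_walk_cursors(sim_cursor *cursors, size_t n, const sim_cursor *self,
                     sim_adjust_fn fn, int pgno, int indx, void *arg) {
    size_t i;
    int found = 0;

    assert(fn != NULL);
    for (i = 0; i < n; i++) {
        if (&cursors[i] == self)
            continue;
        found += fn(&cursors[i], pgno, indx, arg);
    }
    return found;
}

/* One callback per adjustment type. This one handles an insert or delete
 * at (pgno, indx): cursors on that page at or after indx shift by *arg. */
int sim_adjust_index(sim_cursor *c, int pgno, int indx, void *arg) {
    int delta = *(int *)arg;

    if (c->pgno != pgno || c->indx < indx)
        return 0;
    c->indx += delta;
    return 1;
}
```

Keeping the iteration in one walker and the per-operation logic in callbacks means a new adjustment type only needs a new callback, which matches the "one F for each type" description above.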

TODO: my idea: transfer ownership of the data item marked deleted when a
cursor goes away or closes, until a cursor can't find another cursor on the
item--then it will physically delete the data item.

See comments in __bamc_close for more details.

4. Code file naming conventions in btree/hash/queue
AM_auto.c: contains generated log read/log functions to log changes made by
this AM.
AM_autop.c: contains generated log print functions.
AM. contains log record definitions, used by dist/gen_rec.awk to generate
log read/log/print functions.

AM_compact.c: contains functions to compact db file of this AM type.
AM_conv.c: contains functions to do AM-specific pgin/pgout processing, all of
which byte-swap pages for this AM.
AM_curadj.c: contains functions to do cursor adjustments. See #3 above for
details.

AM_method.c: contains simple functions and all functions to init db handle
function pointers.
AM_open.c: contains functions to open databases of AM type.
AM_rec.c: contains functions to do recovery for each type of logs.
AM_stat.c: contains functions to accumulate/print stats of this AM.
AM_upgrade.c: contains functions to upgrade db files of this AM to newer
versions.
AM_verify.c: contains functions to verify this AM db file.

5. Function naming conventions
1. __AMc_ACTION for cursor manipulations
Examples: __bamc_init, __bamc_close, __bamc_destroy, and __bamc_refresh
(a ***_refresh function resets the structure as if it were newly created, so
that it can be reused). The same pattern covers public methods such as
__bamc_del, __bamc_get, __bamc_put, __bamc_cmp, and __bamc_count.
These methods are assigned to the DBC handle's function pointers, so that calls
through those pointers are routed to the AM-specific ops.

6. Off Page Duplicates (opd)
Dup data items share the same key item, like (1, 2), (1, 3), and (1, 4), which
have the same key 1 and different data items 2, 3, 4. In the btree and hash AMs,
2, 3, 4 are normally stored in leaf pages (btree) just like other key/data
pairs, but if the dup data items consume more space than the overflow size, they
are put into an off-page duplicate tree, which is a btree, and the on-page data
item stores the root page number of the opd tree.

An opd tree can also be a recno tree, when DB_DUPSORT and the dup compare
function are not specified.

We don't acquire any lock to access an opd tree/page, because we always lock
the page of the on-page key/opd-pgno key/data pair before accessing the opd page.

In an opd btree, there are no "data" items, only "key" items, even on the leaf
pages of the opd tree. The reason is that all of its "key"
items are dup data items of the same "key" in that db.

An overflow data item is stored on a chain of pages, leaving a B_OVERFLOW
data item on the leaf page, OR ON THE OPD TREE'S LEAF PAGE. That is to say,
a data item in a set of dup key/data pairs is always allowed to be an overflow item.

Duplicate key/data pairs storage:
1. DB_DUP is set, DB_DUPSORT not set
When the dup data items don't consume more than a quarter of the page space, they
are put on btree leaf pages. Otherwise, they are put into an
off-page duplicate recno tree. I think it would be better to put them on a chain of opd pages,
unsorted, because we never randomly access a dup data item (recall the flags DB_NEXT,
DB_NEXT_DUP, DB_NEXT_NODUP, and the PREV versions for DBC->get). (TODO: try a
change.)

2. both DB_DUP and DB_DUPSORT set
When the dup data items don't consume more than a quarter of the page space, they
are put on btree leaf pages and sorted. Otherwise, they are put into an opd
btree.
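
The two storage cases above amount to a small decision rule, sketched below. The enum, the function name, and the use of the raw page size as "page space" are simplifying assumptions for illustration; the real code's threshold computation may account for page headers and other overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for where a set of dup data items ends up. */
typedef enum {
    TOY_ON_PAGE_UNSORTED,   /* on btree leaf pages, insertion order */
    TOY_ON_PAGE_SORTED,     /* on btree leaf pages, sorted */
    TOY_OPD_RECNO,          /* off-page duplicate recno tree */
    TOY_OPD_BTREE           /* off-page duplicate btree */
} toy_dup_storage;

/* dup_bytes: total space used by the dup data items of one key.
 * page_size: stands in for "page space" (an assumption of this sketch).
 * dupsort:   nonzero when DB_DUPSORT is set in addition to DB_DUP. */
toy_dup_storage toy_choose_dup_storage(size_t dup_bytes, size_t page_size,
                                       int dupsort) {
    assert(page_size > 0);
    if (dup_bytes <= page_size / 4)       /* at most a quarter of the page */
        return dupsort ? TOY_ON_PAGE_SORTED : TOY_ON_PAGE_UNSORTED;
    return dupsort ? TOY_OPD_BTREE : TOY_OPD_RECNO;
}
```

For a 4096-byte page, the switch to an opd tree happens in this sketch once the dup set grows past 1024 bytes.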