Analyzing crash recovery
If a failure occurs while the instance is running and DB2(R)
cannot complete or commit all of the outstanding units of work, the database is
left in an inconsistent and unusable state. Crash recovery is the process by
which the database is moved back to a consistent and usable state.
When such a situation occurs, you will see entries in the db2diag.log similar
to the following when a user tries to connect to or activate the database:
2005-02-04-11.03.01.920391-480 I3693708C368 LEVEL: Warning
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, base sys utilities, sqledint, probe:30
MESSAGE : Crash Recovery is needed.
2005-02-04-11.03.05.223292-480 I3694077C431 LEVEL: Warning
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:410
MESSAGE : Crash recovery started. LowtranLSN 00000000E0D7D193 MinbuffLSN
00000000E0D7CC4D
LSN stands for Log Sequence Number. Database manager identifies and tracks
transactions using LSN numbers. Note that an application-level "transaction" may
be split into many LSNs internally, for example, delete of all rows from a table
in a single unit of work. The entire deletion is one transaction but internally
there will be one LSN logged for every deleted row.
The LowtranLSN and MinbuffLSN values supplied here are
helpful if a problem occurs during the recovery, since they give insight into
the log sequence numbers that crash recovery will be processing.
The LowtranLSN value is the log sequence number of the first log
record written by the oldest uncommitted transaction. The age of the transaction
here is based on the time transactions write their first log records, not the
unit of work (UOW) start time reported by database monitor.
DB2 uses write-ahead logging; this means that every transaction is first
written to log files and then to the database. Minbuff LSN indicates the oldest
LSN that has been committed but not flushed to the database.
Crash recovery will start using the lowest of MINBUFLSN and LOWTRANLSN (in
some cases they are the same).
If you are using log retention logging (as opposed to circular logging), you
can use the db2flsn tool to match a log sequence number to a log file.
Recovery is comprised of several "phases". If crash recovery fails, it is
helpful to know which phase it was in at the time. The phases are:
- Redo or forward phase: During this phase all available transactions,
committed or uncommitted are replayed.
- Undo Phase - During this phase, uncommitted transactions are rolled back.
This is the second phase. Here, DB2 will rollback any transactions that were
uncommitted at the time of the crash. This can potentially be a lengthy amount
of time if, for example, an application had done a large amount of work without
committing before it crashed. At the end of redo phase, DB2 compiles a list of
all the transactions for which there were no commit records. This becomes the
list of unresolved or uncommitted transactions. The logger then writes
compensation or undo log records in order to roll back these transactions.
- Sideward phase (Multi-partition only) - This is the last phase. If any
synchronization needs to be done between the current partition and the catalog
partition, it will be done during this phase.
Note that if you are concerned that recovery seems to be taking a long time,
you can monitor its status using the LIST UTILITIES command.
If the recovery completes successfully, you will see an entry like this:
2005-02-04-11.03.05.605091-480 I3696386C497 LEVEL: Warning
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:3170
MESSAGE : Crash recovery completed. Next LSN is 000000750CF31816
Whereas if the recovery fails, a return code would be supplied as
follows:
2005-02-04-11.03.05.605091-480 I3696386C497 LEVEL: Error
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:640
DATA #1 : String, 140 bytes
Crash recovery failed with rc -2028994521
LowtranLSN 00000000E0D7CC4D MinbuffLSN 00000000E0D7CC4D
Recovery started on log file S0000153.LOG.
These crash recovery return codes are not documented, but quite often you can
uncover clues to the cause of the problem by looking at earlier db2diag.log
entries. For example, there may be indications of log corruption, bad log
records, or a bad page on disk. An example of this type of message in the
db2diag.log is as follows:
2005-02-24-12.05.42.265000+480 I2548H467 LEVEL: Error
PID : 2048 TID : 2016 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050224040537
FUNCTION: DB2 UDB, buffer pool services, sqlbcres, probe:1100
RETCODE : ZRC=0x86020019=-2046689255=SQLB_CSUM "Bad Page, Checksum Error"
DIA8426C A invalid page checksum was found for page "".
2005-02-24-12.05.42.281000+480 I3017H374 LEVEL: Error
PID : 2048 TID : 2016 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050224040537
FUNCTION: DB2 UDB, buffer pool services, sqlbcres, probe:1100
MESSAGE : Obj={pool:2;obj:12;type:0} State=x27
The second entry in this example indicates that object 12 in tablespace 2 is
corrupted. To confirm whether or not the table is corrupted, you could use the
db2dart tool.
The benefit of analyzing these messages is that they may provide DBAs with
clues as to what actions need to be taken next in order to get the database back
up and running. For example, you may be able to:
- Mark the corrupted objects as being "bad" using the db2dart tool and then in
skip these objects during a subsequent attempt at crash recovery
- Restore the database from an earlier backup
Note that while many of these actions will allow you to get up and running
quickly, they may also reduce the possibility that DB2 Support will be able to
do complete cause analysis thereafter. Thus this should be taken into
consideration when you are deciding whether to take these actions prior to
calling DB2 Suppport.