DBA生存警示:保护现场不要让事情更糟_java

张乐奕

云和恩墨副总经理 Oracle ACE 总监

ITPUB Oracle数据库管理版版主、Oracle高可用版版主、ACOUG联合创始人

今天收到一个发过来请求帮助的 case,Oracle 数据库无法启动,请求帮助恢复。仔细阅读了发过来的告警日志,这是一个典型的“事情越弄越糟”的案例。

作为一个专业的DBA,在遇到问题时,一定要思考:如何保护现场,不让事情变得更糟。这是基本要求,保护现场以使得其他人接手工作时,可以从原有状态开始。

以下就来根据告警日志,一条一条地回顾这位 DBA 是如何将数据库弄到完全启动不了的。

故障最开始是从1月11日的凌晨 3:30 开始出现,数据库在归档的时候,意外发现某个控制文件的头块全部被清零了,这可能是存储本身的问题,并非人为。

Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Master background archival failure: 227
Fri Jan 11 03:31:24 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:31:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 03:30:24 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'

接下来,数据库痛苦地挣扎了半小时,期间不停地报相同的 ORA-00227 错误。一直到凌晨 4:01,终于 CKPT 进程也发现无法更新控制文件头部,于是强势地将数据库直接关闭了。

Fri Jan 11 04:01:25 2013
Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc
Corrupt block relative dba: 0x00000001 (file 0, block 1)
Completely zero block found during control file header read
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
CKPT: terminating instance due to error 227
Fri Jan 11 04:01:25 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_pmon_2997.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Fri Jan 11 04:01:26 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_psp0_2999.trc:
ORA-00227: corrupt block detected in control file: (block , # blocks )
Instance terminated by CKPT, pid = 3007

接下来的5个小时,数据库静静地躺在机房里,没有人知道这个数据库已经挂掉了,一直到上午 DBA 来上班。他发现数据库无法访问,于是尝试重新启动数据库。

Fri Jan 11 09:15:51 2013
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.

自然数据库无法正常启动,连 mount 状态都无法进入,因为某个控制文件头部已经损坏了。告警日志的信息明确地说明了无法读取 control03.ctl 文件的头块,因此在尝试 mount 数据库的时候报了 ORA-00205 错误。

Fri Jan 11 09:15:56 2013
ALTER DATABASE   MOUNT
Fri Jan 11 09:15:56 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:15:59 2013
ORA-205 signalled during: ALTER DATABASE   MOUNT...
Fri Jan 11 09:19:31 2013
Starting ORACLE instance (normal)
Fri Jan 11 09:19:43 2013
alter database mount
Fri Jan 11 09:19:43 2013
ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
ORA-27047: unable to read the header block of file
Additional information: 2
Fri Jan 11 09:19:43 2013
ORA-205 signalled during: alter database mount

接下来,这位 DBA 开始反复地关闭实例,又启动实例。这样的操作一直持续了1个小时,一直到上午的 10:28,可以想象这是多么纠结的一个小时。

忠告: 除非十分特殊的恢复案例,否则反复起停数据库实例是于事无补的。

Shutting down instance: further logons disabled
Fri Jan 11 09:23:47 2013
Stopping background process CJQ0
……
Fri Jan 11 09:38:02 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 09:43:00 2013
Shutting down instance: further logons disabled
……
Fri Jan 11 09:43:58 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 09:55:34 2013
ALTER DATABASE CLOSE NORMAL
……
Fri Jan 11 09:56:55 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 10:28:10 2013
ALTER DATABASE CLOSE NORMAL

接下来 10:29 的再次重新启动数据库实例之前,DBA 终于意识到可能是控制文件出现了问题,因此修改了初始化参数,将报错的 control03.ctl 文件从初始化参数 control_files 中去掉了。这次数据库得以正常启动。

Fri Jan 11 10:29:20 2013
Starting ORACLE instance (normal)
……
control_files            = /data1/oradata/dpdata/control01.ctl, /data1/oradata/dpdata/control02.ctl
……
Fri Jan 11 10:29:37 2013
Completed: ALTER DATABASE OPEN

而DBA也迅速地作了一次备份控制文件的操作,但是正是这个操作引导到了后面噩梦一般的结果。

Fri Jan 11 10:36:14 2013
alter database backup controlfile to trace
Fri Jan 11 10:36:14 2013
Completed: alter database backup controlfile to trace

数据库又平稳地运行了一个上午,这种宁静到下午14:16的时候被打破,数据库开始报 ORA-600 错误,并且在 CKPT 进程作检查点的时候,报数据文件10和31的文件头部无法被正确读取。如果不是更深层次的原因,那么这可能仍然是跟凌晨时候控制文件意外损坏时候的故障一样,也许是存储子系统本身的问题。

Fri Jan 11 14:16:07 2013
Errors in file /oracle/admin/dpdata/udump/dpdata_ora_22240.trc:
ORA-00600: internal error code, arguments: [6002], [0], [0], [3], [0], [], [], []
Fri Jan 11 14:19:44 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 10 going offline due to error advancing checkpoint
ORA-01122: database file 10 failed verification check
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-01251: Unknown File Header Version read for file number 10
Fri Jan 11 14:19:59 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc:
ORA-01171: datafile 31 going offline due to error advancing checkpoint
ORA-01122: database file 31 failed verification check
ORA-01110: data file 31: '/data3/ts2_dpcis.dbf'
ORA-01251: Unknown File Header Version read for file number 31

紧接着,应用系统的某个 JOB 也由于数据文件无法访问,而开始报错。

Fri Jan 11 14:30:19 2013
Errors in file /oracle/admin/dpdata/bdump/dpdata_j001_12993.trc:
ORA-12012: error on auto execute of job 88
ORA-00376: file 10 cannot be read at this time
ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf'
ORA-06512: at "DECTR.P_MOVE_CONTS_SHIP", line 77
ORA-06512: at line 1

相同的报错一直持续了四十多分钟,之后 DBA 又再次介入了。但是 DBA 很奇怪地断然执行了 offline 这两个数据文件的操作,并在2分多钟之后,又尝试将两个数据文件再次 online 。而由于文件损坏,自然在 online 的时候遇到了 ORA-1122 错误,而无法成功 online 。

Fri Jan 11 15:16:23 2013
alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:16:23 2013
Completed: alter database datafile '/data3/ts2_dpcis.dbf' offline
Fri Jan 11 15:17:05 2013
alter database datafile  '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:17:05 2013
Completed: alter database datafile  '/data2/DECTR_HIS2.dbf'
offline
Fri Jan 11 15:19:41 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:19:41 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...
Fri Jan 11 15:21:10 2013
alter database datafile  '/data2/DECTR_HIS2.dbf' online
Fri Jan 11 15:21:10 2013
ORA-1122 signalled during: alter database datafile  '/data2/DECTR_HIS2.dbf' online...

这才仅仅是噩梦的开始,接下来的一切属于危险动作,请勿轻易模仿。

遇到 ORA-1122 错误以后,DBA 考虑了 9 秒钟,再次断然地关闭了数据库,并随之又重新启动。由于仅仅是用户表空间数据文件损坏,并且之前也已经被 offline 了,因此数据库实例毫无障碍地 OPEN 成功。

Fri Jan 11 15:21:19 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:21:19 2013
Stopping background process QMNC
Fri Jan 11 15:21:19 2013
Stopping background process CJQ0
Fri Jan 11 15:21:21 2013
Stopping background process MMNL
Fri Jan 11 15:21:22 2013
Stopping background process MMON
Fri Jan 11 15:21:23 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:22:59 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:23:13 2013
Completed: ALTER DATABASE OPEN

DBA再次尝试online数据文件的操作,同样的ORA-1122错误。

Fri Jan 11 15:23:31 2013
alter database datafile '/data3/ts2_dpcis.dbf' online
Fri Jan 11 15:23:31 2013
ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...

考虑了2分半钟之后,DBA 也许是想起上午的时候做过控制文件的备份,因此决定进行数据库恢复。重启数据库到 nomount 状态,并开始进行RECOVER DATABASE USING BACKUP CONTROLFILE,ORA-1507 错误的意思是告知如果要使用备份的控制文件进行数据库恢复,那么应该要先使用备份的控制文件将数据库启动到 mount 状态。

Fri Jan 11 15:25:05 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:25:05 2013
Stopping background process QMNC
Fri Jan 11 15:25:05 2013
Stopping background process CJQ0
Fri Jan 11 15:25:07 2013
Stopping background process MMNL
Fri Jan 11 15:25:08 2013
Stopping background process MMON
Fri Jan 11 15:25:09 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:26:32 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:26:46 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 15:26:46 2013
ORA-1507 signalled during: ALTER DATABASE RECOVER  database using backup controlfile    until cancel  ...

DBA 于是将数据库启动到 mount 状态,继续进行数据库恢复。这其中的几个ORA 错误都是正常的,ORA-279 提示需要一个归档文件来完成恢复,ORA-308提示打不开 1_87749_604491553.dbf 归档文件,根据前面的告警日志,可以知道实际上 87749 这个重做日志是当前日志,还没有归档,自然找不到。ORA-1547错误表示恢复已经完成,但是 OPEN RESETLOGS 的时候仍然要遇到错误。

Fri Jan 11 15:26:56 2013
alter database mount
Fri Jan 11 15:27:00 2013
Setting recovery target incarnation to 2
Fri Jan 11 15:27:00 2013
Successful mount of redo thread 1, with mount id 560899584
Fri Jan 11 15:27:00 2013
Database mounted in Exclusive Mode
Completed: alter database mount
Fri Jan 11 15:27:10 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Media Recovery Start
 parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Fri Jan 11 15:27:28 2013
Media Recovery Log /soft/db_arch/1_87749_604491553.dbf
Errors with log /soft/db_arch/1_87749_604491553.dbf
ORA-308 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
Fri Jan 11 15:27:28 2013
ALTER DATABASE RECOVER CANCEL 
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...

DBA忽略了这个错误,尝试将数据库打开,很显然会遇到O RA-1589 错误,之后又尝试用 NORESTLOGS 方式 OPEN 数据库,这也很显然会遇到 ORA-1588 错误。不完全恢复的数据库必须要以 RESETLOGS 方式打开。

Fri Jan 11 15:29:52 2013
alter database open
Fri Jan 11 15:29:52 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:30:11 2013
alter database open NORESETLOGS
Fri Jan 11 15:30:11 2013
ORA-1588 signalled during: alter database open NORESETLOGS

之后, DBA 作了一个艰难的决定,再次关闭并重启了数据库。又再次尝试相同的 OPEN 步骤。当然,Oracle 也给与了相同的报错。数据库仍然无法打开。至此,数据库无法提供服务已经1个多小时。

Fri Jan 11 15:30:42 2013
Shutting down instance: further logons disabled
Fri Jan 11 15:30:42 2013
Stopping background process CJQ0
Fri Jan 11 15:30:42 2013
Stopping background process MMNL
Fri Jan 11 15:30:43 2013
Stopping background process MMON
Fri Jan 11 15:30:44 2013
Shutting down instance (immediate)
……
Fri Jan 11 15:30:59 2013
Starting ORACLE instance (normal)
……
Fri Jan 11 15:31:08 2013
ALTER DATABASE OPEN
ORA-1589 signalled during: ALTER DATABASE OPEN...
Fri Jan 11 15:31:28 2013
alter database open NORESETLOGS
Fri Jan 11 15:31:28 2013
ORA-1588 signalled during: alter database open NORESETLOGS...
Fri Jan 11 15:31:41 2013
alter database open RESETLOGS
Fri Jan 11 15:31:41 2013
ORA-1122 signalled during: alter database open RESETLOGS...

再接下来,是一团混乱, DBA 多次重启数据库,尝试了多种恢复手段。offline数据文件,recover 数据文件,recover 数据库,online 数据文件,再recover,再 offline,再 open,但是一切尝试都是徒劳的。一直到晚上18:35,在数据库宕机4个多小时以后,开始求助我们帮助其恢复数据库。

Fri Jan 11 15:41:28 2013
alter database datafile '/data2/DECTR_HIS2.dbf' offline
Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline
Fri Jan 11 15:41:35 2013
alter database open
Fri Jan 11 15:41:35 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 15:42:20 2013
alter database  open resetlogs
Fri Jan 11 15:42:20 2013
ORA-1245 signalled during: alter database  open resetlogs...
Fri Jan 11 15:43:40 2013
ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'  
Fri Jan 11 15:43:40 2013
Media Recovery Start
Fri Jan 11 15:43:40 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile '/data3/ts2_dpcis.dbf'  ...
Fri Jan 11 15:46:09 2013
ALTER DATABASE RECOVER  datafile 10  
Fri Jan 11 15:46:09 2013
Media Recovery Start
Fri Jan 11 15:46:09 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 10  ...
……
Fri Jan 11 16:37:51 2013
ALTER DATABASE RECOVER  database  
Fri Jan 11 16:37:51 2013
Media Recovery Start
Fri Jan 11 16:37:51 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  database  ...
Fri Jan 11 16:39:29 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 16:39:29 2013
Media Recovery Start
 parallel recovery started with 3 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 16:39:43 2013
ALTER DATABASE RECOVER    CANCEL  
Fri Jan 11 16:39:44 2013
ORA-1547 signalled during: ALTER DATABASE RECOVER    CANCEL  ...
Fri Jan 11 16:39:44 2013
ALTER DATABASE RECOVER CANCEL 
ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ...
Fri Jan 11 16:40:15 2013
alter database datafile 10 online
Fri Jan 11 16:40:15 2013
Completed: alter database datafile 10 online
Fri Jan 11 16:40:25 2013
alter database datafile 31 online
Completed: alter database datafile 31 online
Fri Jan 11 16:40:47 2013
ALTER DATABASE RECOVER  database using backup controlfile until cancel  
Fri Jan 11 16:40:47 2013
Media Recovery Start
Fri Jan 11 16:40:47 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Fri Jan 11 16:47:12 2013
WARNING: inbound connection timed out (ORA-3136)
Fri Jan 11 17:44:47 2013
ALTER DATABASE RECOVER  datafile 10  
Fri Jan 11 17:44:47 2013
Media Recovery Start
Fri Jan 11 17:44:47 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  datafile 10  ...
Fri Jan 11 17:45:19 2013
ALTER DATABASE RECOVER  database until cancel using backup controlfile  
Fri Jan 11 17:45:19 2013
Media Recovery Start
Fri Jan 11 17:45:19 2013
Media Recovery failed with error 1110
ORA-283 signalled during: ALTER DATABASE RECOVER  database until cancel using backup controlfile  ...
Fri Jan 11 17:46:39 2013
alter database datafile 10 offline
Fri Jan 11 17:46:40 2013
Completed: alter database datafile 10 offline
Fri Jan 11 17:47:18 2013
ALTER DATABASE RECOVER  database until cancel  
Fri Jan 11 17:47:18 2013
Media Recovery Start
Fri Jan 11 17:47:18 2013
Media Recovery failed with error 1610
ORA-283 signalled during: ALTER DATABASE RECOVER  database until cancel  ...
Fri Jan 11 18:11:31 2013
alter database open
Fri Jan 11 18:11:31 2013
ORA-1589 signalled during: alter database open...
Fri Jan 11 18:35:29 2013
Starting ORACLE instance (normal)
Fri Jan 11 18:35:43 2013
alter database open
Fri Jan 11 18:35:43 2013
ORA-1589 signalled during: alter database open...

这是一个没有备份的数据库,实际上如果是存储字系统的问题导致了数据文件损坏,那么可能与DBA的关系并不大,但是在经过一下午的折腾,将一个其实仅仅是坏了2个数据文件而可以轻松 OPEN 的数据库恢复到无论如何也无法轻易打开的状态,这就与 DBA 有很大的关系了。