今天没有外出(似乎人不到现场就特别容易出问题),早上10点左右接到电话被告知crm11实例上出现了7445错误,准备用web vpn拨上去查看一下,赫然发觉windows 7 不支持这种vpn(准确说ie8和firefox都不支持);无奈无奈只好用拨号。
发现alert log中出现大量 7445错误记录:
Fri Mar 26 09:24:53 2010
Errors in file /oravl01/oracle/admin/CRMDB1/udump/crmdb11_ora_6754320.trc:
ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []
Fri Mar 26 09:24:55 2010
Trace dumping is performing id=[cdmp_20100326092455]
Fri Mar 26 09:31:16 2010
Errors in file /oravl01/oracle/admin/CRMDB1/udump/crmdb11_ora_2994552.trc:
ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []
看到kghalp函数第一印象 ,是Oracle中堆管理使用的函数;
让我们猜猜字面意思? k -> kernel g -> generic h-> heap a-> allocation p-> point
再让我们来看一下当时的call stack:
Exception signal: 11 (SIGSEGV), code: 51 (Invalid permissions for mapped object), addr: 0x3b, PC: [0x1000973e0, kghalp+0500]
Registers:
iar: 00000001000973e0, msr: a00000000000d0b2
lr: 00000001013a6df8, cr: 0000000022292484
r00: 0000000000000010, r01: 0ffffffffffcb160, r02: 000000011022a9c0,
r03: 0000000000000002, r04: 0000000000000000, r05: 0000000000000100,
r06: 0000000000000001, r07: 0000000000000000, r08: 0000000000000000,
r09: 0000000000000000, r10: 00000000101b60d8, r11: 0000000000000004,
r12: 0000000024592484, r13: 000000011026bfe0, r14: 0000000000000000,
r15: 0000000000009000, r16: 0000000110195b2c, r17: 0000000000000000,
r18: 0000000000000001, r19: 0000000000000000, r20: 0000000000001000,
r21: 0000000000000000, r22: 0000000000000100, r23: 0000000000000001,
r24: 0000000000000000, r25: 0000000000000000, r26: 0000000000000001,
r27: 0000000104c7fd44, r28: 0000000000000000, r29: 0000000000000100,
r30: 0000000000000000, r31: 0000000110195a58,
*** 2010-03-26 09:57:28.679
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []
Current SQL statement for this session:
INSERT INTO AUDIT_DDL_LOG (DDL_TIME, SESSION_ID, OS_USER, IP_ADDRESS, TERMINAL, HOST, USER_NAME, DDL_TYPE, OBJECT_TYPE, OWNER, OBJECT_NAME, SQL_TEXT) VALUES (SYSDATE, SYS_CONTEXT('USERENV','SESSIONID'), SYS_CONTEXT('USERENV','OS_USER'), SYS_CONTEXT('USERENV','IP_ADDRESS'), SYS_CONTEXT('USERENV','TERMINAL'), SYS_CONTEXT('USERENV','HOST'), ORA_LOGIN_USER, ORA_SYSEVENT, ORA_DICT_OBJ_TYPE, ORA_DICT_OBJ_OWNER, ORA_DICT_OBJ_NAME, :B1 )
----- PL/SQL Call Stack -----
object line object
handle number name
70000043da500d0 10 anonymous block
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c bl ksedst1 000000000 ? 104A54EED ?
ksedmp+0290 bl ksedst 104A54870 ?
ssexhd+03e0 bl ksedmp 300001D15 ?
000044C0 ? 00000000
parchk+01f4 bl kghalp 000000000 ?
2842288200000001 ?
000000000 ? 000000000 ?
000001040 ? 110195B2C ?
ptmak+0168 bl parchk FFFFFFFFFFCB560 ?
FFFFFFFFFFCB430 ?
FFFFFFFFFFCB430 ?
pdybF00_Init+0244 bl ptmak 10008049C ? 000000000 ?
FFFFFFFFFFCB4F0 ? 07FFFFFFF ?
pdy1F79_Init+00c8 bl pdybF00_Init 110BEB1D0 ?
pdy1F01_Driver+0048 bl pdy1F79_Init FFFFFFFFFFCBC40 ?
pdli_new_cog+00f0 bl pdy1F01_Driver FFFFFFFFFFCBCE0 ? 000000000 ?
pdlifu+0264 bl pdli_new_cog 1013885F4 ? FFFFFFFFFFCCB00 ?
7000004383E7680 ?
phpcog+0010 bl pdlifu FFFFFFFFFFCD958 ?
7000004383E7680 ? 104C95048 ?
phpcmp+0f80 bl phpcog FFFFFFFFFFCC4F0 ? 000000000 ?
pcicms2+02d4 bl phpcmp FFFFFFFFFFCD958 ?
发生错误的最上层 kghalp 函数由 parchk 调用, 这似乎是一个package check函数(猜测,呵呵). 我们来整理一下思路, parchk 函数调用了 kghalp函数以帮其分配内存,但却得到了一个非法的低地址[[0x00000003B],正常情况下正文段使用的空间; 这看起来显然是一个bug。
让我们来查查support.oracle.com , 键入7445 kghalp 和sigsegv 关键字 (很多时候不需要使用ora 600/7445 lookup tools).
bug 8244533 赫然显目:
Bug 8244533: ORA-07445 [KGHALP] ERRORS COMPILING PACKAGE WITH DEBUG
STACK TRACE:
------------
ksedst <- ksedmp <- ssexhd <- 000044BC <- parchk <- ptmak <-
pdybF00_Init <- pdy1F79_Init <- pdy1F01_Driver <- pdli_new_cog <-
pdlifu <- phpcog <- phpcmp <- pcicms2 <- pcicms <- kkxcms <- kkxswcm
<- kkxmpbms <- kkxmesu <- xtypls <- qctopls <- qctcopn <- qctcopn
Exception signal: 11 (SIGSEGV), code: 51 (Invalid permissions for mapped
object),
addr: 0x3b, PC: [0x1000973e0, kghalp+0500]
Registers:
iar: 00000001000973e0, msr: a00000000000d0b2
lr: 000000010139ffb8, cr: 00000000222a2484
r00: 0000000000000010, r01: 0ffffffffffe2980, r02: 00000001101e5ab8,
r03: 0000000000000002, r04: 0000000000000000, r05: 0000000000000100,
r06: 0000000000000001, r07: 0000000000000000, r08: 0000000000000000,
r09: 0000000000000000, r10: 0000000010171200, r11: 0000000000000004,
r12: 00000000245a2484, r13: 000000011021fbc0, r14: 0000000000000000,
r15: 0000000000009000, r16: 0000000110150c54, r17: 0000000000000000,
r18: 0000000000000001, r19: 0000000000000000, r20: 0000000000001000,
r21: 0000000000000000, r22: 0000000000000100, r23: 0000000000000001,
r24: 0000000000000000, r25: 0000000000000000, r26: 0000000000000001,
r27: 0000000104c5983c, r28: 0000000000000000, r29: 0000000000000100,
r30: 0000000000000000, r31: 0000000110150b80,
*** 16:37:14.603
ksedmp: internal or fatal error
ORA-7445: exception encountered: core dump [kghalp+0500] [SIGSEGV]
[Invalid permissions for mapped object] [0x00000003B] [] []
Current SQL statement for this session:
select dummy from dual where ora_dict_obj_type = 'TABLE'
----- Call Stack Trace -----ptmak pdybF00_Init pdy1F79_Init pdy1F01_Driver pdli_new_cog pdlifuphpcog phpcmp pcicms2 pcicms kkxcms kkxswcm kkxmpbms kkxmesu xtyplsTo Filer.Based on this call stack this would appear a likely match forbug 6951953 Abstract: ORA-7445 [PTMAK] IMPORTING PACKAGE COMPILED DEBUG.This bug is fixed on 10.2.0.5 and there is a 10.2.0.4 patch available for IBM AIX Based Systems (64-bit).It maybe worth while to have the customer apply the patch to seeif it resolves the issue.Also the uploaded files included test.sql is this a reproducable testcase?
这个bug 似乎仅在 IBM AIX on POWER Systems (64-bit) 发生,当以DEBUG 模式编译包时有一定几率出现。
好了,既然已经了解了可能发生的诱因,我们可以进一步分析了,接下来看看 errorstack trace信息中 的SO 记录。
SO: 70000043d217668, type: 53, owner: 70000048cee2238, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043d217668 handle=700000446261588 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043d2176e8[70000042b52b368,70000042bb9a808] htb=70000044929b460 ssga=70000044929ad68
user=70000048cee2238 session=70000048eb33010 count=1 flags=[0000] savepoint=0x4bac1488
LIBRARY OBJECT HANDLE: handle=700000446261588 mtx=7000004462616b8(1) cdp=1
name=ALTER TRIGGER "SHUCRM3O"."TRI_PRODUCT_INSTANCE_RELATED" COMPILE DEBUG REUSE SETTINGS
hash=164e6a8942406cee159f8943a1a3c85e timestamp=03-26-2010 09:52:12
namespace=CRSR flags=RON/KGHP/TIM/PN0/SML/KST/DBN/MTX/[120100d0]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=16 hpc=0002 hlc=0002
lwt=700000446261630[700000446261630,700000446261630] ltm=700000446261640[700000446261640,700000446261640]
pwt=7000004462615f8[7000004462615f8,7000004462615f8] ptm=700000446261608[700000446261608,700000446261608]
ref=700000446261660[700000446261660,700000446261660] lnd=700000446261678[700000446261678,700000446261678]
LIBRARY OBJECT: object=70000045adbc1e8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child# table reference handle
5 70000041776f5c0 70000045ae44720 70000042bfa3a20
DATA BLOCKS:
data# heap pointer status pins change whr
0 70000043d9fed20 70000045adbc300 I/P/A/-/- 0 NONE 00
的确有以debug 模式编译对象的语句,不过对象不是包而是trigger ; 看起来只要是可以以debug 模式compile 的对象都有可能引发该问题。
好了,问题到这里已经比较明确了: 应用端以DEBUG模式重新编译包引发了 Oracle bug 8244533,从而导致了对应服务进程的崩溃;总算是虚惊一场,之后通过trace内的machine和user信息找到了实施变更的应用方人员并教育之。