Analysis of an Application Failure on a Windows 2003 System
2009-02-06
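Background
At 4:00 a.m. on February 5 the on-call phone rang: the company's production statistics system was unreachable. I hurried to a user terminal to check. The server 10.1.106.193 answered PING, but access through http://10.1.106.193:8081 did not work. In the machine room I checked the server itself and found that none of the DOS windows of the three server-side programs were open. Running the three batch files started the server-side programs, and a retest showed the system working normally. Since a system fault might have stopped the application services, to be cautious I restarted the server first, then started the application services again, and confirmed that the application worked.
What caused these services to stop? Was it human error, with someone forgetting to run the batch files, or a system fault?
I decided to track down the root cause of the failure.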
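1 Analysis of the server logs
1.1 "Security" log
Figure 1, green box: the log records written during system startup when the server was restarted manually while resolving the failure, from 4:29:44 to 4:29:50, 6 seconds in total. The "System Event" entry on the line below the green box is the shutdown record written by the system.
Red box: from 2:45:50 to 2:45:57, 7 seconds in total. These records match those of a system restart. Did the system also restart by itself?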
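1.2 "System" log between 4 and 5 a.m.
In Figure 2, the entries from 4:23 to 4:31 reflect the system restart process. Note the time intervals here.
In Figure 2, 4:23:53 to 4:24:00 (7 seconds) covers the log records produced during logon, after the user name and password were entered.
The details of the ××× warning at 4:23:53 in Figure 2 are shown in Figure 3. The content matches the prompt I saw when I first logged on to the server to investigate the failure, which shows that the system had already failed before I logged on.
The details of the error at 4:24:00 are shown in Figure 4.
Figure 4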
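4:26:36 to 4:29:39 (183 seconds) is the part of the restart during which the event log service was stopped, so the system wrote no log records in this interval.
At 4:29:39 the event log service started and logging resumed.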
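4:29:39 to 4:31:00 (81 seconds) runs from the point where logging resumed to the record of a successful startup; 4:31:00 is the moment the system finished starting.
Summary: based on the analysis above of the logs from shortly after 4 a.m., the time from the on-duty engineer issuing the restart to the system being back up was 7 + 183 + 81 = 271 seconds.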
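The interval arithmetic above can be checked with a short script. This is only a minimal sketch: the timestamps are the ones read off Figure 2, and the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Return the seconds from start to end (both H:MM:SS on the same day)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps as read from the System log in Figure 2.
logon = seconds_between("4:23:53", "4:24:00")    # logon events
log_gap = seconds_between("4:26:36", "4:29:39")  # event log service stopped
boot = seconds_between("4:29:39", "4:31:00")     # logging resumed until startup complete

total = logon + log_gap + boot
print(logon, log_gap, boot, total)
```

Working from the raw timestamps rather than hand-computed differences makes it easy to repeat the same comparison for the 2:45 restart.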
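In Figure 5, from 2:45:04 to 2:47:00, the event sources are almost identical to those logged when the on-duty engineer restarted the server manually, yet nobody restarted the server during this early-morning period.
The difference from the 4 a.m. period lies in the times and messages of the eventlog source: there is no record of the event log service being shut down.
The details of the red error entry at 2:45:46 are shown in Figure 6: it reports that the system shut down unexpectedly at 2:42:51. If this had been an ordinary restart, why does the log contain no record of the event log service shutting down? The only explanation is that the server lost power suddenly, or restarted in a way that resembles a sudden power loss, because in that case no shutdown record is written. The record closest to 2:42:51 is at 2:45:04, 133 seconds (2 minutes 13 seconds) later. The gap between these 133 seconds (restart after power loss) and the roughly 270 seconds of the manual restart (services shut down cleanly after the restart command, then the system boots) is also consistent with this explanation.
Summary: the server most likely restarted after a sudden power loss, or after a system fault that behaves like one.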
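The reasoning in this section boils down to a simple rule: a clean restart leaves an "event log service stopped" record before the next "started" record, while a crash or power loss does not. The sketch below assumes the usual Windows EventLog-source event IDs (6005 = service started, 6006 = service stopped, 6008 = previous shutdown was unexpected, 6009 = OS version at boot). The IDs are not visible in the figures, so treat them, the function name, and the sample sequence as assumptions.

```python
EVENTLOG_STARTED = 6005  # assumed ID: "The Event log service was started."
EVENTLOG_STOPPED = 6006  # assumed ID: "The Event log service was stopped."


def classify_boots(event_ids):
    """Label each event log start as 'clean', 'unexpected' or 'unknown'.

    event_ids is a chronological list of EventLog-source event IDs.
    A start is 'clean' if the previous start/stop event was a stop,
    'unexpected' if it was another start (the service never shut down),
    and 'unknown' if there is no earlier start/stop event to judge by.
    Other IDs (such as 6008 and 6009) are ignored.
    """
    results = []
    previous = None
    for event_id in event_ids:
        if event_id == EVENTLOG_STARTED:
            if previous is None:
                results.append("unknown")
            elif previous == EVENTLOG_STOPPED:
                results.append("clean")
            else:
                results.append("unexpected")
        if event_id in (EVENTLOG_STARTED, EVENTLOG_STOPPED):
            previous = event_id
    return results


# Hypothetical sequence: an earlier boot, the 2:45 restart, then the manual 4:29 restart.
night = [6009, 6005, 6009, 6005, 6008, 6006, 6009, 6005]
print(classify_boots(night))
```

With this sample the second start is flagged as unexpected and the third as clean, which mirrors the difference between the 2:45 and 4:29 restarts described above.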
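2 Checking the network records
I first checked the logs on the network switch. To do that, the switch port the server is connected to has to be identified.
The server's port can be located quickly: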
center-1#show arp | in 106.193
Internet 10.1.106.193 5 0017.0857.7280 ARPA Vlan2
center-1#show mac- | in 0857.7280
2 0017.0857.7280 dynamic ip GigabitEthernet2/6
center-1#show cdp n g2/6 de
-------------------------
Device ID: Ghsw-A101-04-03
Entry address(es):
IP address: 10.1.107.11
Platform: cisco WS-C3560G-24TS, Capabilities: Router Switch IGMP
Interface: GigabitEthernet2/6, Port ID (outgoing port): GigabitEthernet0/25
Holdtime : 125 sec
Cisco IOS Software, C3560 Software (C3560-IPSERVICES-M), Version 12.2(25)SEE4, RELEASE SOFTWARE (fc1)
Copyright (c) 1986-2007 by Cisco Systems, Inc.
Compiled Mon 16-Jul-07 00:28 by myl
Protocol Hello: OUI=0x00000C, Protocol ID=0x0112; payload len=27, value=00000000FFFFFFFF010221FF000000000000001E4993BE00FF0000
VTP Management Domain: 'bjgh'
Native VLAN: 1
Duplex: full
Trying 10.1.107.11 ... Open
User Access Verification
Ghsw-A101-04-03>en
Password:
Ghsw-A101-04-03#show mac- | 0857.7280
^
% Invalid input detected at '^' marker.
2 0017.0857.7280 DYNAMIC Gi0/3
Ghsw-A101-04-03#show log | in 0/3
.Feb 4 09:55:16: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb 4 09:55:17: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb 5 02:57:46: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb 5 02:57:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb 5 02:57:50: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb 5 02:57:51: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb 5 02:59:12: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb 5 02:59:13: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb 5 02:59:16: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb 5 02:59:17: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb 5 04:41:46: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb 5 04:41:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb 5 04:41:50: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb 5 04:41:51: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
.Feb 5 04:43:12: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to down
.Feb 5 04:43:13: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb 5 04:43:15: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
.Feb 5 04:43:16: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/3, changed state to up
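Subtracting 14 minutes from the switch log times makes them line up with the times in the server's system log.
Time at which the switch saw the server go down: Feb 5 02:57:46, which minus 14 minutes is 2:43:46. Time recorded by the server: 2:42:51. (When a server restarts abruptly, as if it had lost power, the time it records is an estimate that is somewhat earlier than the actual time; here the difference is about 55 seconds.) So the time reported in the server log is consistent with the time the switch port went DOWN.
The switch entries from Feb 5 04:41:46 to Feb 5 04:43:16 (minus 14 minutes, i.e. 4:27:46 to 4:29:16) are the port UP/DOWN records from the manual restart after the failure was discovered.
What caused the server to restart as if it had suddenly lost power?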
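The clock correction can be applied mechanically to the switch log. The following is a rough sketch, assuming the 14-minute offset stated above, the year 2009, and English month abbreviations; the regular expression and the function name `link_events` are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Assumption from the text: the switch clock runs 14 minutes ahead of the server.
SWITCH_CLOCK_OFFSET = timedelta(minutes=14)

LINE_RE = re.compile(
    r"\.?(?P<ts>[A-Z][a-z]{2}\s+\d+ \d{2}:\d{2}:\d{2}): %LINK-3-UPDOWN: "
    r"Interface (?P<port>[^,\s]+), changed state to (?P<state>up|down)"
)


def link_events(log_text, year=2009):
    """Yield (server_time, port, state) for each %LINK-3-UPDOWN line."""
    for match in LINE_RE.finditer(log_text):
        switch_time = datetime.strptime(
            f"{year} {match['ts']}", "%Y %b %d %H:%M:%S"
        )
        yield switch_time - SWITCH_CLOCK_OFFSET, match["port"], match["state"]


sample = """\
.Feb 5 02:57:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to down
.Feb 5 02:57:50: %LINK-3-UPDOWN: Interface GigabitEthernet0/3, changed state to up
"""

events = list(link_events(sample))
for when, port, state in events:
    print(when.strftime("%H:%M:%S"), port, state)
```

Converting every entry to server time first makes it straightforward to place the switch events next to the Windows event log entries.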
3 Analysis of the DMP file
The system's MEMORY.DMP file is timestamped 2:45.
The analysis of the DMP file follows:
Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Documents and Settings\zhou.j\My Documents\106.193\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
****************************************************************************
* Symbol loading may be unreliable without a symbol search path. *
* Use .symfix to have the debugger choose a symbol path. *
* After setting your symbol path, use .reload to refresh symbol locations. *
****************************************************************************
Executable search path is:
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
* *
* The Symbol Path can be set by: *
* using the _NT_SYMBOL_PATH environment variable. *
* using the -y <symbol_path> argument when starting the debugger. *
* using .sympath and .sympath+ *
*********************************************************************
*** ERROR: Symbol file could not be found. Defaulted to export symbols for ntkrnlmp.exe -
Windows Server 2003 Kernel Version 3790 (Service Pack 2) MP (4 procs) Free x86 compatible
Product: Server, suite: TerminalServer SingleUserTS
Built by: 3790.srv03_sp2_gdr.080813-1204
Machine Name:
Kernel base = 0x80800000 PsLoadedModuleList = 0x808af9c8
Debug session time: Thu Feb 5 02:43:02.584 2009 (GMT+8)
System Uptime: 0 days 17:01:59.693
Loading Kernel Symbols
...............................................................
......................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffdb00c). Type ".hh dbgerr001" for details
Loading unloaded module list
....
*** ERROR: Symbol file could not be found. Defaulted to export symbols for storport.sys -
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
Page c0f8d not present in the dump file. Type ".hh dbgerr004" for details
*** ERROR: Module load completed but symbols could not be loaded for HpCISSs2.sys
*************************************************************************
*** ***
*** ***
*** Your debugger is not using the correct symbols ***
*** ***
*** In order for this command to work properly, your symbol path ***
*** must point to .pdb files that have full type information. ***
*** ***
*** Certain .pdb files (such as the public OS symbols) do not ***
*** contain the required information. Contact the group that ***
*** provided you with these symbols if you need this command to ***
*** work. ***
*** ***
*** Type referenced: nt!_KPRCB ***
*** ***
*************************************************************************
*** ERROR: Symbol file could not be found. Defaulted to export symbols for halmacpi.dll -
Summary: Probably caused by : storport.sys ( storport!StorPortGetPhysicalAddress+2a )
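When several dumps need to be compared, the attribution line can be pulled out of saved debugger output with a small helper. This is only a sketch: the function name is hypothetical and the pattern assumes the "Probably caused by" layout shown above. Because the symbols were not loaded in this session, the debugger's attribution should itself be read as a hint rather than proof.

```python
import re

CAUSE_RE = re.compile(
    r"Probably caused by\s*:\s*(?P<module>\S+)\s*\(\s*(?P<symbol>[^\s)]+)\s*\)"
)


def probable_cause(debugger_output):
    """Return (module, symbol) from the first 'Probably caused by' line, or None."""
    match = CAUSE_RE.search(debugger_output)
    if match is None:
        return None
    return match["module"], match["symbol"]


line = "Probably caused by : storport.sys ( storport!StorPortGetPhysicalAddress+2a )"
print(probable_cause(line))
```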
---------
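4 Looking up storport.sys
The following article can be found on the HP technical support website.
Article ID: 43167
Article title: HP ProLiant servers may encounter a blue screen under heavy input/output load
Article keywords: c01420773
Article path: http://www.icare.hp.com.cn/techcenter_staticarticle/43167/43167.html
This happens because, after the Microsoft KB932755 update for the Storport driver is installed, I/O Control (IOCTL) calls issued by the HP Insight Management Storage Agents run into problems.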
Affected HP Smart Array SCSI controllers:
Smart Array 6400/6402/6404 EM controller
Smart Array 641/642 controller
Smart Array 6i controller
Smart Array 5312 controller
Smart Array 5300/5302/5304 controller
Smart Array 532 controller
Smart Array 5i controller
Affected HP Smart Array SAS/SATA controllers:
Smart Array P800 controller
Smart Array P600 controller
Smart Array E500 controller
Smart Array P400/400i controller
Smart Array E200/200i controller
Affected software configuration:
Any edition of Microsoft Windows Server 2003 (x86 or x64),
and
HP ProLiant Smart Array 5x/6x Controller Driver (HPCISSS.SYS) version 5.18.0.64 (or earlier), or HP ProLiant Smart Array SAS/SATA Controller Driver (HPCISSS2.SYS) version 5.10.0.32 or 5.10.0.64 (or earlier),
and
the Microsoft Storport Driver for Windows Server 2003 delivered by Microsoft KB932755, version 5.2.3790.2880 (for SP1) or 5.2.3790.4021 (for SP2),
and
HP Insight Management Storage Agents (any version).
The blue screen problem has been corrected in the following updated versions:
For ProLiant servers running 64-bit editions of Windows Server 2003:
(HPCISSS.SYS) HP ProLiant Smart Array 5x and 6x Controller Driver for Windows Server 2003 x64 Editions, version 6.4.0.64 (or later)
(HPCISSS2.SYS) HP ProLiant Smart Array SAS/SATA Controller Driver for Windows Server 2003 x64 Editions, version 6.2.0.64 (or later)
For ProLiant servers running 32-bit editions of Windows Server 2003:
(HPCISSS2.SYS) HP ProLiant Smart Array SAS/SATA Controller Driver for Windows Server 2003, version 6.2.0.32 (or later)
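Whether an installed driver already contains the fix can be checked by comparing dotted version numbers field by field. The sketch below is illustrative only: the minimum versions are the ones quoted from the HP article, while the table layout, the architecture labels, and the function names are assumptions made for this example.

```python
def parse_version(text):
    """Turn a dotted version such as '5.10.0.32' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum fixed versions quoted in the HP article, keyed by (driver, architecture).
FIXED_VERSIONS = {
    ("HPCISSS.SYS", "x64"): "6.4.0.64",
    ("HPCISSS2.SYS", "x64"): "6.2.0.64",
    ("HPCISSS2.SYS", "x86"): "6.2.0.32",
}


def has_fix(driver, arch, installed_version):
    """Return True if installed_version is at or above the fixed version.

    Raises KeyError for a driver/architecture pair the article does not list.
    """
    fixed = FIXED_VERSIONS[(driver.upper(), arch)]
    return parse_version(installed_version) >= parse_version(fixed)


print(has_fix("HpCISSs2.sys", "x86", "5.10.0.32"))
print(has_fix("HPCISSS2.SYS", "x86", "6.2.0.32"))
```

Comparing tuples of integers rather than raw strings matters here, because a plain string comparison would order "5.10" before "5.2".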
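Finding the updated driver versions:
1. Visit www.hp.com
2. Select "Software and Driver Downloads".
3. Enter the ProLiant server model (for example "DL380 G5").
4. On the "Product Search Results" page (if it appears), select the specific server model.
5. Select the appropriate Windows Server 2003 edition.
6. Select Driver - Storage Controller.
7. Download the latest version of the appropriate driver.
Until the correct HPCISSS.SYS or HPCISSS2.SYS version is installed, removing the KB932755 update for the Storport driver through "Control Panel -> Add or Remove Programs" avoids the blue screen problem (see Figure 1 below).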
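I downloaded and installed this driver and kept the server under observation. Since the driver was installed, the server has not restarted by itself again.
Final conclusion
On February 5 the server 106.193 blue-screened and restarted. After the restart the X application services had to be started manually, which is why the reports were unavailable that day. Cause of the blue screen: certain ProLiant servers running Microsoft Windows Server 2003 SP2 with the Microsoft Storport storage port driver (STORPORT.SYS) may blue-screen. Installing the array controller driver released by HP resolved the problem.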