RAC中增大db_cache_size引发的ORA-04031错误

原创

maclean_007 2011-11-01 21:42:13 ©著作权

文章标签 memory resource shadow gcs ASMM db_cache_size 文章分类 Oracle 数据库

©著作权归作者所有：来自51CTO博客作者maclean_007的原创作品，请联系作者获取转载授权，否则将追究法律责任

几个礼拜前，有一套10.2.0.2 的二节点RAC 数据库因为增大db_cache_size ，引发其中一个实例发生著名的ORA-04031 错误，日志如下:

Errors in file /oracle/oracle/admin/maclean/udump/u1_ora_13757.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 1048 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","kgghteInit")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 1048 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","seg:kggfaAllocSeg")
Thu Oct 13 08:25:05 2011
Log from www.oracledatabase12g.com  & www.askmaclean.com
Errors in file /oracle/oracle/admin/maclean/udump/u1_ora_1444.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","kgghtInit")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","kgghtInit")

以上错误出现的同时实例出现大量的row cache lock字典缓存和cursor:pin S wait on X等待事件，说明共享池中的row cache字典缓存和SQL area 执行计划因为Free Memory不足而被不断换出，导致硬解析增多并SQL解析性能下降，进一步造成了应用程序挂起，赶到现场后对该ORA-04031错误进行了分析。 SGA中的内存池包含不同大小的内存块。当数据库启动时，会有一块大的内存被分配并使用Free list的空闲列表追踪。随着时间推移，这些内存被不断分配和释放，内存块(chunk)被按照其大小在不同的Fress list中移动，当SGA里任何一个内存池出现不能满足内部分配一整块连续的内存块请求时，就可能出现ORA-04031错误。实际使用中造成ORA-04031错误的原因可能是Oracle软件bug、产品缺陷、应用程序设计不当、Oracle内存参数设置不当。这里出现ORA-04031错误的内存池是shared pool即共享池，为了搞清楚ORA-04031错误发生的实际原因，我们通过AWR报告分析共享池的使用情况。以下是 ORA-04031 问题发生前一天AWR报告中的共享池内存使用情况：

Pool	Name	Begin MB	End MB	% Diff
large	free memory	112.00	112.00	0.00
shared	ASH buffers	25.60	25.60	0.00
shared	CCursor	19.44	20.16	3.70
shared	Checkpoint queue	5.87	5.87	0.00
shared	PCursor	10.57	11.14	5.38
shared	event statistics per sess	7.72	7.72	0.00
shared	free memory	32.99	33.00	0.02
shared	gcs resources	78.75	78.75	0.00
shared	gcs shadows	49.61	49.61	0.00
shared	ges big msg buffers	15.03	15.03	0.00
shared	ges reserved msg buffers	7.86	7.86	0.00
shared	ges resource	5.28	5.28	0.00
shared	kglsim heap	16.63	16.63	0.00
shared	kglsim object batch	25.63	25.63	0.00
shared	library cache	21.32	22.01	3.23
shared	row cache	7.13	7.13	0.00
shared	sql area	64.06	61.55	-3.91
streams	free memory	64.00	64.00	0.00
	buffer_cache	3,936.00	3,936.00	0.00
	fixed_sga	2.08	2.08	0.00
	log_buffer	3.09	3.09	0.00

以下是 ORA-04031 问题发生时AWR报告中的共享池内存使用情况：

Pool	Name	Begin MB	End MB	% Diff
large	free memory	112.00	112.00	0.00
shared	ASH buffers	25.60	25.60	0.00
shared	Checkpoint queue	5.87	5.87	0.00
shared	KCL name table	9.00	9.00	0.00
shared	event statistics per sess	7.72	7.72	0.00
shared	free memory	25.56	25.52	-0.12
shared	gcs resources	143.39	143.39	0.00
shared	gcs shadows	90.33	90.33	0.00
shared	ges big msg buffers	15.03	15.03	0.00
shared	ges reserved msg buffers	7.86	7.86	0.00
shared	library cache	7.59	7.65	0.80
shared	row cache	7.13	7.13	0.00
shared	sql area	8.70	7.35	-15.57
streams	free memory	64.00	64.00	0.00
	buffer_cache	7,168.00	7,168.00	0.00
	fixed_sga	2.09	2.09	0.00
	log_buffer	3.09	3.09	0.00

红色部分标注了2个报告中差异最大的地方，在问题发生时共享池中gcs resources和gcs shadows 2种资源对比前一天增长了169M。 gcs资源在共享池中享有较高的优先级，而普通的SQL语句或执行计划享有较低的优先级，因为gcs资源所占用空间的大量膨胀，导致在没有调大共享池大小的情况下sql area和row cache内存资源被换出进而引发SQL解析性能下降和ORA-04031问题。 gcs resources和gcs shadow资源均是Oracle RAC中特有的全局缓存服务资源，这些资源负责处理RAC中的全局buffer cache。同时这些资源所占用共享池的空间视乎Oracle实例所使用高速缓存的大小而决定，Metalink文档说明了该问题: "The 'gcs resources' and 'gcs shadows' structures are used for handling buffer caches in RAC, so their memory usages are depending on buffer cache size. We can use V$RESOURCE_LIMIT to monitor them." 当实例高速缓存buffer cache被大小时gcs资源所占用的空间也相应增长，具体算法如下：

‘gcs_resources’ = initial_allocation * 120 bytes = “_gcs_resources parameter” * 120 bytes
‘gcs_shadows’ = initial_allocation * 72 bytes = “_gcs_shadow_locks parameter” * 72 bytes

select * from v$resource_limit where resource_name like '%gcs%';

RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
------------------------------ ------------------- --------------- ---
gcs_resources 507772 514607 976083 976083
gcs_shadows 133862 139927 976083 976083

我们可以通过现有的v$resource_limit视图中的INITIAL_ALLOCATION估算Buffer cache增加后的INITIAL_ALLOCATION数量，例如我们准备将db_cache_size从10g增加到20g，那么可以通过下列公式算出有必要增加的共享池大小： add_to_shared_pool_size= 140 * Buffer_cache增加的兆数 * 192 bytes * 1.6 = 140 * 10* 1024 * 192 * 1.6 = 440401920 = 420M 问题总结 由于RAC环境中Oracle 使用共享池中的gcs resource/shadow 资源管理全局缓存，当实例的Buffer Cache总量增加时gcs resource/shadow 这些资源的数目也会相应上升，这导致共享池中可用的剩余空间大幅下降，又因为 gcs 全局缓存资源在共享池中享有较高的优先级( perm ，且在10.2中 gcs资源不能和其他如row cache或library cache 共享一个Extent的内存区间) ，引发了大量的row/dictionary cache字典缓存和SQL执行计划被换出共享池，引发大量的解析等待: cursor pin s on x 和 row cache lock ，