按照计划开始了生产库的升级,环境基于linux 64位. uname:

Linux 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

数据库是10.2.0.5.0

要升级到11.2.0.2.0

已经提前打了最新的PSU


Interim patches (1) :

Patch 16056267 : applied on Thu Oct 03 16:01:47 ICT 2013

Unique Patch ID: 15994298

Patch description: "Database Patch Set Update : 11.2.0.2.10 (16056267)"

Created on 12 Mar 2013, 10:08:56 hrs PST8PDT

Sub-patch 14727315; "Database Patch Set Update : 11.2.0.2.9 (14727315)"

Sub-patch 14275621; "Database Patch Set Update : 11.2.0.2.8 (14275621)"

Sub-patch 13923804; "Database Patch Set Update : 11.2.0.2.7 (13923804)"

Sub-patch 13696224; "Database Patch Set Update : 11.2.0.2.6 (13696224)


升级的详细步骤在另一篇日志中已经讲到。

以下就是升级中碰到的问题,总之尽管在测试环境已经做了两遍演练,但是在生产还是碰到了不少的问题,而且有些ORA-600的问题让oracle原厂support的人也很无语,不过还是经过坚持不懈的努力,还是按时将数据库从10g升级到了11g

1.invalid components issue fix error

invalid component的问题修复在另一篇日志中已经有所描述。

​http://space.itpub.net/?uid-23718752-action-viewspace-itemid-766215​

因为生产中有部分component是invalid,因为没有用到,暂时没有影响,所以在升级之前需要修复这些问题,但是在修复的时候还是碰到了ora的错误。

照例按照步骤,先把invalid的jvm给remove,但是在re-install的时候出现了下面的问题。


SQL> /
COMP_ID COMP_NAME STATUS VERSION
------------------------------ -------------------------------------------------- ----------- ------------------------------
CATALOG Oracle Database Catalog Views VALID 10.2.0.5.0
CATPROC Oracle Database Packages and Types VALID 10.2.0.5.0
OWM Oracle Workspace Manager VALID 10.2.0.5.0
XDB Oracle XML Database VALID 10.2.0.5.0
ODM Oracle Data Mining VALID 10.2.0.5.0
CONTEXT Oracle Text VALID 10.2.0.5.0
JAVAVM JServer JAVA Virtual Machine VALID 10.2.0.5.0
XML Oracle XDK REMOVED 10.2.0.5.0
CATJAVA Oracle Database Java Packages VALID 10.2.0.5.0
APS OLAP Analytic Workspace VALID 10.2.0.5.0
XOQ Oracle OLAP API VALID 10.2.0.5.0
COMP_ID COMP_NAME STATUS VERSION
------------------------------ -------------------------------------------------- ----------- ------------------------------
AMD OLAP Catalog VALID 10.2.0.5.0
12 rows selected.
SQL> BEGIN
2 IF dbms_registry.status('CATJAVA') IS NULL THEN
3 RAISE_APPLICATION_ERROR(-20000,
4 'CATJAVA has not been loaded into the database.');
5 END IF;
6 IF dbms_registry.is_loaded('JAVAVM') != 1 THEN
7 RAISE_APPLICATION_ERROR(-20000,
8 'JServer is not operational in the database; ' ||
9 'JServer is required to remove CATJAVA from the database.');
10 END IF;
11 END;
12 /
BEGIN
*
ERROR at line 1:
ORA-20000: JServer is not operational in the database; JServer is required to remove CATJAVA from the database.
ORA-06512: at line 7


升级的大晚上碰到这个问题,大家都有点手足无策,没办法从metalink中还是找到了相关的一些步骤,

需要re-load一下。然后remove,re-install试试。


SQL> execute sys.dbms_registry.loaded('JAVAVM');
PL/SQL procedure successfully completed.
SQL> execute sys.dbms_registry.loaded('CATJAVA');
PL/SQL procedure successfully completed.
SQL> execute sys.dbms_registry.valid('JAVAVM');
PL/SQL procedure successfully completed.
SQL> execute sys.dbms_registry.valid('CATJAVA');
PL/SQL procedure successfully completed


2. DB startup upgrade failed,

在切换了oracle home 从10g到11g之后,尝试启动到startup upgrade模式,竟然包了如下的错误。


startup upgrade

ORA-00600: internal error code, arguments: [kccsbck_first], [1], [3125205511],

[], [], [], [], [], [], [], [],

solution:



最后还是先保证11g的环境下没有ora进程存在,然后切回10g的环境,看有没有ora的进程,如果有一定要停掉。然后再切回11g home,就可以了。

3. timezone issue

timezone在10gR2 0.5.0版本中是version 4,但是在11g中是version 14.一定要确保升级到version 14,要不会出现很多奇怪的问题,尤其在时间敏感的系统中,比如订购系统中,这个一定要保证正确。

有一个简单的方法可以测试一下timezone是否升级成功。


1)select * from v$timezone_file;

2) exp with consistent=y in table mode



按照预定的步骤我升级了timezone,但是在exp consistent=y的时候竟然还是报错了。

如果export 不加consistent=y就没有问题,import也没有任何问题。


exp APP_ROLLBK/APP_ROLLBK file=test.dmp tables=AAAAA  consistent=y
Export: Release 11.2.0.2.0 - Production on Tue Oct 8 08:30:08 2013
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in UTF-8 character set and UTF8 NCHAR character set
About to export specified tables via Conventional Path ...
. . exporting table AAAAA 76 rows exported
EXP-00008: ORACLE error 1466 encountered
ORA-01466: unable to read data - table definition has changed
Export terminated successfully with warnings.


为了排查,测试了下expdp是否也有问题,测试了一下,expdp 正常expdp XXXX/XXX DIRECTORY=DATA_PUMP_DIR DUMPFILE=test.dmp tables=AAAAA consistent=y眼看离rollback的时间越来越近,但是这个问题还是没有解决。照着oracle的建议做了如下测试,但是还是不行。


what we have tried suggested by ORACLE:

1. we create a new table, and export using below command

exp xxxx consistent=y …

there will be issues

2. we tried to export from schema1(without consistent=y), and import into schema2, and export from schema2 with consistent=y

there will still issues.



咨询了下公司的资深专家,给了如下的建议


From Oracle MOS doc # 1011680.102 –

The scenario you report usually occurs when some change has been made to the System date or FIXED_DATE in the init.ora.

The problem happens because the creation timestamp of objects is greater than the System time.

o If you are using FIXED_DATE, you have to add a few minutes (or a few hours) to the TIME.

DBA: we don’t set it manually, just keep it is.

o If not, it could be that the System date of the OS is incorrect. If this is the case, you have do another full export without the consistent=y – and recreate the database.

DBA : we checked physical lock and date, they are synchronized,

hwclock;date

Tue 08 Oct 2013 01:18:18 PM ICT -0.267508 seconds

Tue Oct 8 13:18:17 ICT 2013

You can run this query to identify the wrongly-defined objects:

select to_char (created, 'dd-mm-yyyy hh24:mi:ss') "CREATION TIME", object_name, object_type, object_id from dba_objects where created > sysdate;

DBA: we have ran above query, and nothing returned.


最后oracle试着想重新升级一下timezone,第一个步骤是purge recyclebin;

做完之后试着exp consistent=y了一下,竟然可以了!

关于recycle bin的部分,oracle最后的回复如下:


Reference with note : Complete Checklist for Manual Upgrades to 11gR2 (Doc ID 837570.1)

Action :

#sqlplus / as sysdba

SQL> purge recyclebin ;

EXP-00008: ORACLE error 1466 encountered

ORA-01466: unable to read data - table definition has changed

Export terminated successfully with warnings.

This is a time-based read consistency error for a database object, such as a table or index.

Either of the following may have happened: The query was parsed and executed with a snapshot older than the time the object was changed. The creation time-stamp of the object is greater than the current system time. This happens, for example, when the system time is set to a time earlier than the creation time of the object.

Output from script. (utlu112i.sql) recommend that recycle bin is empty prior to upgrading


4. exp with consistent in schema mode

第三个问题解决了,第四个问题就迎刃而解了,继续后面的操作。


exp prdrefwork/dba4true file=test.dmp consistent=y
Export: Release 11.2.0.2.0 - Production on Tue Oct 8 07:58:49 2013
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in UTF-8 character set and UTF8 NCHAR character set
. exporting pre-schema procedural objects and actions
EXP-00008: ORACLE error 1466 encountered
ORA-01466: unable to read data - table definition has changed
ORA-06512: at "SYS.DBMS_AW_EXP", line 818
ORA-06512: at line 1
EXP-00083: The previous problem occurred when calling SYS.DBMS_AW_EXP.schema_info_exp
. exporting foreign function library names for user xxxxx
. exporting PUBLIC type synonyms
. exporting private type synonyms
. exporting object type definitions for user xxxx
About to export xxxxxx objects ..


5.error from alert log

升级基本完成,检查alert log,发现如下的错误。显示小小紧张了一下,排除了其他的可能, 是timezone升级的修复,修复之后再没有这些错误了。Archived Log entry 6765 added for thread 1 sequence 6728 ID 0xb8c6d509 dest 1:Tue Oct 08 04:51:01 2013ORA-1466 (RO Tx began: 10/07/2013 21:51:00, Last DDL: 10/08/2013 02:07:25, Curr Time: 10/07/2013 21:51:00)Tue Oct 08 06:04:15 2013The value (158) of MAXTRANS parameter ignored.还有一个问题需要注意的是,如果在10g中有外部表的话,升级11g的时候一定也要注意,很有可能外部表会有问题,(比如directory的变更),导致在oracle的内部job中报出很多错误,无法进行。


6. ora-00020 throw after DB upgrade

升级工作基本完成了,还没准备休息。就听有app的人反应数据库的连接不够了。查看alert日志,发现如下的日志,自己试着用sysdba登,竟然都提示没有process了。Archived Log entry 6793 added for thread 1 sequence 6756 ID 0xb8c6d509 dest 1:Wed Oct 09 00:31:24 2013ORA-00020: maximum number of processes 5000 exceededORA-20 errors will not be written to the alert log for the next minute. Please look at trace files to see all the ORA-20 errors.Wed Oct 09 00:32:33 2013ORA-00020: maximum number of processes 5000 exceededORA-20 errors will not be written to the alert log for the next minute. Please look at trace files to see all the ORA-20 errors.Wed Oct 09 00:35:03 2013稍等了一下,再次登进去,发现有大量的inactive session.而且sql_id都是null.证明那些session再没有执行相关的sql,和等待事务。而且program都是指向一个客户端,根据以往的经验是app在程序中没有释放连接。先作为临时解决方法,让他们把那个客户端过来的inactive session都kill了,先让其他的流程能够跑起来。kill之后,几个小时系统基本正常了,但是这个问题还是一个定时炸弹,不知道根本的原因都不踏实。过了几个小时候,我监控session数突然从800增加到了3000多,而且还在不断增加。

as session suddenly increased from 800 to 3458,赶紧给team发邮件,让他们监控在那段时间有什么job/daemon在跑。

最后根据客户的反馈也比较模糊,他们说起了很多job,确实不好定位问题。

这下可苦了我们,不过还好我有开发功底,我从他们那拿到代码,排除了各种不想管的部分后,一点一点的琢磨起来,最后一共相关的job有40个左右,当我看到第20个的时候,发现connection明显处理有问题,会循环产生相当多的connection。但是没有合理的close,定位了问题。

让我们由被动变为主动。客户也不好意思,加紧去修复那个问题了。

STATUS          CNT
-------- ----------
KILLED 122
ACTIVE 38
INACTIVE 3298
----------
sum 3458

7.re-create user_synonyms view