一次迫不得以的升级

1.简单的重启

一天晚上,凌晨12点左右,正睡得迷迷糊糊,被电话叫醒,说有套老的HA环境,主用的主机宕了,正在重启,请到时侯主机起好后,帮忙起下库。我就应下了,上下VPN看了下,主机还没有启好,等了20分钟,问一下,还是没有启动好,我说了声我先睡会,好了叫我,然后我就去睡了。
到1点多,电话过来了,说主用主机硬件故障,暂时没有备件,也就是说主用主机拉不起来,把存储、浮动IP切到备用节点,在备机上把数据库拉起来,这些操作主机那边已经做完。这套库平时不在我这里维护,我看了下,VG确实都在备机上激活了,那些VG是不是数据库在用的,我不清楚,但至少没有未激活的VG,浮动IP也过来了。

2.踩坑

由于这套环境平时不是我维护,我就问了下业务人员,数据库的SID是什么,告诉我是bill,然后我就到备机的$ORACLE_HOME/dbs里去看了下,奇怪,怎么没有相应的spfile和pfile,我跟业务维护人员确认一下,是叫bill?他们确认。好吧,然后我告诉他们,在备机上,数据库的参数文件没有,想拉起库来,得自已编一个了。由于主用节点的主机没起来,我也不可能从主用主机上面拿了(顺便说一句,如果oracle是装在共享文件系统上,就没后面的事了)。我看了下备机,上面跑着一个数据库实例billold,我问了下业务人员,他们说是个测试库,可以停掉。停不停就无所谓了,要起的库是bill,不冲突。
我启了下sqlplus,看了下oracle的版本,是10.2.0.5,然后我就去找了下10g的参数文件内容,然后到存储VG里去找控制文件路径,找到如下三个控制文件:

/dev/vgbill01/rlvctrl_bill_200_1,
/dev/vgbill02/rlvctrl_bill_200_2, 
/dev/vgbill03/rlvctrl_bill_200_3

再在备机上创建一些目录,好了,整个pfile就出来了。从这里其实就应该发现一些端倪,怎么备机上,没有任何bill实例的信息,照道理,做备机时,应该做过切换的,备机上应该有相关信息的,但是由于这个库实在是很老了,主机也是很老的主机,HP v11.11 PA环境。我大意了,以为备机上的信息不知道被谁弄掉了,或者是这个备机不知道从哪边拿来的吧,压根就没在这个备机上做过切换操作。(以前有很多次这种临时拿个备机来用的)。
好了,做出来的pfile内容如下:

*.background_dump_dest='/ora9i/oracle/admin/bill/bdump'
*.compatible='9.2.0.0.0'
*.control_files='/dev/vgbill01/rlvctrl_bill_200_1','/dev/vgbill02/rlvctrl_bill_200_2','/dev/vgbill03/rlvctrl_bill_200_3'
*.core_dump_dest='/ora9i/oracle/admin/bill/cdump'
*.db_block_size=8192
*.db_cache_size=5016582400
*.db_domain=''
*.db_file_multiblock_read_count=16
*.db_files=800
*.db_name='bill'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=billXDB)'
*.fast_start_mttr_target=300
*.hash_join_enabled=TRUE
*.instance_name='bill'
*.java_pool_size=104857600
*.job_queue_processes=2
*.large_pool_size=8388608
*.log_archive_dest_1='location=/ticket/arch'
*.log_archive_start=TRUE
*.open_cursors=8192
*.pga_aggregate_target=8589934592
*.processes=850
*.query_rewrite_enabled='FALSE'
*.remote_login_passwordfile='EXCLUSIVE'
*.shared_pool_size=419430400
*.sort_area_size=4194304
*.star_transformation_enabled='FALSE'
*.timed_statistics=TRUE
*.undo_management='AUTO'
*.undo_retention=10800
*.undo_tablespace='UNDOTBS1'
*.user_dump_dest='/ora9i/oracle/admin/bill/udump'

开始起数据库

SQL> startup nomount
ORA-32004: obsolete and/or deprecated parameter(s) specified
ORACLE instance started.

Total System Global Area 1.9462E+10 bytes
Fixed Size                  2115568 bytes
Variable Size            2264926224 bytes
Database Buffers        1.7180E+10 bytes
Redo Buffers              14659584 bytes
SQL> alter database mount;
alter database mount
*
ERROR at line 1:
ORA-01102: cannot mount database in EXCLUSIVE mode

看了下alert日志

alter database mount
Mon Sep 25 01:27:25 EAT 2017
sculkget: failed to lock /ora9i/app/product/10.2.0/dbs/lkBILL exclusive
sculkget: lock held by PID: 5420
Mon Sep 25 01:27:25 EAT 2017
ORA-09968: unable to lock file
HP-UX Error: 13: Permission denied
Additional information: 5420
Mon Sep 25 01:27:25 EAT 2017
ORA-1102 signalled during: alter database mount...

/ora9i/app/product/10.2.0/dbs/lkBILL这个文件存在,我看了下文件的权限,没有问题,有点奇怪

$ ls -l
total 144
-rw-r--r--  1 oracle    dba          41140 Nov 27  2013 analyze.sql
drwxrwxrwx  2 oracle    dba          2048 Aug 11  2009 bak
-rw-rw-rw-  1 oracle    dba            958 Aug 11  2009 initbillold.ora
-rw-rw----  1 oracle    dba            24 Sep 25 02:19 lkBILL

我看了下*.compatible=’9.2.0.0.0’,这个参数我是随便从其它的一个库里拿过来的,当前备机上软件是10.2.0.5了,可能这个参数低了吧。
想着,如果数据库以前的compatible低点,改成10.2.0.5也没问题,就将这个参数改为*.compatible=’10.2.0.5’。

重启数据库还是报

alter database mount
Mon Sep 25 01:31:04 EAT 2017
sculkget: failed to lock /ora9i/app/product/10.2.0/dbs/lkBILL exclusive
sculkget: lock held by PID: 5420
Mon Sep 25 01:31:04 EAT 2017
ORA-09968: unable to lock file
HP-UX Error: 13: Permission denied
Additional information: 5420

我就将/ora9i/app/product/10.2.0/dbs/lkBILL这个文件删了,然后执行

SQL> alter database mount

数据库mount了,然后打开库,

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced

怎么数据库打不开?看了下alert日志,这时侯我的心拔凉拔凉的。内容如下:

Mon Sep 25 01:32:14 EAT 2017
alter database mount
Mon Sep 25 01:32:19 EAT 2017
ALERT: Compatibility of the database is changed from pre-10i to 10.2.0.5.0.
Increased the record size of controlfile section 2 to 256 bytes
Control file expanded from 2490 blocks to 2492 blocks
Increased the record size of controlfile section 4 to 428 bytes
Control file expanded from 2492 blocks to 2990 blocks
Increased the record size of controlfile section 9 to 56 bytes
Control file expanded from 2990 blocks to 2994 blocks
Increased the record size of controlfile section 10 to 200 bytes
Control file expanded from 2994 blocks to 3032 blocks
kcv_lh_or_upgrade: 10.2 upgrading 1 incarnations
Control file expanded from 3032 to 3328 blocks for upgrade.
kcv_lh_or_upgrade: 10.2 upgrading 1 incarnations
Setting recovery target incarnation to 1
ORA-59 signalled during: alter database mount...

大家看出什么了吗?对的,数据库不是10g的,是9i的。这时侯,我才想起,问下业务人员,这数据库是9i还是10g啊?业务人员简单地回答了,是9i。
此时,我才意识到,我踩了个多大的坑啊,我心里现在有千万只草泥马跑过。

3.尝试9i重启

这时侯我在备机上看,billold这个库确实是10.2.0.5的,然后在备机上找啊找,终于找到了,原来备机的oracle已经升级成10.2.0.5了,原来的bill信息,还在老的9i目录里面。
但是oracle用户的环境变量已经指到了10g了,所以导致了我的这次踩坑,当然也怪我,凌晨1点被叫起来,对不熟悉的库麻痹大意了。
明知道的9i环境正常肯定已经起不来了,但不死心啊,试试吧,准备了9i的环境变量

export ORACLE_BASE=/ora9i/app
export ORACLE_HOME=/ora9i/oracle/product
export ORACLE_SID=bill
export PATH=$ORACLE_HOME/bin:/usr/bin:/usr/local/bin:/usr/ccs/bin:.
export NLS_LANG=AMERICAN_AMERICA.ZHS16CGB231280
unset ORA_NLS33

启动报错如下:

SQL> startup nomount
ORACLE instance started.

Total System Global Area 5707370232 bytes
Fixed Size                  745208 bytes
Variable Size            671088640 bytes
Database Buffers        5033164800 bytes
Redo Buffers                2371584 bytes
SQL> alter database mount;
alter database mount
*
ERROR at line 1:
ORA-00201: controlfile version 9.2.0.0.0 incompatible with ORACLE version
9.2.0.0.0
ORA-00202: controlfile: '/dev/vgbill01/rlvctrl_bill_200_1'

三个控制文件,都已经被升级了,9i环境无法mount。

3.想办法回退

想重建控制是不是可以救回来呢。先在10g环境下,备份控制文件的创建脚本到trc

SQL> alter database backup controlfile to trace;

然后就到9i环境下,重建控制文件

SQL> startup nomount
ORACLE instance started.

Total System Global Area 5707370232 bytes
Fixed Size                  745208 bytes
Variable Size            671088640 bytes
Database Buffers        5033164800 bytes
Redo Buffers                2371584 bytes
SQL> 
SQL> CREATE CONTROLFILE REUSE DATABASE "BILL" NORESETLOGS  ARCHIVELOG
  2      MAXLOGFILES 64
  3      MAXLOGMEMBERS 5
  4      MAXDATAFILES 8192
  5      MAXINSTANCES 32
  6      MAXLOGHISTORY 680
.....

脚本太长,就不罗列了,然而创建控制文件,给我这么个报错,我的心彻底凉了

CREATE CONTROLFILE REUSE DATABASE "BILL" NORESETLOGS  ARCHIVELOG
*
ERROR at line 1:
ORA-01503: CREATE CONTROLFILE failed
ORA-01210: data file header is media corrupt
ORA-01110: data file : '/dev/vgbill01/rsystem_bill_1000_1'

system文件头已经被改了,这时侯业务还告诉我,这是个生产库,好吧,这生产库就管理成这样子。但不管怎样,总要把这个库给救回来,看样子,这个没有准备的升级是必须上马了。

4. 逼不得已的升级过程

在10g环境里,试了下startup upgrade可以把数据库拉起来,这给我大大地松了口气。
9i升10g,在9i环境里需要跑一下utlu102i_2.sql脚本,现在9i环境打开不了库,我就在10g环境里跑这个脚本,
报错如下:

SQL> alter database open upgrade;

Database altered.

SQL> @utlu102i_2.sql
SP2-0734: unknown command beginning "Connecting..." - rest of line ignored.
SP2-0734: unknown command beginning "Fetched rd..." - rest of line ignored.
SP2-0734: unknown command beginning "its affili..." - rest of line ignored.
ERROR:
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [],
[], [], [], [], []


DECLARE
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [],
[], [], [], [], []


ERROR:
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [],
[], [], [], [], []

看了下alert日志,报ORA-00600 [17069]

Mon Sep 25 02:32:42 EAT 2017
Errors in file /ora9i/oracle/admin/bill/udump/bill_ora_24100.trc:
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [], [], [], [], [], []
Mon Sep 25 02:32:52 EAT 2017
Errors in file /ora9i/oracle/admin/bill/udump/bill_ora_24100.trc:
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [], [], [], [], [], []
Mon Sep 25 02:33:02 EAT 2017
Errors in file /ora9i/oracle/admin/bill/udump/bill_ora_24100.trc:
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [], [], [], [], [], []

看了下trc:

kswscps: ORA-942 on service$ trapped and ignored during OPEN for migrate
kswscps: ORA-942 on service$ trapped and ignored during OPEN for migrate
kswscps: ORA-942 on service$ trapped and ignored during OPEN for migrate
kswscrs: unexpected exception err=604, err2=942
kswsgsn: ORA-942 on service$ trapped and ignored during OPEN for migrate
kswscps: ORA-942 on service$ trapped and ignored during OPEN for migrate
kswscrs: unexpected exception err=604, err2=942
kswsgsn: ORA-942 on service$ trapped and ignored during OPEN for migrate
kzam_upd_props: OCIStmtExecute1
Exception in kzam_upd_props, 942:ORA-00942: table or view does not exist
*** 2017-09-25 02:32:43.007
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [17069], [0xC000000877E9AE98], [], [], [], [], [], []
Current SQL statement for this session:
BEGIN DBMS_OUTPUT.ENABLE(NULL); END;
----- Call Stack Trace -----
calling              call    entry                argument values in hex
location            type    point                (? means dubious value)

报错发生在DBMS_OUTPUT.ENABLE上,感觉应该是系统包出问题了,查了下INVALID的系统包,

SQL> select owner, count(1) from dba_objects where status = 'INVALID' group by owner;

OWNER                            COUNT(1)
------------------------------ ----------
ACCT                                  217
BILL                                    6
CACCT                                  6
SYS                                  332
WMSYS                                  1
COMM                                  175
CHECKSCN                                1
ORDPLUGINS                              1
CTXSYS                                  8
PERFSTAT                                1

好吧,SYS里很多系统包是INVALID,执行utlirp.sql来编译,发现挂住了,检查session_wait,发现执行utlirp的SYS会话在等待row cache lock,cache#是8,是dc_objects,想着完蛋了,估计任何对象都无法创建了。杀掉了执行utlirp.sql的会话,再次执行utlu102i_2.sql,发现还是挂住,也是row cache lock,cache也是dc_objects。
头皮发麻,想着没办法了,死马当活马医了只能。将数据库重启一下,创建sysaux表空间,发现能创建,又给了我很大的心理安慰。
然后就是执行升级脚本

@?/rdbms/admin/catupgrd.sql

数据库很顺利地升级到了10.2.0.5。

6.后记

第二天,主用节点起来了,后面打了个tar包,将主用节点的软件也升级到了10.2.0.5。后面可以思考一下,如果判断数据文件是哪个版本的。
我知道在数据文件头上可以通过BBED看一下,有个地方指明了是哪个版本

关于紫砂壶

感悟技术人生
此条目发表在Oracle故障诊断分类目录,贴了标签。将固定链接加入收藏夹。

5 则回应给 一次迫不得以的升级

  1. LEI说:

    额,这个甲方也太坑了。。。

  2. 匿名说:

    。。。

  3. 刘磊说:

    跌宕起伏,峰回路转,荡气回肠!

  4. 匿名说:

    备库竟然升级了…

  5. 匿名说:

    重启大法好

评论已关闭。