ora-00600 [kdsgrp1]故障说明

1. 参考文档

ORA-00600 [kdsgrp1]产生的原因请参考:Causes and Solutions for ora-600 [kdsgrp1] (文档 ID 1332252.1)
相关Bug请参考:ORA-600 [kdsgrp1] (文档 ID 285586.1)

虽然MOS上说这个报错是Oracle Buffer Cache的内存块错误,但是没有确认就贸然刷buffer_cache,总是不那么令人信服,这里的一个案例验证确实是内存错误的原因。

2. 故障描述

短信告警xxxx02文件系统满了,发现trace目录中都是cdmp_2016xxxx的转储,查看alert日志,
发现都是ORA-00600 [kdsgrp1]错误。

Thu Nov 17 12:10:57 2016
Sweep [inc][1066936]: completed
Thu Nov 17 12:13:11 2016
Errors in file /oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_101954.trc (incident=1069120):
ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

根据Causes and Solutions for ora-600 [kdsgrp1] (文档 ID 1332252.1)描述,这个报错可能是内存坏块导致。

3. 判断内存坏块

3.1 trc分析

查看xxxx2_j009_101954.trc文件,发现如下信息:

*** 2016-11-17 12:47:14.204
END OF DUMP REDO
* kdsgrp1-2: *************************************************
DDE: Problem Key 'ORA 600 [kdsgrp1]' was flood controlled (0x6) (incident: 1069124)
ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
* kdsgrp1-1: *************************************************
            row 0x1013a273.8 continuation at
            0x1013a273.8 file# 64 block# 1286771 slot 8 not found
KDSTABN_GET: 0 ..... ntab: 0
curSlot: 8 ..... nrows: 0
kdsgrp - dump CR block dba=0x1013a273
Block header dump:  0x1013a273
 Object id on Block? Y
 seg/obj: 0x30694  csc: 0xcf2.1f17662c  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x10138c05 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x1013a273
data_block_dump,data header at 0x7ab0f8064
===============
tsiz: 0x1f98
hsiz: 0xe
pbl: 0x7ab0f8064
     76543210
flag=--------
ntab=0
nrow=0
frre=-1
fsbo=0xe
fseo=0x1f98
avsp=0x1f8a
tosp=0x1f8a
block_row_dump:
end_of_block_dump
kdsgrp1: dump current block dba=0x1013a273
Block dump from cache:
Dump of buffer cache at level 4 for tsn=65 rdba=269722227
BH (0x7af7a67f8) file#: 64 rdba: 0x1013a273 (64/1286771) class: 1 ba: 0x7ab0f8000
  set: 60 pool: 3 bsz: 8192 bsi: 0 sflg: 1 pwc: 0,25
  dbwrid: 2 obj: 198292 objn: 198292 tsn: 65 afn: 64 hint: f
  hash: [0xf146b1a40,0xf146b1a40] lru: [0x60798cd68,0x24799e348]
  lru-flags: hot_buffer
  ckptq: [NULL] fileq: [NULL] objq: [0xceaeeaf40,0xceaeeaf40] objaq: [0xceaeeaf30,0xceaeeaf30]
  use: [NULL] wait: [NULL] fast-cr-pins: 2
  st: SCURRENT md: NULL fpin: 'kdswh05: kdsgrp' tch: 4 le: 0x5afc5d3a0
  flags: remote_transfered
  LRBA: [0x0.0.0] LSCN: [0x0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
Block dump from disk:
buffer tsn: 65 rdba: 0x1013a273 (64/1286771)
scn: 0x0cf2.1f17662c seq: 0x01 flg: 0x04 tail: 0x662c0601
frmt: 0x02 chkval: 0xd036 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x00007F5E8FF75200 to 0x00007F5E8FF77200
7F5E8FF75200 0000A206 1013A273 1F17662C 04010CF2  [....s...,f......]
7F5E8FF75210 0000D036 001A0101 00030694 1F17662C  [6...........,f..]
7F5E8FF75220 1FE80CF2 00321F02 10138C05 00000000  [......2.........]
7F5E8FF75230 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
7F5E8FF75260 00000000 00000000 000EFFFF 1F8A1F98  [................]
7F5E8FF75270 00001F8A 14740000 079C12C0 043405E8  [......t.......4.]
.......

重要信息如下:

* kdsgrp1-1: *************************************************
            row 0x1013a273.8 continuation at
            0x1013a273.8 file# 64 block# 1286771 slot 8 not found

这个块地址是:0x1013a273,而且在trace中,有多次该块的转储

$ grep "not found" /oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_101954.trc
            0x1013a273.4 file# 64 block# 1286771 slot 4 not found
            0x1013a273.1 file# 64 block# 1286771 slot 1 not found
            0x1013a273.8 file# 64 block# 1286771 slot 8 not found
            0x1013a273.4 file# 64 block# 1286771 slot 4 not found
            0x1013a273.1 file# 64 block# 1286771 slot 1 not found
            0x1013a273.8 file# 64 block# 1286771 slot 8 not found

$ grep "Block header dump" /oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_j009_101954.trc
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273
Block header dump:  0x1013a273

在xxxx2_j009_101954.trc文件中,查看这个块的转储内容,发现也是比较奇怪,这个块的转储不是一个正常的块转储格式:

1) 首先块上的ITL槽位信息全是0,这不正常

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
0x02 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000

2) 没有块上数据行的转储信息

3.2 相关对象信息

typ: 1 – DATA 表明是数据块,seg/obj: 0x30694是对象的data_object_id,转成10进制为:198292

SQL> select o.owner, o.object_name, o.SUBOBJECT_NAME, o.object_type from dba_objects o where o.data_object_id = 198292;
OWNER  OBJECT_NAME       SUBOBJECT_NAME  OBJECT_TYPE
------ ----------------- --------------- -------------------
xxxx   xxxxxxxxxxxxxxx   PART1511 TABLE  PARTITION

SQL> select bytes/1024/1024 from dba_segments where segment_name = 'B2O_FLOWBUS_BAK' and owner = 'SOYZ' and partition_name = 'PART1511';
BYTES/1024/1024
---------------
1600

通过数据块上的DBA地址(0x1013a273)也可以确认

SQL> select  dbms_utility.data_block_address_file(to_number('1013a273', 'xxxxxxxxxxxxxxx')) as fno,
  2          dbms_utility.data_block_address_block(to_number('1013a273', 'xxxxxxxxxxxxxxx')) as bno
  3  from dual;
       FNO        BNO
---------- ----------
        64    1286771

SQL> set linesize 180
SQL> column owner format a15
SQL> column segment_type format a20
SQL> column segment_name format a30
SQL> column PARTITION_NAME format a30
SQL> column tablespace_name format a20
SQL> select owner, segment_type, segment_name, PARTITION_NAME, tablespace_name
    2 from dba_extents
    3 where file_id = 64 and 1286771 between block_id and block_id+blocks-1;

OWNER SEGMENT_TYPE    SEGMENT_NAME    PARTITION_NAME TABLESPACE_NAME
----- --------------- --------------- -------------- ---------------
xxxx  TABLE PARTITION xxxxxxxxxxxxxxx PART1511       TBS_xxxxx

这个块属于问题表的PART1511分区,大小是1600M。

3.3 校验表结构

SQL> @$ORACLE_HOME/rdbms/admin/utlvalid.sql
Table created.

SQL> analyze table SOYZ.B2O_FLOWBUS_BAK partition(PART1511) validate structure online;
Table analyzed.

SQL> select * from INVALID_ROWS;
no rows selected

没发现有问题的行信息

3.4 dbv分析

$ dbv file=/oradata01/xxxx/xxxxx1.dbf BLOCKSIZE=8192 START=1286770 END=1286772

DBVERIFY: Release 11.2.0.4.0 - Production on Thu Nov 17 13:10:49 2016

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

DBVERIFY - Verification starting : FILE = /oradata01/crmhis/tbs_so_yz1.dbf


DBVERIFY - Verification complete

Total Pages Examined         : 3
Total Pages Processed (Data) : 3
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 0
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 0
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 2407985090 (3371.2407985090)

dbv分析结果,也不是坏块

3.5 bbed查看块结构

$ bbed filename=/oradata01/xxxx/xxxxx1.dbf blocksize=8192 password=blockedit

BBED> set block 1286771
        BLOCK#          1286771

BBED> map
 File: /oradata01/crmhis/tbs_so_yz1.dbf (0)
 Block: 1286771                                Dba:0x00000000
------------------------------------------------------------
 KTB Data Block (Table/Cluster)
 struct kcbh, 20 bytes                      @0       
 struct ktbbh, 72 bytes                     @20      
 struct kdbh, 14 bytes                      @100     
 struct kdbt[1], 4 bytes                    @114     
 sb2 kdbr[19]                               @118     
 ub1 freespace[1118]                        @156     
 ub1 rowdata[6914]                          @1274    
 ub4 tailchk                                @8188  

bbed可以看到正确的块结构,也可以看到块上的行结构。

3.6 转储块

SQL> oradebug setmypid
Statement processed.
SQL> oradebug tracefile_name
/oracle/diag/rdbms/xxxx/xxxxx2/trace/xxxxx2_ora_39390.trc
SQL> alter system dump datafile 64 block 1286771;
System altered.

块转储内容:

Block header dump:  0x1013a273
 Object id on Block? Y
 seg/obj: 0x30694  csc: 0xcf2.1f17662c  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x10138c05 ver: 0x01 opc: 0
     inc: 0  exflg: 0
 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x1013a273
data_block_dump,data header at 0x7f0278442a64
===============
tsiz: 0x1f98
hsiz: 0xe
pbl: 0x7f0278442a64
     76543210
flag=--------
ntab=0
nrow=0
frre=-1
fsbo=0xe
fseo=0x1f98
avsp=0x1f8a
tosp=0x1f8a
block_row_dump:
end_of_block_dump
GLOBAL CACHE ELEMENT DUMP (address: 0x5afc5d3a0):
  id1: 0x13a273 id2: 0x40 pkey: OBJ#198292 block: (64/1286771)
  lock: S rls: 0x0 acq: 0x0 latch: 15
  flags: 0x20 fair: 0 recovery: 0 fpin: 'kdswh05: kdsgrp'
  bscn: 0xcf2.1f17662c bctx: (nil) write: 0 scan: 0x0 
  lcp: (nil) lnk: [NULL] lch: [0x7af7a6930,0x7af7a6930]
  seq: 17643 hist: 334 144:0 213 7 352 197 48 121 334 67 144:0 98 7
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x08000000 state: SCURRENT tsn: 65 tsh: 7
      addr: 0x7af7a67f8 obj: 198292 cls: DATA bscn: 0xcf2.1f17662c
 GCS CLIENT 0x5afc5d418,8882 resp[(nil),0x13a273.40] pkey 198292.0
   grant 1 cvt 0 mdrole 0x1 st 0x100 lst 0x20 GRANTQ rl LOCAL
   master 4 owner 2 sid 1 remote[(nil),0] hist 0xc02f060178300bc
   history 0x3c.0x1.0xc.0x3c.0x1.0xc.0x3c.0x1.0xc.0x0.
   cflag 0x0 sender 4 flags 0x0 replay# 0 abast (nil).x0.1 dbmap (nil)
   disk: 0x0000.00000000 write request: 0x0000.00000000
   pi scn: 0x0000.00000000 sq[(nil),(nil)]
   msgseq 0x1 updseq 0x0 reqids[9891,0,0] infop (nil) lockseq x1be9
   pkey 198292.0
   hv 98 [stat 0x0, 4->4, wm 32768, RMno 0, reminc 7, dom 0]
   kjga st 0x4, step 0.0.0, cinc 27, rmno 8, flags 0x0
   lb 0, hb 0, myb 988238, drmb 988238, apifrz 0
 GCS CLIENT END
2016-11-17 13:05:25.812617 : kjbmbassert [0x13a273.40]
2016-11-17 13:05:25.812674 : kjbmsassert(0x13a273.40)(to 1)(lvl 4)
2016-11-17 13:05:25.812731 : kjbmsassert(0x13a273.40)(to 3)(lvl 4)
2016-11-17 13:05:25.812752 : kjbmsassert(0x13a273.40)(to 4)(lvl 4)
End dump data blocks tsn: 65 file#: 64 minblk 1286771 maxblk 1286771

使用alter system dump datafile 64 block 1286771,转储的是oracle buffer cache中的内存块,而DBV和bbed是直接读的物理文件。
从这里可以确定,内存块有问题,而物理文件上的块没问题。

4. 解决故障

由于xxxx是4节点RAC,所以依次把4个节点的buffer_cache刷新
SQL> alter system flush buffer_cache;

刷完后,不再报ORA-00600 [kdsgrp1],但是又报了ORA-8102错:

Thu Nov 17 15:16:32 2016
Errors in file /oracle/diag/rdbms/xxxx/xxxxx1/trace/xxxxx1_j012_213876.trc:
Thu Nov 17 15:16:33 2016
Dumping diagnostic data in directory=[cdmp_20161117151633], requested by (instance=1, osid=213876 (J012)), summary=[abnormal process termination].
Thu Nov 17 15:18:25 2016
Errors in file /oracle/diag/rdbms/xxxx/xxxxx1/trace/xxxxx1_j012_213876.trc:
Thu Nov 17 15:18:26 2016
Dumping diagnostic data in directory=[cdmp_20161117151826], requested by (instance=1, osid=213876 (J012)), summary=[abnormal process termination].
Thu Nov 17 15:24:24 2016
Dumping diagnostic data in directory=[cdmp_20161117152424], requested by (instance=3, osid=103831 (J002)), summary=[abnormal process termination].

查看xxxxx1_j012_213876.trc文件:

*** 2016-11-17 15:16:32.481
*** SESSION ID:(3013.14263) 2016-11-17 15:16:32.481
*** CLIENT ID:() 2016-11-17 15:16:32.481
*** SERVICE NAME:(SYS$USERS) 2016-11-17 15:16:32.481
*** MODULE NAME:() 2016-11-17 15:16:32.481
*** ACTION NAME:() 2016-11-17 15:16:32.481
 
oer 8102.2 - obj# 202016, rdba: 0x56302b85(afn 344, blk# 3156869)
kdk key 8102.2:
  ncol: 2, len: 19
  key: (19):  07 c6 0f 03 1f 34 05 0d 0a 00 03 15 05 15 14 95 11 00 0d
  mask: (4096): 
 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

这是索引与数据块不一致,重建索引即可

SQL> set linesize 180
SQL> column owner format a15
SQL> column segment_type format a20
SQL> column segment_name format a30
SQL> column PARTITION_NAME format a30
SQL> column tablespace_name format a20
SQL> select owner, segment_type, segment_name, PARTITION_NAME, tablespace_name
2 from dba_extents
3 where file_id = 344 and 3156869 between block_id and block_id+blocks-1;

OWNER SEGMENT_TYPE SEGMENT_NAME         PARTITION_NAME  TABLESPACE_NAME
----- ------------ -------------------- --------------- ---------------
xxxx  INDEX        IDX_xxxxxxxxxxxxxxxx                 TBS_xxxxx

SQL> select table_owner, table_name, GLOBAL_STATS from dba_indexes where index_name = 'IDX_BUSI_ORDER_BO_ID' and owner = 'SOYZ';

TABLE_OWNER                    TABLE_NAME                     GLO
------------------------------ ------------------------------ ---
xxxx                           xxxxxxxxxxxxxx                 YES

关于紫砂壶

感悟技术人生
此条目发表在ORA-分类目录,贴了标签。将固定链接加入收藏夹。