OGG故障处理案例

1. 故障说明

vsf打补丁和NBU归档备份从带库改磁盘时,在这个系统上的OGG在起动时报缺归档。
故障的原因是:集群两个节点上备归档的脚本不一致,节点1备归档的脚本正确,
节点2上备归档的脚本未修改,导致多备了归档。并且NBU备份策略已经从带库改到磁盘,再修改回来比较麻烦(涉及到一些其它问题)。
所以先看下问题的情况了,如果缺的太多,为了数据一致性的问题,该搞回来也只能搞。

2. 分析OGG抽取进程的检查点和归档信息

GGSCI (dbxx01) 2> info ex_xx,detail

EXTRACT    EX_xx     Last Started 2016-12-14 22:57   Status STOPPED
Checkpoint Lag       00:00:03 (updated 07:51:23 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2017-02-20 23:57:10  Thread 1, Seqno 105080, RBA 1517464632
                     SCN 3409.2815500250 (14644359012314)
Log Read Checkpoint  Oracle Redo Logs
                     2017-02-20 23:57:16  Thread 2, Seqno 57042, RBA 1068032
                     SCN 3409.2815599647 (14644359111711)

  Target Extract Trails:

  Trail Name                                       Seqno        RBA     Max MB Trail Type

  /gg/dirdat/xx                                     6986   26581772        200 EXTTRAIL  

  Extract Source                          Begin             End             

  /redo/xxxxx/redo61.dbf                  2016-12-14 22:05  2017-02-20 23:57
  Not Available                           2016-12-06 22:51  2016-12-14 22:09
  /redo/xxxxx/redo51.dbf                  2016-07-19 23:46  2016-12-06 22:51
  Not Available                           2016-06-14 23:45  2016-07-19 23:46
  Not Available                           2016-06-13 22:24  2016-06-14 23:45
  /redo/xxxxx/redo71.dbf                  2016-05-10 09:15  2016-06-13 22:28
  Not Available                           * Initialized *   2016-05-10 09:15


Current directory    /gg

Report file          /gg/dirrpt/EX_xx.rpt
Parameter file       /gg/dirprm/ex_xx.prm
Checkpoint file      /gg/dirchk/EX_xx.cpe
Process file         /gg/dirpcs/EX_xx.pce
Stdout file          /gg/dirout/EX_xx.out
Error log            /gg/ggserr.log


GGSCI (dbxx01) 3> info ex_xx, showch

EXTRACT    EX_xx     Last Started 2016-12-14 22:57   Status STOPPED
Checkpoint Lag       00:00:03 (updated 07:51:29 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2017-02-20 23:57:10  Thread 1, Seqno 105080, RBA 1517464632
                     SCN 3409.2815500250 (14644359012314)
Log Read Checkpoint  Oracle Redo Logs
                     2017-02-20 23:57:16  Thread 2, Seqno 57042, RBA 1068032
                     SCN 3409.2815599647 (14644359111711)


Current Checkpoint Detail:

Read Checkpoint #1

  Oracle Threaded Redo Log

  Startup Checkpoint (starting position in the data source):
    Thread #: 1
    Sequence #: 95790
    RBA: 0
    Timestamp: 2016-12-14 22:05:45.000000
    SCN: 3396.3022848381 (14588731785597)
    Redo File: 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
    Thread #: 1
    Sequence #: 105068
    RBA: 1291909136
    Timestamp: 2017-02-20 21:00:37.000000
    SCN: 3409.2651512138 (14644195024202)
    Redo File: Not Available

  Current Checkpoint (position of last record read in the data source):
    Thread #: 1
    Sequence #: 105080
    RBA: 1517464632
    Timestamp: 2017-02-20 23:57:10.000000
    SCN: 3409.2815500250 (14644359012314)
    Redo File: Not Available

Read Checkpoint #2

  Oracle Threaded Redo Log

  Startup Checkpoint (starting position in the data source):
    Thread #: 2
    Sequence #: 51522
    RBA: 1091154960
    Timestamp: 2016-12-14 19:11:10.000000
    SCN: 3396.2867266378 (14588576203594)
    Redo File: Not Available

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
    Thread #: 2
    Sequence #: 57042
    RBA: 1067024
    Timestamp: 2017-02-20 23:57:16.000000
    SCN: 3409.2815599647 (14644359111711)
    Redo File: /redo/xxxxx/redo61.dbf

  Current Checkpoint (position of last record read in the data source):
    Thread #: 2
    Sequence #: 57042
    RBA: 1068032
    Timestamp: 2017-02-20 23:57:16.000000
    SCN: 3409.2815599647 (14644359111711)
    Redo File: /redo/xxxxx/redo61.dbf

Write Checkpoint #1

  GGS Log Trail

  Current Checkpoint (current write position):
    Sequence #: 6986
    RBA: 26581772
    Timestamp: 2017-02-20 23:57:19.794808
    Extract Trail: /gg/dirdat/xx
    Trail Type: EXTTRAIL

Header:
  Version = 2
  Record Source = A
  Type = 11
  # Input Checkpoints = 2
  # Output Checkpoints = 1

File Information:
  Block Size = 2048
  Max Blocks = 100
  Record Length = 4096
  Current Offset = 0

Configuration:
  Data Source = 3
  Transaction Integrity = 1
  Task Type = 0

Status:
  Start Time = 2016-12-14 22:57:15
  Last Update Time = 2017-02-20 23:57:19
  Stop Status = G
  Last Result = 0

从当前OGG归档目录里看,thread 2的归档全在,但是thread 1的归档缺了几个。
OGG的thread 1的recovery checkpoint要求的归档日志是105068,而当前有的归档是105076
而thread 1的Current Checkpoint要求的归档是105080,全部都在,所以判断问题不是太严重。

$ ls | grep ^1_ | sort
1_105076_863885595.dbf
1_105077_863885595.dbf
1_105078_863885595.dbf
1_105079_863885595.dbf
1_105080_863885595.dbf
1_105081_863885595.dbf
1_105082_863885595.dbf
1_105083_863885595.dbf
1_105084_863885595.dbf
1_105085_863885595.dbf
1_105086_863885595.dbf
1_105087_863885595.dbf
1_105088_863885595.dbf
1_105089_863885595.dbf
1_105090_863885595.dbf
1_105091_863885595.dbf

从经验来看,OGG的Recovery Checkpoint和Current Checkpoint之间,其实只涉及一些长事务,这些长事务一般是dblink。
业务数据其实已经全部commit了,但是dblink会话不断,事务总是还在,但实际上对数据一致性基本无影响。
并且在这套系统上是部署了杀长事务的脚本,3小时前的事务都会被杀掉。

3. 修改OGG抽取进程检查点

综上考虑,就不恢复归档了,直接修改OGG抽取进程的Recovery Checkpoint。修改过程如下:

GGSCI (dbxx01) 1> ALTER EXTRACT EX_xx, THREAD 1, IOEXTSEQNO 105076 IOEXTRBA 0

2017-02-21 07:55:20  INFO    OGG-00988  WARNING: Unsupported operation. This might cause transactional inconsistency. Modifying input checkpoint #1, Oracle thread #1 of EXTRACT: ioseq = 105076 iorba = 0.
Are you sure you want to continue? y
EXTRACT altered.

确认Thread 1的Recovery Checkpoint修改正确

GGSCI (dbxx01) 3> info ex_xx, showch

EXTRACT    EX_xx     Initialized   2017-02-21 07:55   Status STOPPED
Checkpoint Lag       07:58:04 (updated 00:00:13 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2017-02-20 23:57:10  Thread 1, Seqno 105080, RBA 1517464632
                     SCN 3409.2815500250 (14644359012314)
Log Read Checkpoint  Oracle Redo Logs
                     2017-02-20 23:57:16  Thread 2, Seqno 57042, RBA 1068032
                     SCN 3409.2815599647 (14644359111711)


Current Checkpoint Detail:

Read Checkpoint #1

  Oracle Threaded Redo Log

  Startup Checkpoint (starting position in the data source):
    Thread #: 1
    Sequence #: 95790
    RBA: 0
    Timestamp: 2016-12-14 22:05:45.000000
    SCN: 3396.3022848381 (14588731785597)
    Redo File: 

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
    Thread #: 1
    Sequence #: 105076
    RBA: 0
    Timestamp: 
    SCN: Not available
    Redo File: 

  Current Checkpoint (position of last record read in the data source):
    Thread #: 1
    Sequence #: 105080
    RBA: 1517464632
    Timestamp: 2017-02-20 23:57:10.000000
    SCN: 3409.2815500250 (14644359012314)
    Redo File: Not Available

Read Checkpoint #2

  Oracle Threaded Redo Log

  Startup Checkpoint (starting position in the data source):
    Thread #: 2
    Sequence #: 51522
    RBA: 1091154960
    Timestamp: 2016-12-14 19:11:10.000000
    SCN: 3396.2867266378 (14588576203594)
    Redo File: Not Available

  Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
    Thread #: 2
    Sequence #: 57042
    RBA: 1067024
    Timestamp: 2017-02-20 23:57:16.000000
    SCN: 3409.2815599647 (14644359111711)
    Redo File: /redo/xxxxx/redo61.dbf

  Current Checkpoint (position of last record read in the data source):
    Thread #: 2
    Sequence #: 57042
    RBA: 1068032
    Timestamp: 2017-02-20 23:57:16.000000
    SCN: 3409.2815599647 (14644359111711)
    Redo File: /redo/xxxxx/redo61.dbf

Write Checkpoint #1

  GGS Log Trail

  Current Checkpoint (current write position):
    Sequence #: 6986
    RBA: 26581772
    Timestamp: 2017-02-20 23:57:19.794808
    Extract Trail: /gg/dirdat/xx
    Trail Type: EXTTRAIL

Header:
  Version = 2
  Record Source = U
  Type = 11
  # Input Checkpoints = 2
  # Output Checkpoints = 1

File Information:
  Block Size = 2048
  Max Blocks = 100
  Record Length = 4096
  Current Offset = 0

Configuration:
  Data Source = 3
  Transaction Integrity = 1
  Task Type = 0

Status:
  Start Time = 2017-02-21 07:55:20
  Last Update Time = 2017-02-21 07:55:20
  Stop Status = G
  Last Result = 0

4. 解决归档缺失问题

Thread 1的Recovery Checkpoint修改正确了。启动OGG的抽取进程发现报如下错误:

***********************************************************************
**                     Run Time Messages                             **
***********************************************************************


2017-02-21 07:56:02  INFO    OGG-00732  Found crash recovery marker from thread #2 on sequence 57043 at RBA 2064. Aborting uncommitted transactions.

2017-02-21 07:56:12  INFO    OGG-01021  Command received from GGSCI: STATUS.

2017-02-21 07:56:17  INFO    OGG-01021  Command received from GGSCI: STATUS.

Source Context :
  SourceModule            : [er.redo.ora.rtc]
  SourceID                : [/scratch/aime1/adestore/views/aime1_staoi06/oggcore/OpenSys/src/app/er/redo/oracle/rtc.c]
  SourceFunction          : [RTC_producer]
  SourceLine              : [2115]

2017-02-21 07:58:05  ERROR   OGG-01028  Could not find archived log for sequence 105078 thread 1 under default destinations SQL <SELECT  name    FROM v$archived_log
   WHERE sequence# = :ora_seq_no AND         thread# = :ora_thread AND         resetlogs_id = :ora_resetlog_id AND         archived = 'YES' AND         deleted = 'N
O' AND         name not like '+%'         AND standby_dest = 'NO' >, error retrieving redo file name for sequence 105078, archived = 1, use_alternate = 0.

***********************************************************************
*                   ** Run Time Statistics **                         *
***********************************************************************


Report at 2017-02-21 07:58:06 (activity since 2017-02-21 07:55:58)

Output to /gg/dirdat/xx:

No records extracted.

报缺Thread 1的105078归档,这个归档文件前面确认过,是存在。
检查OGG的参数文件,由于2节点还没好,数据库实例是没有起的,确认TRANLOGOPTIONS ALTARCHIVELOGDEST行被注释掉。

numfiles 8000
dynamicresolution
DBOPTIONS ALLOWUNUSEDCOLUMN
GETUPDATEBEFORES
THREADOPTIONS PROCESSTHREADS SELECT 1
THREADOPTIONS PROCESSTHREADS SELECT 2
--TRANLOGOPTIONS ALTARCHIVELOGDEST primary instance xxxxx1 /archgg/xxxxx, ALTARCHIVELOGDEST instance  xxxxx2 /archgg/xxxxx
BR BROFF

检查归档信息,发现105078归档已经被删除了

SQL> select * from (
  2  SELECT  name,dest_id,thread#,sequence#,deleted,status  FROM v$archived_log
  3     WHERE sequence# >= 105078 and thread# = 1
  4  order by sequence#
  5  ) where rownum < 20;
NAME                                     DEST_ID    THREAD#  SEQUENCE#  DELETED STATUS
------------------------------------- ---------- ---------- ----------  ------- ------
                                               1          1     105078  YES     D     
                                               2          1     105078  YES     D     
                                               1          1     105079  YES     D     
/archgg/xxxxx/1_105079_863885595.dbf           2          1     105079  NO      A     
                                               1          1     105080  YES     D     
/archgg/xxxxx/1_105080_863885595.dbf           2          1     105080  NO      A     
                                               1          1     105081  YES     D     
/archgg/xxxxx/1_105081_863885595.dbf           2          1     105081  NO      A     
                                               1          1     105082  YES     D     
/archgg/xxxxx/1_105082_863885595.dbf           2          1     105082  NO      A     
                                               1          1     105083  YES     D     
/archgg/xxxxx/1_105083_863885595.dbf           2          1     105083  NO      A     
                                               1          1     105084  YES     D     
/archgg/xxxxx/1_105084_863885595.dbf           2          1     105084  NO      A     
                                               1          1     105085  YES     D     
/archgg/xxxxx/1_105085_863885595.dbf           2          1     105085  NO      A     
                                               1          1     105086  YES     D     
/archgg/xxxxx/1_105086_863885595.dbf           2          1     105086  NO      A     
                                               1          1     105087  YES     D     
19 rows selected

先不去管这个归档为什么被删掉了,既然在文件系统上存在,就先注册下

SQL> ALTER DATABASE REGISTER PHYSICAL LOGFILE '/archgg/xxxxx/1_105078_863885595.dbf';

Database altered.

注册完再起OGG抽取进程就正常了。

关于紫砂壶

感悟技术人生
此条目发表在GoldenGate分类目录,贴了标签。将固定链接加入收藏夹。