12c口令文件存储在ASM中对DataGuard的影响

1. 说明

默认情况下,Oracle 12c RAC,会将口令文件存储在ASM中,这达到了所有Oracle文件由ASM统一管理,由于ASMFD的存在,避免了误删除,但是有时侯口令文件存储在ASM中,确带来了问题,以下就是这么一种情况。

2. 业务报障

有一天,业务报告无法连接一套Oracle 12c RAC库了,报ORA-00257,归档错误,只允许SYSDBA登录,这套Oracle 12c RAC库,前面实施了ADG,但是还没有上备份,所有归档都是靠脚本删除,具体删除脚本就不列了。

3. 故障诊断

3.1 检查alert日志

2018-01-12T08:00:28.184006+08:00
TT03: Standby redo logfile selected for thread 1 sequence 49349 for destination LOG_ARCHIVE_DEST_2
2018-01-12T09:11:27.521632+08:00
Non critical error ORA-48113 caught while writing to trace file "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_gen0_20458.trc"
Error message: 
Writing to the above trace file is disabled for now on...
2018-01-12T09:11:37.330519+08:00
Non critical error ORA-48113 caught while writing to trace file "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_lmhb_20514.trc"
Error message: 
Writing to the above trace file is disabled for now on...
2018-01-12T09:11:40.266490+08:00
Non critical error ORA-48113 caught while writing to trace file "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_mmon_20574.trc"
Error message: 
Writing to the above trace file is disabled for now on...
2018-01-12T09:12:38.430984+08:00
Non critical error ORA-48113 caught while writing to trace metadata file (.trm)
Trace file name: "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_ora_32481.trc"
Error message: 
Writing to trace metadata is disabled for now on...
Non critical error ORA-48113 caught while writing to trac2018-01-12T11:18:35.686499+08:00
Errors in file /oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_tt00_20785.trc:
ORA-01017: invalid username/password; logon denied
2018-01-12T11:18:38.803014+08:00
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.
      returning error ORA-16191
------------------------------------------------------------
2018-01-12T11:23:38.875352+08:00
Errors in file /oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_tt00_20785.trc:
ORA-01017: invalid username/password; logon denied
2018-01-12T11:23:41.989872+08:00
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.
      returning error ORA-16191
------------------------------------------------------------
2018-01-12T11:24:37.571472+08:00
krsd_check_stuck_arch: stuck archiver condition cleared

写trace文件被禁止,检查了下/oracle文件系统,发现是/oracle文件系统满了,将/oracle文件系统清理后,发现故障依旧,日志还是无法切换。

3.2 恢复业务

检查ASM,归档空间未满,当前数据库是ADG的主库,暂时未找到解决办法,先临时禁用DataGuard的日志传输恢复业务。

SQL> ALTER SYSTEM SET log_archive_dest_state_2='DEFER' SCOPE=BOTH SID='*';

设置log_archive_dest_state_2为defer后,发现日志可以切换了,业务恢复

3.3 分析ORA-16191和ORA-01017

这两个报错解释如下:

$ oerr ora 16191
16191, 0000, "Primary log shipping client not logged on standby"
// *Cause:  An attempt to ship redo to standby without logging on 
// to standby or with invalid user credentials. 
// *Action: Check that primary and standby are using password files and that 
//          both primary and standby have the same SYS password. 
//          Restart primary and/or standby after ensuring that 
//          password file is accessible and REMOTE_LOGIN_PASSWORDFILE
//          initialization parameter is set to SHARED or EXCLUSIVE.

$ oerr ora 1017
01017, 00000, "invalid username/password; logon denied"
// *Cause:
// *Action:

从这里可以显示,DataGuard无法传输日志到备库是因为口令不正确,难道有人修改了SYS口令,另外这套DataGuard的日志是通过DGDBA用户去传输的,难道没有生效。

SQL> show parameter redo_transport_user

NAME                                TYPE        VALUE
------------------------------------ ----------- ------------------------------
redo_transport_user                  string      DGDBA

测试SYS用户登录,发现通过正确的口令无法登录zhyydb1节点,但是可以登录zhyydb2节点

$ sqlplus "sys/xxxxxx@zhyydb1 as sysdba"

SQL*Plus: Release 12.2.0.1.0 Production on Fri Jan 12 11:24:49 2018

Copyright (c) 1982, 2016, Oracle.  All rights reserved.

ERROR:
ORA-01017: invalid username/password; logon denied


Enter user-name:

所以应该是这个原因,造成了DataGuard的问题。在分析ORA-01017过程中,发现节点1上的v$pwfile_users视图为空,检查$ORACLE_HOME/dbs中,没有发现口令文件,所以口令文件应该是存在ASM里的,查看database的配置信息,发现无法获取信息:

$ srvctl config database -d zhyydb
PRCD-1027 : Failed to retrieve database zhyydb
PRCR-1070 : Failed to check if resource ora.zhyydb.db is registered
CRS-0184 : Cannot communicate with the CRS daemon.

检查grid的CRS信息,发现也是无法获取

$ crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

检查CRS的后台资源,发现ora.crsd资源是offline状态

$ crsctl status res -t -init
---------------------------------------------------------------
Name          Target  State    Server      State details      
---------------------------------------------------------------
Cluster Resources
---------------------------------------------------------------
......
ora.crsd
      1        ONLINE  OFFLINE              STABLE
ora.cssd
      1        ONLINE  ONLINE  zhyydb01    STABLE
......
---------------------------------------------------------------
将ora.crsd这个资源拉起来后,集群状态正常

$ crsctl start resource ora.crsd -init
CRS-2672: Attempting to start 'ora.crsd' on 'zhyydb01'
CRS-2676: Start of 'ora.crsd' on 'zhyydb01' succeeded

检查数据库中的v$pwfile_users,已经可以查看到信息。再次通过口令测试SYS用户的登录,可以正常登录。

3.4 恢复DataGuard

恢复DataGuard的日志传输

ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH SID='*';

主库自动将备库缺失的归档传输过去,DataGuard恢复正常

4. 总结

故障的主要原因是,由于没有监控,节点1的/oracle目录满了造成ora.crsd资源异常offline。
由于口令文件存在ASM中,数据库实例无法读取到ASM中的口令文件,造成v$pwfile_users无任何特权用户信息,备库无法登录到主库,造成主库在进行日志切换时,无法将日志传输到备库,造成主库夯住。重启ora.crsd资源后,恢复v$pwfile_users的信息,解决用户登录问题后,主库到备库的日志传输问题恢复。

关于紫砂壶

感悟技术人生
此条目发表在Oracle故障诊断分类目录,贴了标签。将固定链接加入收藏夹。