1. 说明
默认情况下,Oracle 12c RAC,会将口令文件存储在ASM中,这达到了所有Oracle文件由ASM统一管理,由于ASMFD的存在,避免了误删除,但是有时侯口令文件存储在ASM中,确带来了问题,以下就是这么一种情况。
2. 业务报障
有一天,业务报告无法连接一套Oracle 12c RAC库了,报ORA-00257,归档错误,只允许SYSDBA登录,这套Oracle 12c RAC库,前面实施了ADG,但是还没有上备份,所有归档都是靠脚本删除,具体删除脚本就不列了。
3. 故障诊断
3.1 检查alert日志
2018-01-12T08:00:28.184006+08:00 TT03: Standby redo logfile selected for thread 1 sequence 49349 for destination LOG_ARCHIVE_DEST_2 2018-01-12T09:11:27.521632+08:00 Non critical error ORA-48113 caught while writing to trace file "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_gen0_20458.trc" Error message: Writing to the above trace file is disabled for now on... 2018-01-12T09:11:37.330519+08:00 Non critical error ORA-48113 caught while writing to trace file "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_lmhb_20514.trc" Error message: Writing to the above trace file is disabled for now on... 2018-01-12T09:11:40.266490+08:00 Non critical error ORA-48113 caught while writing to trace file "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_mmon_20574.trc" Error message: Writing to the above trace file is disabled for now on... 2018-01-12T09:12:38.430984+08:00 Non critical error ORA-48113 caught while writing to trace metadata file (.trm) Trace file name: "/oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_ora_32481.trc" Error message: Writing to trace metadata is disabled for now on... Non critical error ORA-48113 caught while writing to trac2018-01-12T11:18:35.686499+08:00 Errors in file /oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_tt00_20785.trc: ORA-01017: invalid username/password; logon denied 2018-01-12T11:18:38.803014+08:00 Error 1017 received logging on to the standby ------------------------------------------------------------ Check that the primary and standby are using a password file and remote_login_passwordfile is set to SHARED or EXCLUSIVE, and that the SYS password is same in the password files. returning error ORA-16191 ------------------------------------------------------------ 2018-01-12T11:23:38.875352+08:00 Errors in file /oracle/app/oracle/diag/rdbms/zhyydb/zhyydb1/trace/zhyydb1_tt00_20785.trc: ORA-01017: invalid username/password; logon denied 2018-01-12T11:23:41.989872+08:00 Error 1017 received logging on to the standby ------------------------------------------------------------ Check that the primary and standby are using a password file and remote_login_passwordfile is set to SHARED or EXCLUSIVE, and that the SYS password is same in the password files. returning error ORA-16191 ------------------------------------------------------------ 2018-01-12T11:24:37.571472+08:00 krsd_check_stuck_arch: stuck archiver condition cleared
写trace文件被禁止,检查了下/oracle文件系统,发现是/oracle文件系统满了,将/oracle文件系统清理后,发现故障依旧,日志还是无法切换。
3.2 恢复业务
检查ASM,归档空间未满,当前数据库是ADG的主库,暂时未找到解决办法,先临时禁用DataGuard的日志传输恢复业务。
SQL> ALTER SYSTEM SET log_archive_dest_state_2='DEFER' SCOPE=BOTH SID='*';
设置log_archive_dest_state_2为defer后,发现日志可以切换了,业务恢复
3.3 分析ORA-16191和ORA-01017
这两个报错解释如下:
$ oerr ora 16191 16191, 0000, "Primary log shipping client not logged on standby" // *Cause: An attempt to ship redo to standby without logging on // to standby or with invalid user credentials. // *Action: Check that primary and standby are using password files and that // both primary and standby have the same SYS password. // Restart primary and/or standby after ensuring that // password file is accessible and REMOTE_LOGIN_PASSWORDFILE // initialization parameter is set to SHARED or EXCLUSIVE. $ oerr ora 1017 01017, 00000, "invalid username/password; logon denied" // *Cause: // *Action:
从这里可以显示,DataGuard无法传输日志到备库是因为口令不正确,难道有人修改了SYS口令,另外这套DataGuard的日志是通过DGDBA用户去传输的,难道没有生效。
SQL> show parameter redo_transport_user NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ redo_transport_user string DGDBA
测试SYS用户登录,发现通过正确的口令无法登录zhyydb1节点,但是可以登录zhyydb2节点
$ sqlplus "sys/xxxxxx@zhyydb1 as sysdba" SQL*Plus: Release 12.2.0.1.0 Production on Fri Jan 12 11:24:49 2018 Copyright (c) 1982, 2016, Oracle. All rights reserved. ERROR: ORA-01017: invalid username/password; logon denied Enter user-name:
所以应该是这个原因,造成了DataGuard的问题。在分析ORA-01017过程中,发现节点1上的v$pwfile_users视图为空,检查$ORACLE_HOME/dbs中,没有发现口令文件,所以口令文件应该是存在ASM里的,查看database的配置信息,发现无法获取信息:
$ srvctl config database -d zhyydb PRCD-1027 : Failed to retrieve database zhyydb PRCR-1070 : Failed to check if resource ora.zhyydb.db is registered CRS-0184 : Cannot communicate with the CRS daemon.
检查grid的CRS信息,发现也是无法获取
$ crsctl status res -t CRS-4535: Cannot communicate with Cluster Ready Services CRS-4000: Command Status failed, or completed with errors.
检查CRS的后台资源,发现ora.crsd资源是offline状态
$ crsctl status res -t -init --------------------------------------------------------------- Name Target State Server State details --------------------------------------------------------------- Cluster Resources --------------------------------------------------------------- ...... ora.crsd 1 ONLINE OFFLINE STABLE ora.cssd 1 ONLINE ONLINE zhyydb01 STABLE ...... --------------------------------------------------------------- 将ora.crsd这个资源拉起来后,集群状态正常 $ crsctl start resource ora.crsd -init CRS-2672: Attempting to start 'ora.crsd' on 'zhyydb01' CRS-2676: Start of 'ora.crsd' on 'zhyydb01' succeeded
检查数据库中的v$pwfile_users,已经可以查看到信息。再次通过口令测试SYS用户的登录,可以正常登录。
3.4 恢复DataGuard
恢复DataGuard的日志传输
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH SID='*';
主库自动将备库缺失的归档传输过去,DataGuard恢复正常
4. 总结
故障的主要原因是,由于没有监控,节点1的/oracle目录满了造成ora.crsd资源异常offline。
由于口令文件存在ASM中,数据库实例无法读取到ASM中的口令文件,造成v$pwfile_users无任何特权用户信息,备库无法登录到主库,造成主库在进行日志切换时,无法将日志传输到备库,造成主库夯住。重启ora.crsd资源后,恢复v$pwfile_users的信息,解决用户登录问题后,主库到备库的日志传输问题恢复。