Oracle 11g RAC: Diagnosing and Fixing an RDS Interconnect Heartbeat Failure

1. Fault description

On an Oracle 11g RAC cluster, node 2 rebooted after a failure, and Grid Infrastructure (GI) could not be started on it afterwards.

2. Fault diagnosis

Check the resource status:

[oracle@xxxx302 ~]$ crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

Check the GI services:

[oracle@xxxx302 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

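Judging from the output above, the problem is with CSSD. List the OHASD-managed (lower stack) resources with -init: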

[oracle@xxxx302 ~]$ crsctl status res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE                                                   
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        OFFLINE OFFLINE                                                   
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                                                   
ora.cssdmonitor
      1        ONLINE  ONLINE       xxxx302                                      
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  OFFLINE                                                   
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       xxxx302                                      
ora.gpnpd
      1        ONLINE  ONLINE       xxxx302                                      
ora.mdnsd
      1        ONLINE  ONLINE       xxxx302     
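
The scan of this listing can be automated. Below is a minimal sketch, assuming the two-lines-per-resource layout shown above (a line starting with "ora." followed by a line with instance number, TARGET and STATE). The function name list_down_resources is hypothetical, not part of any Oracle tooling.

```shell
# list_down_resources: read `crsctl status res -t -init` output on stdin and
# print every resource whose TARGET is ONLINE but whose STATE is OFFLINE.
list_down_resources() {
    awk '
        # A resource name line: remember the name and move on.
        /^ora\./ { name = $1; next }
        # An instance line directly below a name: $2 is TARGET, $3 is STATE.
        name != "" && $2 == "ONLINE" && $3 == "OFFLINE" { print name }
        # Any other line ends the current resource.
        { name = "" }
    '
}
```

Usage: crsctl status res -t -init | list_down_resources. For the listing above this would report ora.cluster_interconnect.haip, ora.crsd, ora.cssd, ora.ctssd, ora.drivers.acfs and ora.evmd; ora.cssd is the one the rest of the stack waits for.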

This confirms that ora.cssd is failing to come online.

3. Diagnosing the CSSD failure

The GI alert log reports the following errors:

2017-02-23 09:29:04.197: 
[cssd(14265)]CRS-1605:CSSD voting file is online: /ocrvote/xxxxdb3/vote1; details in /oracle/product/11.2.0/grid/log/xxxx302/cssd/ocssd.log.
2017-02-23 09:38:35.463: 
[/oracle/product/11.2.0/grid/bin/cssdagent(14248)]CRS-5818:Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:71:10} in /oracle/product/11.2.0/grid/log/xxxx302/agent/ohasd/oracssdagent_root//oracssdagent_root.log.
2017-02-23 09:38:35.464: 
[cssd(14265)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/product/11.2.0/grid/log/xxxx302/cssd/ocssd.log

ocssd fails during startup. ocssd.log contains the following:

2017-02-26 08:35:39.154: [    CSSD][2098923264]clssnmvDHBValidateNcopy: Copying unique 1487686105 to node structure for node xxxx301, number 1; previous unique value was 0
2017-02-26 08:35:39.154: [    CSSD][2098923264]clssnmvDHBValidateNcopy: node 1, xxxx301, has a disk HB, but no network HB, DHB has rcfg 376409126, wrtcnt, 22021294, LATS 4294241630, lastSeqNo 22021254, uniqueness 1487686105, timestamp 1488069339/454084304
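
ocssd.log is large, so it helps to pull out just this symptom. The following is a rough sketch that relies only on the message wording shown above; the function name nodes_without_network_hb is made up for illustration.

```shell
# nodes_without_network_hb: print the distinct node names for which the given
# ocssd.log reports "has a disk HB, but no network HB".
nodes_without_network_hb() {
    log_file=$1
    grep 'has a disk HB, but no network HB' "$log_file" |
        # Keep only the node name between "node <n>, " and ", has a disk HB".
        sed -n 's/.*clssnmvDHBValidateNcopy: node [0-9][0-9]*, \([^,]*\), has a disk HB.*/\1/p' |
        sort -u
}
```

Usage: nodes_without_network_hb /oracle/product/11.2.0/grid/log/xxxx302/cssd/ocssd.log (the path given in the alert log above). Any node name printed here is one whose voting-disk heartbeat is visible while its network heartbeat is not.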

Node 2 sees a disk heartbeat from node 1 but no network heartbeat. Check the Oracle private interconnect; ping succeeds:

[root@xxxx301 ~]# ping xxxx302-priv
PING xxxx302-priv (192.168.7.25) 56(84) bytes of data.
64 bytes from xxxx302-priv (192.168.7.25): icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from xxxx302-priv (192.168.7.25): icmp_seq=2 ttl=64 time=0.044 ms
64 bytes from xxxx302-priv (192.168.7.25): icmp_seq=3 ttl=64 time=0.055 ms

Check the interconnect network configuration:

[grid@xxxx301 ~]$ oifcfg getif
ib1  192.168.7.0  global  cluster_interconnect
bond0  xxx.xxx.7.0  global  public

Check the IP addresses and confirm that the private IP is on the same interface name on both nodes:

[root@xxxx301 ~]# ip addr | grep ib
8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:24:8a:07:03:00:66:92:71 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 171.252.7.24/24 brd 171.252.7.255 scope global ib0
9: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:24:8a:07:03:00:66:92:72 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.7.24/24 brd 192.168.7.255 scope global ib1
    inet 169.254.25.35/16 brd 169.254.255.255 scope global ib1:1
10: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4c:79:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 173.252.7.24/24 brd 173.252.7.255 scope global ib2
11: ib3: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN qlen 256
    link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4c:79:42 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

[root@xxxx302 ~]# ip addr | grep ib
8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:24:8a:07:03:00:66:9b:71 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 171.252.7.25/24 brd 171.252.7.255 scope global ib0
9: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:24:8a:07:03:00:66:9b:72 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.7.25/24 brd 192.168.7.255 scope global ib1
10: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4c:fe:b1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 173.252.7.25/24 brd 173.252.7.255 scope global ib2
11: ib3: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN qlen 256
    link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:4c:fe:b2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
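
Comparing the two listings by eye is error-prone. Here is a small sketch that extracts the IPv4 addresses of a single interface from ip addr output so the two nodes can be compared directly. The helper name iface_ipv4 is hypothetical, and it assumes the "N: name: <flags>" header layout shown above.

```shell
# iface_ipv4: read `ip addr` output on stdin and print the IPv4 addresses
# (in CIDR form) configured on the interface named by the first argument.
iface_ipv4() {
    awk -v ifname="$1" '
        # Interface header line, e.g. "9: ib1: <BROADCAST,...>".
        /^[0-9]+: / { sub(/:$/, "", $2); current = $2; next }
        # Address lines that belong to the interface we were asked about.
        current == ifname && $1 == "inet" { print $2 }
    '
}
```

Usage: ip addr | iface_ipv4 ib1 on each node. With the listings above, node 1 would show 192.168.7.24/24 plus the 169.254.25.35/16 address, while node 2 would show only 192.168.7.25/24. That matches ora.cluster_interconnect.haip being OFFLINE on node 2.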

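The Oracle private IP is configured directly on an InfiniBand (IB) interface, with no bonding. That suggests the Oracle interconnect may be using the RDS protocol. To confirm, check the DB alert log on node 1: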

Cluster communication is configured to use the following interface(s) for this instance
  169.254.25.35
cluster interconnect IPC version:Oracle RDS/IP (generic)
IPC Vendor 1 proto 3
  Version 4.1
Oracle instance running with ODM: Veritas 6.2.0.004 ODM Library, Version 2.0 
Tue Feb 21 22:09:43 2017
PMON started with pid=2, OS id=13407 
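
The relevant line can be pulled straight out of the alert log. This is a minimal sketch, assuming the "cluster interconnect IPC version:" prefix shown above. The function name interconnect_ipc is hypothetical, and the alert log location is not given in this write-up, so it has to be supplied by the reader.

```shell
# interconnect_ipc: print the IPC protocol string recorded at the most recent
# instance startup in the given database alert log.
interconnect_ipc() {
    # Strip the fixed prefix; keep the last match, i.e. the latest startup.
    sed -n 's/^cluster interconnect IPC version:[[:space:]]*//p' "$1" | tail -n 1
}
```

Usage: interconnect_ipc "$ALERT_LOG", where ALERT_LOG is a placeholder for the instance's alert log path. For the excerpt above it prints "Oracle RDS/IP (generic)"; a UDP-based interconnect would typically report a UDP/IP string instead.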

The interconnect is indeed using the RDS protocol rather than UDP. Next, test from node 1 whether node 2 can be reached over RDS.

It cannot be reached. Check on node 2 whether the RDS module is loaded:

[root@xxxx302 ~]# lsmod | grep rds
[root@xxxx302 ~]# 

The kernel has not loaded the RDS module. Checking rdma.conf shows that the RDS_LOAD=yes line is missing:

[root@xxxx302 ~]# vi /etc/rdma/rdma.conf
ONBOOT=yes
RDMA_UCM_LOAD=yes
MTHCA_LOAD=yes
IPOIB_LOAD=yes
SDP_LOAD=yes
MLX4_LOAD=yes
MLX4_EN_LOAD=yes
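
Both checks above can be scripted, for example to compare a healthy node against a failing one. The two helpers below are a sketch with hypothetical names; they rely only on the lsmod output format (module name in the first column) and on the KEY=yes style of rdma.conf shown above.

```shell
# rds_load_enabled: succeed if the given rdma.conf contains an uncommented
# RDS_LOAD=yes line.
rds_load_enabled() {
    grep -Eq '^[[:space:]]*RDS_LOAD=yes[[:space:]]*$' "$1"
}

# rds_module_loaded: read `lsmod` output on stdin and succeed if a module
# named exactly "rds" is listed.
rds_module_loaded() {
    awk '$1 == "rds" { found = 1 } END { exit !found }'
}
```

Usage: rds_load_enabled /etc/rdma/rdma.conf || echo "RDS_LOAD missing", and lsmod | rds_module_loaded || echo "rds not loaded". On node 2 in this case both messages would be printed.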

4. Resolving the fault

The IB card on this machine had failed earlier and was replaced; rdma.conf was probably modified during that work. Add the following line to rdma.conf:

RDS_LOAD=yes
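
If the same fix has to be applied on several hosts, the edit can be made repeatable. This is a sketch rather than the procedure used in this case: the function name ensure_rds_load is hypothetical, it keeps a .bak copy next to the file, and it assumes the file ends with a newline.

```shell
# ensure_rds_load: make sure the given rdma.conf contains RDS_LOAD=yes.
# Running it repeatedly leaves exactly one such line in the file.
ensure_rds_load() {
    conf=$1
    if grep -Eq '^[[:space:]]*RDS_LOAD=' "$conf"; then
        # A setting already exists (for example RDS_LOAD=no): rewrite it.
        sed -i.bak 's/^[[:space:]]*RDS_LOAD=.*/RDS_LOAD=yes/' "$conf"
    else
        # No setting yet: back up the file, then append the line.
        cp "$conf" "$conf.bak"
        printf '%s\n' 'RDS_LOAD=yes' >> "$conf"
    fi
}
```

Usage (as root): ensure_rds_load /etc/rdma/rdma.conf. The setting only takes effect when the RDMA stack is initialized again, which in this case happened at the host reboot.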

After the host was rebooted, GI and the DB started normally.
