RAC Troubleshooting Notes

1. Fault Description

Maintenance staff reported that the database service would not start. I checked the resource status, and that was indeed the case.

$ crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       xxxxxxxxx01                                  
               ONLINE  ONLINE       xxxxxxxxx02                                  
ora.asm
               OFFLINE OFFLINE      xxxxxxxxx01              Instance Shutdown   
               OFFLINE OFFLINE      xxxxxxxxx02                                  
ora.gsd
               OFFLINE OFFLINE      xxxxxxxxx01                                  
               OFFLINE OFFLINE      xxxxxxxxx02                                  
ora.net1.network
               ONLINE  ONLINE       xxxxxxxxx01                                  
               ONLINE  ONLINE       xxxxxxxxx02                                  
ora.ons
               ONLINE  ONLINE       xxxxxxxxx01                                  
               ONLINE  ONLINE       xxxxxxxxx02                                  
ora.registry.acfs
               OFFLINE OFFLINE      xxxxxxxxx01                                  
               OFFLINE OFFLINE      xxxxxxxxx02                                  
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE                                                   
ora.xxxxxxx.xxxxxxx_media1.svc
      1        ONLINE  OFFLINE                                                   
ora.xxxxxxx.xxxxxxx_media2.svc
      1        ONLINE  OFFLINE                                                   
ora.xxxxxxx.db
      1        ONLINE  OFFLINE                               Instance Shutdown   
      2        ONLINE  ONLINE       xxxxxxxxx02              Open                
ora.xxxxxxxxx01.vip
      1        ONLINE  OFFLINE                                                   
ora.xxxxxxxxx02.vip
      1        ONLINE  ONLINE       xxxxxxxxx02                                  
ora.cvu
      1        OFFLINE OFFLINE                                                   
ora.oc4j
      1        ONLINE  OFFLINE                                                   
ora.scan1.vip
      1        ONLINE  OFFLINE      

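The cluster is clearly unhealthy: the instance on node 1 is down, while the instance on node 2 is up.
The alert log for the node 1 database shows that the instance went down last year and was never restarted. The application on this database had gone live quietly, and we were contacted only once a problem made it unusable.
Because there was no monitoring, nobody knew what state the instance was in.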
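
Since the underlying problem was an unmonitored instance, a small periodic check (from cron, for example) could have flagged it months earlier. The following is a minimal sketch, assuming the 11.2 tabular layout of `crsctl status res -t` shown above. `find_down_resources` is a hypothetical helper name, not an Oracle tool, and it only reports resources whose TARGET is ONLINE while their STATE is something else.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `crsctl status res -t` output on stdin and
# prints "<resource> <state>" for every entry that should be up but is not.
find_down_resources() {
  awk '
    # A line starting with "ora." names the resource for the rows below it.
    /^ora\./ { res = $1; next }
    res != "" {
      # Cluster resource rows start with an instance number, local rows do not.
      i = ($1 ~ /^[0-9]+$/) ? 2 : 1
      if ($i == "ONLINE" && $(i + 1) != "ONLINE")
        print res, $(i + 1)
    }
  '
}
```

A possible use is `crsctl status res -t | find_down_resources`, sending the output by mail or to an alerting system whenever it is non-empty.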

2. Restarting the Service

$ srvctl stop service -d xxxxxxx -s xxxxxxx_media1 -f
PRCR-1005 : Resource ora.xxxxxxx.xxxxxxx_media1.svc is already stopped

$ export SRVM_TRACE=true
$ srvctl start service -d xxxxxxx -s xxxxxxx_media1
[main] [ 2017-01-12 11:46:34.519 CST ] [OPSCTLDriver.setInternalDebugLevel:198]  tracing is true at level 2 to file null
[main] [ 2017-01-12 11:46:34.521 CST ] [OPSCTLDriver.main:141]  SRVCTL arguments : args[0]=start args[1]=service args[2]=-d args[3]=xxxxxxx args[4]=-s args[5]=xxxxxxx_media1 
[main] [ 2017-01-12 11:46:34.559 CST ] [Version.isPre:592]  version to be checked 11.2.0.4.0 major version to check against10
[main] [ 2017-01-12 11:46:34.560 CST ] [Version.isPre:603]  isPre.java: Returning FALSE
......
[main] [ 2017-01-12 11:46:35.335 CST ] [CRSNative.internalStartResource:376]  About to start resource: Name: ora.xxxxxxx.xxxxxxx_media1.svc, force:true node: null, options: 0, filter null
[main] [ 2017-01-12 11:46:35.357 CST ] [CRSNativeResult.addLine:106]  callback: ora.xxxxxxx.xxxxxxx_media1.svc true CRS-0215: Could not start resource 'ora.xxxxxxx.xxxxxxx_media1.svc 1 1'.
[main] [ 2017-01-12 11:46:35.360 CST ] [CRSNativeResult.addComp:162]  add comp: name ora.xxxxxxx.xxxxxxx_media1.svc, rc 223, msg CRS-0223: Resource 'ora.xxxxxxx.xxxxxxx_media1.svc' has placement error.
[main] [ 2017-01-12 11:46:35.361 CST ] [CRSNative.internalStartResource:389]  Failed to start resource: Name: ora.xxxxxxx.xxxxxxx_media1.svc, node: null, filter: null, msg CRS-0215: Could not start resource 'ora.xxxxxxx.xxxxxxx_media1.svc 1 1'.
PRCD-1084 : Failed to start service xxxxxxx_media1
PRCR-1079 : Failed to start resource ora.xxxxxxx.xxxxxxx_media1.svc
CRS-0215: Could not start resource 'ora.xxxxxxx.xxxxxxx_media1.svc 1 1'.

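The service did not come up after the restart, and the related entries in crsd.log gave no clear indication of the cause. Starting it explicitly on node 2, where the instance was healthy, failed as well.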

$ srvctl start service -d xxxxxxx -s xxxxxxx_media1 -i xxxxxxx2
[main] [ 2017-01-12 11:49:24.884 CST ] [OPSCTLDriver.setInternalDebugLevel:198]  tracing is true at level 2 to file null
[main] [ 2017-01-12 11:49:24.886 CST ] [OPSCTLDriver.main:141]  SRVCTL arguments : args[0]=start args[1]=service args[2]=-d args[3]=xxxxxxx args[4]=-s args[5]=xxxxxxx_media1 args[6]=-i args[7]=xxxxxxx2 
[main] [ 2017-01-12 11:49:24.926 CST ] [Version.isPre:592]  version to be checked 11.2.0.4.0 major version to check against10
[main] [ 2017-01-12 11:49:24.926 CST ] [Version.isPre:603]  isPre.java: Returning FALSE
[main] [ 2017-01-12 11:49:24.929 CST ] [OCR.loadLibrary:312]  
......
[main] [ 2017-01-12 11:49:25.829 CST ] [CRSNative.getStat:1449]  looking for following attributes:
[main] [ 2017-01-12 11:49:25.830 CST ] [CRSNative.getStat:1451]                 'DATABASE_TYPE'
[main] [ 2017-01-12 11:49:25.849 CST ] [CRSNative.getStat:1464]  crs found 1 attributes
[main] [ 2017-01-12 11:49:25.850 CST ] [CRSNative.getStat:1474]         Name: 'DATABASE_TYPE'; Value: 'RAC'
[main] [ 2017-01-12 11:49:25.851 CST ] [CRSNative.internalStartResource:376]  About to start resource: Name: ora.xxxxxxx.xxxxxxx_media1.svc, force:true node: xxxxxxxxx02, options: 0, filter null
[main] [ 2017-01-12 11:49:25.871 CST ] [CRSNativeResult.addLine:106]  callback: ora.xxxxxxx.xxxxxxx_media1.svc true CRS-0215: Could not start resource 'ora.xxxxxxx.xxxxxxx_media1.svc 1 1'.
[main] [ 2017-01-12 11:49:25.875 CST ] [CRSNativeResult.addComp:162]  add comp: name ora.xxxxxxx.xxxxxxx_media1.svc, rc 223, msg CRS-0223: Resource 'ora.xxxxxxx.xxxxxxx_media1.svc' has placement error.
[main] [ 2017-01-12 11:49:25.876 CST ] [CRSNative.internalStartResource:389]  Failed to start resource: Name: ora.xxxxxxx.xxxxxxx_media1.svc, node: xxxxxxxxx02, filter: null, msg CRS-0215: Could not start resource 'ora.xxxxxxx.xxxxxxx_media1.svc 1 1'.
PRCR-1013 : Failed to start resource ora.xxxxxxx.xxxxxxx_media1.svc
PRCR-1064 : Failed to start resource ora.xxxxxxx.xxxxxxx_media1.svc on node xxxxxxxxx02
CRS-0215: Could not start resource 'ora.xxxxxxx.xxxxxxx_media1.svc 1 1'.
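
With SRVM_TRACE enabled the output is long, and the useful part is the handful of message codes buried in it. As a rough sketch, a filter like the one below reduces a trace to its distinct CRS-/PRCx- codes. `extract_error_codes` is a hypothetical name, and the pattern is an assumption based only on the codes that appear in the traces above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct CRS-nnnn / PRCx-nnnn codes found
# on stdin or in the files given as arguments.
extract_error_codes() {
  grep -oE '(CRS|PRC[A-Z])-[0-9]+' "$@" | sort -u
}
```

For example: `srvctl start service -d xxxxxxx -s xxxxxxx_media1 2>&1 | extract_error_codes`. Here that leaves CRS-0215 and CRS-0223 (placement error) next to the PRCD/PRCR wrappers, which points at the resource's placement rather than at the service itself.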

3. Trying to Fix the VIP

At this point I noticed that node 1's VIP was also in trouble. Several attempts to start the VIP all failed.

$ srvctl stop vip -n xxxxxxxxx01 -f
$ srvctl start vip -n xxxxxxxxx01
PRCR-1079 : Failed to start resource ora.xxxxxxxxx01.vip
CRS-0215: Could not start resource 'ora.xxxxxxxxx01.vip 1 1'.

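After stopping CRS on node 1, I found that the VIP was still bound to the network interface, which was odd. With CRS stopped, I removed the VIP from node 1 by hand: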

# ip addr del xxx.xxx.13.94 dev eth2
Warning: Executing wildcard deletion to stay compatible with old scripts.
         Explicitly specify the prefix length (xxx.xxx.13.94/32) to avoid this warning.
         This special behaviour is likely to disappear in further releases,
         fix your scripts!

# ip addr del xxx.xxx.13.96 dev eth2
Warning: Executing wildcard deletion to stay compatible with old scripts.
         Explicitly specify the prefix length (xxx.xxx.13.96/32) to avoid this warning.
         This special behaviour is likely to disappear in further releases,
         fix your scripts!
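
The warning above appears because no prefix length was given. One way to avoid it is to look up the prefix the address is actually configured with and pass that to `ip addr del`. A minimal sketch, assuming the one-line output format of `ip -o -4 addr show`; `cidr_for_ip` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# the address/prefix entry that matches the IP given as the first argument.
cidr_for_ip() {
  awk -v ip="$1" '$3 == "inet" { split($4, a, "/"); if (a[1] == ip) print $4 }'
}
```

Usage would look like `ip -o -4 addr show dev eth2 | cidr_for_ip xxx.xxx.13.94`, followed by `ip addr del <printed value> dev eth2` once the value has been checked by eye.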

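I restarted CRS again, but the VIP still would not start. I skipped a step here: I should have pinged the gateway, but at the time I could not remember the Linux command for displaying it. For the record: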

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
xxx.xxx.13.0    *               255.255.255.128 U     0      0        0 eth2
192.168.13.0    *               255.255.255.0   U     0      0        0 eth3
link-local      *               255.255.0.0     U     0      0        0 eth3
default         xxx.xxx.13.1    0.0.0.0         UG    0      0        0 eth2

The default entry is the gateway. When a VIP fails to start, a likely cause is that the gateway cannot be pinged.
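
The gateway check that was skipped can be scripted so it does not depend on memory next time. A minimal sketch, assuming the `default via <gateway> dev <iface>` line format printed by `ip route show`; `default_gateway` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ip route show` output on stdin and prints the
# gateway address of the first default route.
default_gateway() {
  awk '$1 == "default" && $2 == "via" { print $3; exit }'
}
```

For example: `gw=$(ip route show | default_gateway) && ping -c 3 "$gw"`. Running `route -n` instead of `route` also prints the gateway numerically rather than as a resolved name.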

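Because the database was already unusable, I rebooted both hosts to restore service quickly. After the reboot, the database was back to normal.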

4. Post-Incident Analysis

Going through crsd.log afterwards, I found the following entries:

2017-01-12 11:55:45.463: [UiServer][1799329536]{1:41024:56} Sending message to PE. ctx= 0x7ff9e000ad60, Client PID: 988
2017-01-12 11:55:45.464: [UiServer][1799329536]{1:41024:56} Master is not known. Rejecting the command: 5
2017-01-12 11:55:45.475: [UiServer][1797228288] CS(0x7ff9e4007260)set Properties ( root,0x7ffa180dabc0)
2017-01-12 11:55:45.485: [UiServer][1799329536]{1:41024:57} Sending message to PE. ctx= 0x7ff9e0009de0, Client PID: 988
2017-01-12 11:55:45.485: [UiServer][1799329536]{1:41024:57} Master is not known. Rejecting the command: 6

This may be related to: crsd stuck in intermediate state and crsd.log repeats “Master is not known” due to network issue (Doc ID 1587260.1).
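
To see when the rejections started and how often they occurred, the matching lines can be bucketed by minute. A minimal sketch, assuming every crsd.log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpt above; `master_unknown_per_minute` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts "Master is not known" lines per minute, reading
# from the files given as arguments or from stdin.
master_unknown_per_minute() {
  awk '/Master is not known/ { count[substr($0, 1, 16)]++ }
       END { for (m in count) print m, count[m] }' "$@" | sort
}
```

On 11.2 the log normally sits under the Grid home at `log/<hostname>/crsd/crsd.log`, so a call might look like `master_unknown_per_minute "$GRID_HOME/log/$(hostname)/crsd/crsd.log"`, where `GRID_HOME` is an assumed variable pointing at the Grid Infrastructure home.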
