Recovering Oracle Clusterware after losing a VIP
It is fairly well known that in Oracle 10gR2 Clusterware, before the 10.2.0.3 patchset, there is a hard dependency on the public interface that the Clusterware VIP runs on. Should this interface fail, Oracle will bring down (crash) all database instances, including ASM instances, running on the affected node. Here is how we recovered when, after upgrading some Cisco switches, we lost our VIP interfaces on a RAC cluster:
No instances were running, and the last entry in the instance's alert log was:
ORA-15064: communication failure with ASM instance
ASMB: terminating instance due to error 15064
I had assumed there was a problem with ASM, but the Clusterware log (/opt/oracle/product/crs/log/servername/crsd/crsd.log) showed that the first problem was with the VIP:
2007-02-13 11:05:50.090: [CRSAPP][1520634208]0CheckResource error for ora.server1.vip error code = 1
2007-02-13 11:05:50.092: [CRSRES][1520634208]0In stateChanged, ora.server1.vip target is ONLINE
2007-02-13 11:05:50.093: [CRSRES][1520634208]0ora.server1.vip on server1 went OFFLINE unexpectedly
2007-02-13 11:05:50.093: [CRSRES][1520634208]0StopResource: setting CLI values
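To find the first resource that failed without reading the whole log, the "went OFFLINE unexpectedly" messages can be pulled out with a small filter. This is only a sketch: the function name unexpected_offline is made up for illustration, and it assumes the message layout matches the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not an Oracle tool): reads crsd.log lines on stdin and
# prints "date time resource node" for each unexpected OFFLINE transition.
# Assumes the message layout seen in the excerpt above.
unexpected_offline() {
    awk '/went OFFLINE unexpectedly/ {
        msg = $0
        # Drop everything up to and including the last "]0" prefix,
        # leaving "<resource> on <node> went OFFLINE unexpectedly".
        sub(/^.*\]0/, "", msg)
        split(msg, part, " ")
        print $1, $2, part[1], part[3]
    }'
}
```

It could be run as: unexpected_offline < /opt/oracle/product/crs/log/servername/crsd/crsd.log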
Taking a look at the status of various resources with crs_stat showed several to be offline and some running on the wrong node:
/opt/oracle/product/crs/bin/crs_stat -t -v
Name              Type         R/RA  F/FT  Target   State    Host
----------------------------------------------------------------------
ora.SERVER1.inst  application  0/5   0/0   ONLINE   UNKNOWN  server1
ora.SERVER2.inst  application  0/5   0/0   ONLINE   OFFLINE
ora.SERVER.db     application  0/0   0/1   OFFLINE  OFFLINE
ora....SM1.asm    application  0/5   0/0   ONLINE   OFFLINE
ora....C1.lsnr    application  0/5   0/0   ONLINE   OFFLINE
ora....ac1.gsd    application  0/5   0/0   ONLINE   ONLINE   server1
ora....ac1.ons    application  1/3   0/0   ONLINE   OFFLINE
ora....ac1.vip    application  0/0   0/0   ONLINE   OFFLINE
ora....SM2.asm    application  0/5   0/0   ONLINE   OFFLINE
ora....C2.lsnr    application  0/5   0/0   ONLINE   OFFLINE
ora....ac2.gsd    application  0/5   0/0   ONLINE   ONLINE   server2
ora....ac2.ons    application  0/3   0/0   ONLINE   ONLINE   server2
ora....ac2.vip    application  0/0   0/0   ONLINE   ONLINE   server1
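With more than a handful of resources it is easy to miss one that is down. As a rough sketch, the crs_stat -t -v output can be filtered for resources whose Target is ONLINE but whose State is anything else. The function name not_online is hypothetical, and the column positions assume the -t -v layout shown above (Target in column 5, State in column 6).

```shell
#!/bin/sh
# Hypothetical helper: reads "crs_stat -t -v" output on stdin and prints the
# name and state of every resource that should be ONLINE but is not.
# Assumes the seven-column -t -v layout: Name Type R/RA F/FT Target State Host.
not_online() {
    awk '$2 == "application" && $5 == "ONLINE" && $6 != "ONLINE" {
        print $1, $6
    }'
}
```

It could be run as: /opt/oracle/product/crs/bin/crs_stat -t -v | not_online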
The first thing to do is bring the VIP online on the affected host:
srvctl start nodeapps -n server1
Start the ASM instance:
srvctl start asm -n server1
Clusterware was still showing the instance on server1 in an UNKNOWN state. I could not start it directly, so I had to stop it (even though it was not running) and then start it:
srvctl stop instance -d SERVER -i SERVER1
srvctl start instance -d SERVER -i SERVER1
Finally, you need to start the listener:
srvctl start listener -n server1
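The sequence above can be collected into a small dry-run helper so the steps are issued in the same order each time. This is a sketch only: recovery_plan is a made-up name, it prints the srvctl commands from this post without running them, and the order is simply the one that worked for us, not an Oracle-documented procedure.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not execute) the recovery commands from
# this post for one node, in the order they were run here.
# Arguments: node name, database name, instance name.
recovery_plan() {
    if [ "$#" -ne 3 ]; then
        echo "usage: recovery_plan NODE DATABASE INSTANCE" >&2
        return 1
    fi
    node=$1
    db=$2
    inst=$3
    cat <<EOF
srvctl start nodeapps -n $node
srvctl start asm -n $node
srvctl stop instance -d $db -i $inst
srvctl start instance -d $db -i $inst
srvctl start listener -n $node
EOF
}
```

For the cluster in this post the call would be: recovery_plan server1 SERVER SERVER1. Review the printed commands before running any of them.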
With the latest 10.2.0.3 patchset a lot of this goes away, but the VIP still has to be restarted with the nodeapps command above, and the listener is also stopped, so it must be restarted too. It seems a small network blip on your public VIP interface is still going to cause your RAC cluster some pain, which makes bonding your interfaces seem even more important.

February 22nd, 2007 at 4:20 am
That’s a good example. It’s a bit tricky, and every shop should go through those failures in test and have a procedure for recovering from them.
By the way, you can also use crs_stop with the force option to stop a resource stuck in an UNKNOWN state. Sometimes srvctl is simply useless.
I also wrote about the removal of this VIP dependency in 10.2.0.3, with a few details, in case anyone finds it interesting.
March 26th, 2007 at 2:23 pm
Actually, Alex is wrong. Although crs_stop (with force) may appear to work, the use of crs_stop (and many other crs_ commands) is unsupported by Oracle. Check chapter 14 of the Oracle Clusterware and RAC Admin guide. srvctl is the recommended interface.
July 5th, 2007 at 2:12 pm
Hi.
The crs_xxx tools are definitely not supported by Oracle, as documented in the manual mentioned above.
But if you need to fail the VIP back quickly, just use:
crs_relocate ora.server1.vip
Oleksandr
March 1st, 2008 at 4:27 am
Dear All
Hi
Is there any solution to this problem?
After installing the 10.2.0.3 patch I still have this problem!
Thank you in advance.
bye
March 3rd, 2008 at 8:50 am
Hi MSN,
That is interesting. The problem we saw above was at our business continuity site running 10.2.0.2.
We then upgraded our production site to 10.2.0.3 before we made any changes to the switches there.
We did not encounter the problem with 10.2.0.3 running on RHEL u3 x86-64.
On what platform are you seeing the instance crash when network connectivity is lost?
jason.