random technical thoughts from the Nominet technical team

Recovering Oracle Clusterware after losing a VIP

Posted by jason on Feb 19th, 2007

It is fairly common knowledge that Oracle 10gR2 Clusterware, before the 10.2.0.3 patchset, has a hard dependency on the public interface that the Clusterware VIP runs on. Should this interface fail, Oracle will bring down (crash) all database instances, including ASM instances, running on the affected node. Here is how we recovered after an upgrade of some Cisco switches cost us the VIP interfaces on a RAC cluster:

No instances were running, and the last entry in the instance's alert log was:

ORA-15064: communication failure with ASM instance
ASMB: terminating instance due to error 15064

I had assumed there was a problem with ASM, but the Clusterware log (/opt/oracle/product/crs/log/servername/crsd/crsd.log) showed that the first problem was with the VIP:

2007-02-13 11:05:50.090: [CRSAPP][1520634208]0CheckResource error for ora.server1.vip error code = 1
2007-02-13 11:05:50.092: [CRSRES][1520634208]0In stateChanged, ora.server1.vip target is ONLINE
2007-02-13 11:05:50.093: [CRSRES][1520634208]0ora.server1.vip on server1 went OFFLINE unexpectedly
2007-02-13 11:05:50.093: [CRSRES][1520634208]0StopResource: setting CLI values
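
To find which resource failed first without scrolling through the whole log, something like the following can help. It is only a sketch: the function name offline_events is made up for illustration, and it assumes crsd.log lines have exactly the layout shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: list resources that crsd reported as going OFFLINE
# unexpectedly. Assumes the crsd.log line layout shown in the excerpt above:
#   <date> <time>: [COMPONENT][thread]0<resource> on <host> went OFFLINE unexpectedly
offline_events() {
    # $1: path to crsd.log
    awk '/went OFFLINE unexpectedly/ {
        time = $2; sub(/:$/, "", time)     # drop the trailing colon from the time
        res = $3;  sub(/^.*\]0?/, "", res) # strip the "[CRSRES][thread]0" prefix
        print $1, time, res, $5            # date, time, resource, host
    }' "$1"
}
```

Called as offline_events /opt/oracle/product/crs/log/servername/crsd/crsd.log, it prints one line per event in time order, so the first line is the resource that started the cascade.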

Taking a look at the status of various resources with crs_stat showed several to be offline and some running on the wrong node:

/opt/oracle/product/crs/bin/crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host
----------------------------------------------------------------------
ora.SERVER1.inst application    0/5    0/0    ONLINE    UNKNOWN   server1
ora.SERVER2.inst application    0/5    0/0    ONLINE    OFFLINE
ora.SERVER.db    application    0/0    0/1    OFFLINE   OFFLINE
ora....SM1.asm   application    0/5    0/0    ONLINE    OFFLINE
ora....C1.lsnr   application    0/5    0/0    ONLINE    OFFLINE
ora....ac1.gsd   application    0/5    0/0    ONLINE    ONLINE    server1
ora....ac1.ons   application    1/3    0/0    ONLINE    OFFLINE
ora....ac1.vip   application    0/0    0/0    ONLINE    OFFLINE
ora....SM2.asm   application    0/5    0/0    ONLINE    OFFLINE
ora....C2.lsnr   application    0/5    0/0    ONLINE    OFFLINE
ora....ac2.gsd   application    0/5    0/0    ONLINE    ONLINE    server2
ora....ac2.ons   application    0/3    0/0    ONLINE    ONLINE    server2
ora....ac2.vip   application    0/0    0/0    ONLINE    ONLINE    server1
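
With more resources than this, the listing gets hard to read by eye. A rough filter, assuming the seven-column layout above, prints only the resources whose State does not match their Target. The function name crs_problems is invented for this sketch. It will not spot a VIP that is ONLINE on the wrong node (like the last line above), because the truncated names do not say where the resource belongs.

```shell
#!/bin/sh
# Hypothetical helper: read `crs_stat -t -v` tabular output on stdin and print
# resources whose State differs from their Target.
# Assumed columns: Name Type R/RA F/FT Target State Host
crs_problems() {
    awk '$5 ~ /^(ONLINE|OFFLINE)$/ && $5 != $6 {
        host = ($7 == "" ? "-" : $7)   # Host column is empty for offline resources
        print $1, $5, $6, host         # name, target, state, host
    }'
}
```

It is meant to be used in a pipeline: /opt/oracle/product/crs/bin/crs_stat -t -v | crs_problems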

The first thing to do is bring the VIP online on the affected host:

srvctl start nodeapps -n server1

Start the ASM instance:

srvctl start asm -n server1

Clusterware was still showing the instance on server1 in an UNKNOWN state. I could not start it directly, so I had to stop it (even though it was not running) and then start it:

srvctl stop instance -d SERVER -i SERVER1
srvctl start instance -d SERVER -i SERVER1

Finally, you need to start the listener:

srvctl start listener -n server1
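
For a written recovery procedure, the steps above could be collected into one script. This is a minimal sketch, not something we ran in this form: the function names step and recover_node and the DRY_RUN switch are invented here, and only the srvctl commands themselves come from the steps above. Since I cannot say for certain what exit status srvctl returns when, for example, some node applications are already running, the sketch warns on a non-zero status and carries on rather than aborting.

```shell
#!/bin/sh
# Hypothetical wrapper around the recovery steps described in this post.
# With DRY_RUN=1 each command is printed instead of executed.
step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    elif ! "$@"; then
        # Warn and continue; the next step may still succeed.
        echo "warning: '$*' returned non-zero" >&2
    fi
}

# recover_node <node> <database> <instance>
recover_node() {
    node=$1; db=$2; inst=$3
    step srvctl start nodeapps -n "$node"            # brings the VIP back online
    step srvctl start asm -n "$node"                 # ASM must be up before the instance
    step srvctl stop instance -d "$db" -i "$inst"    # clears the UNKNOWN state
    step srvctl start instance -d "$db" -i "$inst"
    step srvctl start listener -n "$node"
}
```

Setting DRY_RUN=1 and calling recover_node server1 SERVER SERVER1 prints the five commands in order, which is a cheap way to check the arguments before running it for real.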

With the latest 10.2.0.3 patchset a lot of this goes away, but the VIP still has to be restarted with the nodeapps command above, and the listener also gets stopped, so it must be restarted as well. It seems that even a small network blip on your public VIP interface is still going to cause your RAC cluster some pain, which makes bonding your interfaces seem even more important.

5 Responses

  1. Alex Gorbachev Says:

    That’s a good example. It’s a bit tricky, and every shop should go through those failures in test and have a procedure for recovering.

    By the way, you can also use crs_stop with the force option to stop a resource stuck in an UNKNOWN state. Sometimes srvctl will just be useless.

    I also mentioned this new VIP dependency removal in 10.2.0.3 with few details. Just in case someone finds it interesting.

  2. Philip Newlan Says:

    Actually, Alex is wrong. While crs_stop (with force) may appear to work, the use of crs_stop (and many other crs_ commands) is unsupported by Oracle. Check chapter 14 of the Oracle Clusterware and RAC Admin guide. srvctl is the recommended interface.

  3. Oleksandr Denysenko Says:

    Hi.

    crs_xxx tools are definitely not supported by Oracle
    as documented in specified manual.
    But if you need to quickly failback VIP just use
    crs_relocate ora.server1.vip

    Oleksandr

  4. MSN Says:

    Dear All
    Hi
    Is there any solution to solve this problem?
    After installing 10.2.0.3 patch i still have this problem!!!
    Thank you in advance.
    bye

  5. jason Says:

    Hi MSN,

    That is interesting. The problem we saw above was at our business continuity site running 10.2.0.2.

    We then upgraded our production site to 10.2.0.3 before we did any changes to the switches at the production site.

    We did not encounter a problem with 10.2.0.3 running on rhel u3 x86-64.

    On what platform are you seeing the instance crash on loss of network connectivity?

    jason.
