random technical thoughts from the Nominet technical team

When dataguard goes bad

Posted by jason on Dec 4th, 2006

We have been using Oracle Data Guard to provide us with a replicated copy of our production database for disaster recovery. After our pain with logical standby, I had hoped that using the more robust and mature physical standby would lead to less pulling out of hair. Mostly it has been very efficient:

select name, value from v$dataguard_stats;

NAME                          VALUE
apply finish time             +00 00:00:00.0
apply lag                     +00 00:00:15
estimated startup time        24
standby has been open         N
transport lag                 +00 00:00:08

However, what I was not expecting was this in the alert log on the standby:

ORA-07445: exception encountered: core dump [kcrarmb()+152] [SIGFPE] [Integer divide by zero][0x0085C0]

This killed the managed recovery process (MRP), which is responsible for applying the redo data from the standby redo logs. Thankfully, redo continued to be sent after a log switch on the primary. However, MRP is also responsible for spotting and resolving archive gaps, so with MRP not running nothing was being applied on the standby, and we were also missing an archive log. This opened up the potential for data loss: even though we had later redo data, if you can’t resolve a gap you are stuck.
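To make the gap problem concrete, here is a minimal Python sketch of the check that gap resolution has to perform: given the log sequence numbers that have arrived on the standby, find the ranges that are missing. The function name and the sample sequence numbers are hypothetical and purely illustrative; on a real standby this information comes from Oracle itself (for example the v$archive_gap view) rather than from anything you compute by hand.

```python
def find_archive_gaps(received_sequences):
    """Return (low, high) ranges of log sequence numbers that are missing
    between the lowest and highest sequence received on the standby."""
    gaps = []
    ordered = sorted(set(received_sequences))
    # Compare each sequence number with its successor; a jump of more
    # than one means the logs in between never arrived.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


if __name__ == "__main__":
    # Hypothetical example: sequence 104 never arrived, but later redo did.
    print(find_archive_gaps([101, 102, 103, 105, 106]))  # [(104, 104)]
```

Redo after the gap cannot be applied until the missing range has been fetched, which is why having later redo on the standby does not protect you by itself.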

Searching on MetaLink returned absolutely no hits. What I liked best, though, was the response of an Oracle engineer when asked about this issue:

“I had a Look into our Knowledge and Report Database, this kind of Error has not been reported before.”

I don’t think we are running anything out of the ordinary, though it is a RAC -> RAC configuration. Oh, and maybe not everyone is running real-time apply, which was new in 10.1.

A fix was fairly easy to come by: restarting the managed recovery process makes it perform gap resolution, and once there is no gap, new redo can happily be applied again. It is quite an annoyance, however, that MRP can’t just restart itself, so close monitoring of the apply lag, and perhaps even of the MRP background process, is really required. It’s no good paying for a standby only to find, on disaster, that it’s out of date!
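As a rough idea of what such monitoring could look like, here is a small Python sketch. It assumes something else has already fetched the rows of v$dataguard_stats into a dictionary and worked out whether an MRP process is present; the function names, the dictionary layout and the 300 second threshold are all assumptions made for illustration, not anything Oracle provides.

```python
import re

# Matches day-to-second interval strings as shown in v$dataguard_stats,
# e.g. "+00 00:00:15" or "+00 00:00:00.0".
_INTERVAL = re.compile(r"^\+(\d+) (\d+):(\d+):(\d+)(?:\.\d+)?$")


def interval_to_seconds(value):
    """Convert '+DD HH:MI:SS' to whole seconds (fractions are dropped)."""
    match = _INTERVAL.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised interval: {value!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def check_standby(stats, mrp_running, max_lag_seconds=300):
    """Return a list of problems found; an empty list means healthy.

    stats           -- dict of name -> value (hypothetical layout)
    mrp_running     -- True if an MRP process was seen on the standby
    max_lag_seconds -- alert threshold, an arbitrary example value
    """
    problems = []
    if not mrp_running:
        problems.append("MRP is not running")
    for name in ("apply lag", "transport lag"):
        lag = interval_to_seconds(stats[name])
        if lag > max_lag_seconds:
            problems.append(f"{name} is {lag}s (limit {max_lag_seconds}s)")
    return problems


if __name__ == "__main__":
    healthy = {"apply lag": "+00 00:00:15", "transport lag": "+00 00:00:08"}
    print(check_standby(healthy, mrp_running=True))  # []
```

Checking for the MRP process separately from the lag is deliberate: the point of the story above is that the process can die on its own, so it seems safer not to rely on the lag figure alone to notice.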
