random technical thoughts from the Nominet technical team

Oracle Logical Standby part III

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 3 out of 5)
Loading ... Loading ...
Posted by jason on Aug 1st, 2005

After 1 month of work on the logical standby we were seeing the following in our Oracle alert logs:

04-DEC-04 11:40:59
ORA-16222: automatic Logical Standby retry of last action
04-DEC-04 11:40:59
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 49284 change 182154666 time
12/04/2004 10:20:34
ORA-00334: archived log: '/db5/oradata/NOM/standbyredo2_7.log'
04-DEC-04 11:41:03
ORA-16111: log mining and apply setting up
06-DEC-04 02:31:04
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 522832 change 182659579

This causes the logical standby to terminate applying redo causing it to lag behind the primary. Our goal for the logical standby is a maximum lag of 1 minute. When it is all working fine, I can see updates from the primary database being available to query in under 20 seconds. But this is no good if it will on ocassion be hours behind.

From this error the first thought was some form of disk corruption, but we are running the logical standby on a fully redundant RAID array and no disk errors were being reported.

The next attempt at a fix by Oracle was attributing this problem to Bug:3230985 which apparently is fixed in 10.2 and 10.1.0.4. We requested a backport of this fix to 10.1.0.3, which is what we were running. Six days later however Oracle were claiming that actually Bug: 3230985 was in fact fixed in Oracle version 9.2.0.3.0 which meant that we were now looking at a whole new bug that development would have to find a fix to, rather than just backporting a previous fix.

A couple of weeks now pass with trace files being sent into Oracle support whenever they are asked for and corruption happening with alarming regularity. After a few more days wrangling, the severity is upped to a sev 1 case. Oracle are interested in comparing the redo log file on the primary compared to the standby redo logfile that is getting the corruption errors. The dd command is used to get segments of both redo logfiles around the corruption block that gets mentioned in the alert log.

While investigating the redo corruption the following is observed in the alert log:

Fri Dec 24 22:08:23 2004
LSP0: apply server 2 blocked on server 1
LSP0: [latch free] address=3c481b948, number=3e, tries=0
LSP0: apply server 1 rolled back
LSP0: can't recover from rollback of multi-chunk txn, aborting..
LOGSTDBY Reader P003 pid=31 OS id=23880 stopped
LOGSTDBY Reader P003 pid=31 OS id=23880 stopped
LOGSTDBY Reader P005 pid=33 OS id=23884 stopped
LOGSTDBY Reader P006 pid=34 OS id=23886 stopped
LOGSTDBY Reader P007 pid=35 OS id=23888 stopped
LOGSTDBY Reader P008 pid=36 OS id=23890 stopped
LOGSTDBY Reader P004 pid=32 OS id=23882 stopped

This multi-chunk txn aborting… would also prove to be a residual pain. The standby processes recover from this, but it still causes a time delay between primary in Oxford and Standby database in London.

A few days after Christmas another log corruption episode occurs and the usual trace files and dd of redo logfiles are uploaded to Oracle. A patch was produced on December 30th, this was now Bug: 4002681 Log apply terminates due to corrupt redo log block header. Apparently this was a false corruption problem due to a timing problem. The standby database was attempting to read further into the standby redo log than had actually been written so it looked like corruption. Upon a restart everything would appear fine, as there was no genuine corruption.

So I thought it was problem solved and applying the patch would fix things. Except that straight away the following appeared in the alert log:

Tue Jan 4 10:06:47 2005
Errors in file /opt/oracle/admin/NOM/bdump/standby1_p000_6738.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 24414 change 204468730
time 01/04/2005 09:17:16
ORA-00334: archived log: '/db4/oradata/NOM/standbyredo1_7.log'
LOGSTDBY status: ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 24414 change 204468730
time 01/04/2005 09:17:16
ORA-00334: archived log: '/db4/oradata/NOM/standbyredo1_7.log'
Tue Jan 4 10:06:48 2005
Errors in file /opt/oracle/admin/NOM/bdump/standby1_lsp0_6731.trc:
ORA-12801: error signaled in parallel query server P000,
instance ld-1:STANDBY1 (1)
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 24414 change 204468730
time 01/04/2005 09:17:16
ORA-00334: archived log: '/db4/oradata/NOM/standbyredo1_7.log'
Tue Jan 4 10:06:48 2005
TLCR process death detected. Shutting down TLCR

The patch 4002681 did however increase the length of time we went between log corruption events happening. Several Weeks now passed as Oracle demanded more of the same information that had already been passed to them, like more trace files and more dd’s of the corrupt redo logfiles. We had quite a few bugs now associated with this problem:

- 07-Dec-2004:
~~~~~~~~~~~~
First analysis showed Bug is absolutely identical to Bug 3230985
Bug label states: Fixed In Ver: 10.2.
Backport requested for 10.1.0.3, approved by BDE
Dev. confirms that this Bug is already included within 10.1.0.3

- 15-DEC-2004:
~~~~~~~~~~~~
A new Bug was opened (Bug 4069374).
This is the one we expect a final solution (if it does not turn
out that any other issue might be the root cause)

- 24-DEC-2004:
~~~~~~~~~~~~
While working on 4069374, BDE suspects that Bug 4002681 might be
a possible candidate for this problem.

- 24-DEC-2004:
~~~~~~~~~~~~
A Backport for 4002681 on top of 10.1.0.3 for Solaris 64bit had
to be requested!

A lot of speculation now occurrred as to what exactly was going on

  • at the moment the corruption in the standby redo logfile is signalled the recovery stops of the standby
  • the trigger to restart the recovery seems to be a redo log switch at the primary, at which it seems that the complete redo log file is sent over to the standby and processed
  • the problem is that it is undetermined how long it will takes before the recovery is restarted at the standby database, as informed it should be in sync with primary with a minimal delay less than 1 minute
  • the question to address at this moment (having still the standby redo log corruption) is how to make sure that the recovery of the standby is restarted direct after having the corrupted standby redo log file
  • the initial frequency of the problem was: within 5 days, the current frequency seems to be within 10.5 days, the load on the primary database has not decreased so cannot be an explanation for the decrease of the frequency

It also looked like the redo corruption was occurring only when a buffer full condition on the primary occurred. The redo from the primary is actually buffered when using asynchronous redo shipping and if for some reason the primary fails to send this data for such a period of time that the buffer fills up, the primary database will only then try and send the redo data upon a log switch on the primary. The value of this memory buffer was increased quite substantially on the primary and it became apparent the corruption would also occur without a buffer full condition as well.

A lot of time seemed to pass while Oracle attempted to figure out a resolution to the problem. January turned to February and still no resolution was in sight. By the 25th February Oracle developers claimed to have reproduced the issue, we were now working on bug: 4130275. Unfortunately though March turned into April before the fix was available, this was a full 3 months after the application of the first patch, when we knew there was still a problem.

However A week after the 2nd patch was applied we had the following in the alert log:

LSP0: apply server 3 rolled back
LSP0: can't recover from rollback of multi-chunk txn, aborting..
LOGSTDBY Reader P003 pid=32 OS id=21429 stopped
LOGSTDBY Reader P003 pid=32 OS id=21429 stopped
LOGSTDBY Reader P004 pid=33 OS id=21432 stopped
LOGSTDBY Reader P005 pid=34 OS id=21436 stopped
LOGSTDBY Reader P006 pid=35 OS id=21439 stopped
LOGSTDBY Reader P007 pid=36 OS id=21442 stopped
LOGSTDBY Reader P008 pid=37 OS id=21447 stopped

Mon Apr 11 03:01:55 2005

LOGSTDBY status: ORA-16222: automatic Logical Standby
retry of last action

LOGSTDBY status: ORA-16111: log mining and apply setting up

Which obviously was the multi-chunking issue we had seen several months previoiusly. This was still causing us a delay as all the apply process would be stopped and even though it restarted it would take a considerable time to get back up to date again. The last patch did however seem to fix the corruption issues.

More months passed and April became May. The whole point of the standby was try and run our DAC of so that any additional load would not be added to the primary server used for the automaton. For this to be viable the standby had to be very much up-to-date, but we could not go more than 1 week without freezing and lagging behind the primary. No real progress was made in diagnosing this issue, many leads were followed, many trace files and systemstate dumps were uploaded but no real solution was found. May turned into June. By the end June, we had been trying to get our logical standby fixed for some 8 months. It was still broken with no real resolution in sight. Oracle seemed to be pinning their hopes on an upgrade to 10.1.0.4. There were some tips as to tuning a logical standby and these did help a little:

DO NOT Change the _MAX_LOG_LOOKBACK parameter

Set Transaction Consistency to NONE

Have a minimum 20 parallel server processes

Set the MAX_SGA parameter to a much larger value than default

We played around a lot with _MAX_LOG_LOOKBACK to try and tune the time a recovery would take once a freeze had occured to try and minimise the impact of the problem, but this parameter when set to a very low value just exacerbated the freeze problem and made it occur much more frequently. Transaction consistency has a very large impact on the performance of a logical standby, with NONE being the most performant. The parallel server processes value also increases performance as does increasing the MAX_SGA, which greatly reduces the chance of transactions being paged out.

Logical standby is a good idea in theory, and when it worked was highly performant, but until the stability issues are resolved is of little use in a production environment. I think the greatest dissappointment was the length of time taken to diagnose the various issues.

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.

Recent Posts

Highest Rated

Categories

Archives

Meta: