Oracle Logical Standby part II
Previously I talked about the transition to 10g and the initial setting up of a logical standby. Getting up and running was NOT the hard part. On a logical standby database you have the following processes that take the shipped log information and apply them to the database:
- coordinator
- reader
- builder
- preparer
- analyzer
- applier
A good diagram of the whole process is available here
The first problems that were encountered were that the apply processes kept dying, but the main problem seemed to be the same archive log being resent over and over again from the primary:
Fri Nov 19 09:07:00 2004 RFS[7]: Archived Log: ‘/var/opt/oracle/NOM/arch/arch1789519677362.arc’ Committing creation of archivelog ‘/var/opt/oracle/NOM/arch/arch1789519677362.arc’ RFS[7]: Archived Log: ‘/var/opt/oracle/NOM/arch/arch1788519677362.arc’ Committing creation of archivelog ‘/var/opt/oracle/NOM/arch/arch1788519677362.arc’ Fri Nov 19 09:09:07 2004 RFS[7]: Archived Log: ‘/var/opt/oracle/NOM/arch/arch1789519677362.arc’ Committing creation of archivelog ‘/var/opt/oracle/NOM/arch/arch1789519677362.arc’ RFS[7]: Archived Log: ‘/var/opt/oracle/NOM/arch/arch1788519677362.arc’ Committing creation of archivelog ‘/var/opt/oracle/NOM/arch/arch1788519677362.arc’
This continued to happen for several days, but it was an easy fix in the end and involved registering these logfiles with the standby, it seemed like the standby was convinced it did not have them. Of course one of the big selling points of 10g Dataguard is that archive gaps all get sorted automatically by the FAL process.
It did not take long for the next problem to arrive. The next issue to hit us would become a very familiar situation. Everything would appear to be working and all oracle processes would be running as you would expect. Data from the primary would continue to be shipped and the standby would happily recieve it but nothing would be getting applied:
COORDINATOR 169791268 ORA-16116: no work available READER 169711197 ORA-16127: stalled waiting for additional transactions to be applied BUILDER 169710932 ORA-16127: stalled waiting for additional transactions to be applied PREPARER 169710929 ORA-16127: stalled waiting for additional transactions to be applied ANALYZER 169710932 ORA-16117: no work available APPLIER ORA-16116: no work available APPLIER ORA-16116: no work available APPLIER ORA-16116: no work available APPLIER ORA-16116: no work available APPLIER ORA-16116: no work available
It would stall like for hour upon hour, regardless of whether a log switch occurred or not.
There are several escalation levels for an oracle tar (technical assistance request) most non critical issues, i.e. non corruption/database down issues are given a level 2. In theory level 1 will be acted upon 24 hours a day, 7 days a week until a resolution is provided, the tar will move from europe to usa and then to india as each country starts work for the day. However, Oracle managed to get this latest tar stuck in the us time zone, at severity 2 so updates would occur only once america woke up. I asked for it to be moved to europe, so Oracle move it to europe after europe closes down for the evening, so days seem to passe with the tar being owned by a region not actually on shift.
Oracle now postulated several theories, a flush of shared pool, a shutdown/restart of the database, fiddling with eagersize which is a hidden parameter affecting the number of rows that can be modified by a single transaction, apparently more of a problem in 9i under 10g touch at your peril.
The focus shifted to the sql statements that were being applied to the standby and another theory to bite the dust was that it was due to a bug (fixed in 10.1.0.4): Bug:2766894 with a workaround of creating a unique index on the problematic table that exactly matches the primary key constraint. So when you create a primary key an index is created. A primary key by definition must be UNIQUE, but oracle were telling me to create another unique index on the same columns of this table. The following was what happened:
SQL> create unique index dataguardpk on domains(key,sld,instance) tablespace hidraindx; create unique index dataguardpk on domains(key,sld,instance) tablespace hidraindx * ERROR at line 1: ORA-01408: such column list already indexed
What a surprise! This idea was quickly dropped.
At this point Oracle basically gave up with trying to get the standby applying, and decided the best way to proceed was skip the transaction we were stalled on and reinstantiate t he table. I decided it was just as quck to recreate the whole standby from fresh. The standby was recreated on the 26th November 2004 from fresh and to begin with it seemed happy again.
However by the 29th November it was as shafted as the previous incarnation. More fiddling with eagersize was requested and then large transactions were blamed. Also a possible issue was identified in Metalink Note:249361.1
Another parameter that was moddified was the maxloglookback parameter which determines how far back in the redo log the standby will look, again NOT something I advise changing, though this did enable the standby to get going again, for a short time.
By the 1st December Oracle were muttering darkly about corrupted logminer dictionary and expressing that we should again recreate the standby. At this point I kept telling them that I needed to know the underlying cause and it was not too good to have to recreate the logical standby every week.
We did recreate the logical standby for a third time, this time I insisted on having a support engineer view the whole process via Oracle Collaborative Support so we could be ceratain that I was properly creating the logical standby, which they confirmed. This “oracle approved” logical standby was caught up in the same issues as previous incarnations, but then on the 2nd December someone had the idea of turning off resourcemanager on the standby.
Tip #2 DO NOT run the logical standby with resourcemanagerplan set
Finally this produced some success and we started applying for a while, at this point we became aware of a non show-stopping bug Bug:3764009 that caused many, many timeout warnings to appear in the alert log. But we were up and running and successfully applying redo for a couple of days.
However on the 4th December the following appeared in the logs:
04-DEC-04 11:40:59 ORA-16222: automatic Logical Standby retry of last action 04-DEC-04 11:40:59 ORA-00354: corrupt redo log block header ORA-00353: log corruption near block 49284 change 182154666 time 12/04/2004 10:20:34 ORA-00334: archived log: ‘/db5/oradata/NOM/standbyredo2_7.log’
The logical standby was once frozen and not applying redo. This log corruption issue was to become a major headache.

