random technical thoughts from the Nominet technical team

Disappointed by Plone

Posted by jay on Aug 30th, 2005

I’ve just spent several evenings learning Plone, which is a Content Management System (CMS) written on top of Zope, which is a web application server written in Python. Most of our software is written in Java and at some point we will switch to a Java-based CMS that integrates well with our web development effort. So even though we have no intention of using Plone, I gave it a look since it has cropped up in a number of different places recently.

On first sight Plone looks good. Creating my first page was simple, the code generated is clean and it works as expected. However things soon deteriorate.

Problems
There are a number of areas in which it appears that Plone has evolved without architectural planning. For example Plone was initially designed for a community portal, where members join and create their own pages, and this heritage still shows. To make it do something else means turning off all of that side before you start.

Another problem is the lack of a single management console. Some things are done in Plone, which has the start of a decent console, but some things are done in the Zope Management Interface (ZMI), which is pretty grim. Then there are some things that can only be done through the filesystem. (Things may be about to improve here as there is a new version of Zope, which is a major rewrite but not yet integrated with Plone. Hopefully this will be a lot better.)

There are lots of unused bits just lying around but without any discernible plan to deprecate them and tidy them up. In other places the system is just plain inconsistent.

On the bright side though, some things are very easy. In particular creating static content directly through Plone is easy. Forms are also easy and straightforward, but as they are created in ZMI their relationship to the structure of the static content is tenuous. On the other hand, creating new content types is difficult and requires much more low-level hacking than a CMS should need.

Books and Documentation
The one thing that really lets it down is the documentation, which is truly awful. Even the books are well below the normal standard of technical books. To help me get up to speed on Plone I read two books, The Definitive Guide to Plone and Building Websites with Plone. Both of these were of much lower quality than I expected.

The first book tries to follow a considered plan but it is just plain shoddy with lots of little failings that make it a nightmare to follow. For example, in one place it describes an operation to be carried out in a particular location, but the screenshots show it in a different place. This book even has mistakes in the code.

The latter book is no better. It covers a much wider range of subjects than the first book, but does so without any understanding of what’s going on in the head of the reader, making it thoroughly confusing. This book is very technical, but in a way that just describes the technology rather than explaining it.

Both of these books look like normal technical books. For example they both try to build an example site using the concepts learnt, but they do it in such a poor fashion they need not have bothered. The ‘half-finished and badly thought out’ impression that I got from these books is exactly the same as the one I get from Plone.

Conclusion
I’m sure Plone can be very good for specific types of site, such as community portals or ones with lots of static content and forms. Plone also has a strong development community so it may one day live up to its promise.

Overall however, Plone is exceptionally difficult to use, has all sorts of issues and is a long way off meeting my expectations of a CMS. That’s not to say that you can’t build a good site in it, but it will take a superhuman effort.

iCalendar for an On Call Rota

Posted by jad on Aug 23rd, 2005

A few weeks ago I wrote a blog post about server and network monitoring systems. I noted at the time that there seemed to be no simple rota management systems. I just wanted something that provides a simple web interface to a calendar that can be used with smstools to control who gets paged. Since then I have been playing with iCalendar files as a way to do this. iCalendar is defined in a series of RFCs: 2445, 2446, 2447 and 3283.

Building an iCalendar server is easy using Apache and mod_dav. There are many discussions on how to do it; the one I looked at can be found here.

There are lots of iCalendar-compatible clients, including iCal on the Mac and Sunbird from Mozilla. Sunbird seems to be more feature rich. In particular, it allows you to upload changes to calendars you are subscribed to, whereas iCal only lets you change calendars you have published.

In order for the network monitoring system to use the calendar and page the correct person you need a bit of Perl. There are several Perl modules that claim to be able to parse iCalendar files. A quick web/CPAN search finds lots of references to Net::iCal and Date::iCal, but neither of these seems to have had any work done on it recently. There are a few more on CPAN and the one with the most recent updates is iCal::Parser. This is an example script showing how easy it is to use iCal::Parser to read an iCalendar file and find out who is on call on the 30th Aug 2005.

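#!/usr/bin/perl -w
# Print the summary of every event (i.e. who is on call) on 30 Aug 2005.
use strict;
use iCal::Parser;

my $file   = "OnCall.ics";
my $parser = iCal::Parser->new();
my $hash   = $parser->parse($file);

# Events are indexed by year, month and day, then keyed by UID.
my $day = $hash->{events}{2005}{8}{30} || {};
while ( my ($uid, $event) = each %$day ) {
    print $event->{SUMMARY} . "\n";
}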

This script is only an example; don’t use it in production :)
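For comparison, the same lookup as the Perl script above can be sketched without any iCalendar library. This is only a rough illustration under stated assumptions: the rota consists of all-day events with DTSTART, DTEND and SUMMARY properties, there are no recurrence rules, folded lines or time zones, and the names in the SAMPLE calendar are made up.

```python
from datetime import date, datetime

def parse_events(ics_text):
    """Return (start, end, summary) tuples for each VEVENT in the text."""
    events, current = [], None
    for raw in ics_text.splitlines():
        line = raw.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            if "DTSTART" in current and "SUMMARY" in current:
                start = current["DTSTART"]
                events.append((start, current.get("DTEND", start), current["SUMMARY"]))
            current = None
        elif current is not None and ":" in line:
            name, value = line.split(":", 1)
            key = name.split(";", 1)[0].upper()  # drop parameters such as ;VALUE=DATE
            if key in ("DTSTART", "DTEND"):
                # Only the date part (YYYYMMDD) is used in this sketch.
                current[key] = datetime.strptime(value[:8], "%Y%m%d").date()
            elif key == "SUMMARY":
                current[key] = value
    return events

def on_call(ics_text, day):
    """Return the summaries of all events covering the given day."""
    found = []
    for start, end, summary in parse_events(ics_text):
        # DTEND is exclusive for all-day events; start == end means one day.
        if start <= day < end or start == end == day:
            found.append(summary)
    return found

# Hypothetical rota: "alice" and "bob" are invented names.
SAMPLE = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
UID:1
DTSTART;VALUE=DATE:20050829
DTEND;VALUE=DATE:20050905
SUMMARY:alice
END:VEVENT
BEGIN:VEVENT
UID:2
DTSTART;VALUE=DATE:20050905
DTEND;VALUE=DATE:20050912
SUMMARY:bob
END:VEVENT
END:VCALENDAR
"""

print(on_call(SAMPLE, date(2005, 8, 30)))
```

A real rota would need a proper parser to cope with recurring events and time zones, which is exactly what a module such as iCal::Parser is for.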

If you wanted a web interface I guess you could play with PHP iCalendar, but I am happy using Sunbird as a client. We are going to try this out for our on-call system and see how well it works…

Deleting keys from a PGP keyserver

Posted by sion on Aug 8th, 2005

The Problem:

We want to maintain a keyserver with the PGP keys of our customers. We should be able to add and delete keys from it as they are added to, and marked as “old” in, a database. There should be no human process required other than maintaining the database (which contains ASCII-armoured keys).

Attempt 1; using the PGPsdk (Version 1.7.1).

We have a script which builds, from scratch, a local keyring according to the database. So the first thing we tried was to modify this script to interact with a keyserver instead.

(ASIDE: There are a number of problems with using the PGPsdk. Routines are, in themselves, fairly well documented. However, there is no real indication as to how it all fits together. So, once you move away from something which is in their example code then you are really on your own.)

Now, in order to be able to delete keys you need an LDAP keyserver. We initially set up the PGP keyserver v7.0 provided by Networks Associates Technology, Inc. on Solaris 8 (and Windows).

Okay, so starting with the ability to add keys, eventually we got something working. However, it would reject some keys when trying to add them to the keyserver. We did not actually get to the bottom of this one, but a quick look suggested that only “PGPsdk 2.0.1” type keys worked… We cannot say this for certain; what we can say though is that ALL the keys were successfully added to a keyfile built at the same time.

Putting that to one side we looked at deleting keys from the keyserver. This proved even less fruitful than adding keys did. The issue here is connecting with sufficient privileges to delete keys, which means connecting as an administrator. From the keyserver point of view it looks like you can assign connections from certain hosts/IP addresses as having administrator privileges; or you can use a shared key.

Despite all attempts we never managed to get this working. The only 2 ways we managed to delete keys were by 1) using the Windows PGP client and 2) taking the nuclear option and deleting the keyserver’s data files. (NOTE: The PGPsdk has changed beyond all recognition between the version that we are using and the current version. We assume that the current keyserver and clients use the new SDK.)

Attempt 2; using the command line.

With the same keyservers we tried using the PGP and GnuPG command lines. Various versions were used but mostly PGP v6.5.8 and GnuPG v1.2.5, both on Linux. Oddly, the PGP command line does not work, as it claims that whatever key you are trying to delete does not exist. The server logs show subtly different traffic for this transaction compared to a (successful) deletion from the Windows client. gpg was even less successful; executing the commands on the same machine as the keyserver did not seem to help either.

We got in contact with the PGP Corporation about the new version of their command line. (Thinking that if it is built with the same version of the PGPsdk as the keyserver that it will have more chance of success.) To date they have not got back in touch (two months at the time of writing).

Attempt 3; fight the power.

Maybe a different keyserver could solve our problems? There are a number of open source keyservers out there; we tried “pks” and “cks” with little success.

There are other keyservers available but you have to draw the line somewhere.

Conclusion.

Because of the way that keyservers synchronise between themselves, deleting keys is often futile; they get written back from other keyserver(s). This has led to little or no requirement for being able to delete keys (there are processes in place for the revocation of keys). So deletion is perhaps the least developed aspect of keyserver functionality.

The simplest way to achieve key deletion (without human intervention) is to delete ALL of the keys and rebuild the keyserver from scratch. This takes about half an hour for our 3,500 keys, which stops us from doing it during work hours.

Patchset Problems

Posted by jason on Aug 8th, 2005

This is the definition of an Oracle patchset as given by Oracle: “A patchset is a tested and integrated set of product fixes. Patch sets provide bug fixes only; they do not include new functionality and they do not require certification on the target system”

We have been attempting to perform an Oracle database upgrade from 10.1.0.3 to 10.1.0.4, that is, applying the 10.1.0.4 patchset. I found the statement above to have been well and truly overlooked in the 10.1.0.4 patchset.

As stated elsewhere, we are running a 2-node Oracle RAC cluster as our primary database platform. This went live around March 2004, at version 9.2.0.4. With the 9i version of Oracle you had to run some form of vendor clusterware to perform the clustering of the host servers. We chose to go with Veritas Storage Foundation for Oracle RAC, version 4.0. Around October we upgraded to 10.1.0.3 and at this time had to upgrade the Veritas version to 4.0 MP1 as this was the supported combination for 10g.

Judging from the definition of a patchset above, I was hoping that installing the 10.1.0.4 patchset would be a relatively painless exercise. Upgrading a RAC cluster first of all involves upgrading the CRS daemons, which are new in 10g. These live in a separate directory structure and have to be upgraded first. You must make sure the servers in the cluster can rsh to each other AND to themselves. You must also run the patchset installer from the node you originally installed from! You can tell which node this is by looking at the order in which the nodes appear in the installer: the first node is in effect the “primary” node and the one on which you must run the patchset installer.

After the CRS upgrade is done you will find that new functionality is included with it, in the form of a whole new daemon that is running. Under 10.1.0.3 we have the following running:

oracle  6949  6847  0   Feb 24 ?        0:00 /var/opt/oracle/product/crs/bin/evmlogger.bin -o /var/opt/oracle/product/crs/ev
oracle  6847     1  0   Feb 24 ?        0:31 /var/opt/oracle/product/crs/bin/evmd.bin
oracle  6901  6848  0   Feb 24 ?        0:00 su -c /var/opt/oracle/product/crs/bin/ocssd  || exit 137
oracle  6906  6901  0   Feb 24 ?       81:47 /var/opt/oracle/product/crs/bin/ocssd.bin
root  6852     1  0   Feb 24 ?       184:37 /var/opt/oracle/product/crs/bin/crsd.bin

But under 10.1.0.4 we have an additional daemon:

oracle 27353  1253  0   Aug 05 ?        0:00 /var/opt/oracle/product/crs/bin/evmlogger.bin -o /var/opt/oracle/product/crs/ev
root  1257     1  0   Aug 03 ?        0:03 /var/opt/oracle/product/crs/bin/crsd.bin
oracle  1253     1  0   Aug 03 ?        0:20 /var/opt/oracle/product/crs/bin/evmd.bin
oracle 27473 27472  0   Aug 05 ?        0:00 /bin/sh -c /var/opt/oracle/product/crs/bin/ocssd  || exit $?
oracle 27472 27429  0   Aug 05 ?        0:00 su -c /bin/sh -c '/var/opt/oracle/product/crs/bin/ocssd  || exit $?'
oracle 27453 27449  0   Aug 05 ?        0:01 /var/opt/oracle/product/crs/bin/oclsmon.bin
oracle 27474 27473  0   Aug 05 ?        4:00 /var/opt/oracle/product/crs/bin/ocssd.bin
oracle 27449 27426  0   Aug 05 ?        0:00 su -c /var/opt/oracle/product/crs/bin/oclsmon  || exit $?
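
Spotting the extra daemon by eye is easy enough with listings this short, but the comparison can also be scripted. The following is a rough sketch, assuming the two process listings have been saved as text and that every CRS binary lives under a crs/bin directory as in the output above; the function names are invented for illustration and the sample data is a shortened copy of the listings.

```python
import posixpath
import re

# Matches any path that lives under a .../crs/bin/ directory.
CRS_BINARY = re.compile(r"/\S*/crs/bin/\S+")

def crs_binaries(ps_output):
    """Return the set of binary names found under crs/bin in ps output."""
    return {posixpath.basename(path) for path in CRS_BINARY.findall(ps_output)}

def new_daemons(before, after):
    """Return the binaries present in 'after' but not in 'before', sorted."""
    return sorted(crs_binaries(after) - crs_binaries(before))

# Shortened copies of the 10.1.0.3 and 10.1.0.4 listings.
BEFORE = """\
oracle  6847     1  0   Feb 24 ?        0:31 /var/opt/oracle/product/crs/bin/evmd.bin
oracle  6906  6901  0   Feb 24 ?       81:47 /var/opt/oracle/product/crs/bin/ocssd.bin
root  6852     1  0   Feb 24 ?       184:37 /var/opt/oracle/product/crs/bin/crsd.bin
"""
AFTER = """\
root  1257     1  0   Aug 03 ?        0:03 /var/opt/oracle/product/crs/bin/crsd.bin
oracle  1253     1  0   Aug 03 ?        0:20 /var/opt/oracle/product/crs/bin/evmd.bin
oracle 27453 27449  0   Aug 05 ?        0:01 /var/opt/oracle/product/crs/bin/oclsmon.bin
oracle 27474 27473  0   Aug 05 ?        4:00 /var/opt/oracle/product/crs/bin/ocssd.bin
"""

print(new_daemons(BEFORE, AFTER))
```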

That being said, the 10.1.0.4 patches of CRS seem to be more robust and appear to start quite a bit faster than in 10.1.0.3. After CRS is upgraded you patch the Oracle database server home directory. This was straightforward enough, but after you have done this you must start up the database and run some SQL to patch the actual database. This proved to be a problem as the database would not start:

SQL> startup
ORA-32004: obsolete and/or deprecated parameter(s) specified
ORA-27546: Oracle compiled against IPC interface version %s.%s found
version %s.%s

While in the alert log I could see:

Oracle instance running with ODM: VERITAS 4.0 ODM Library,
Version 1.1
cluster interconnect IPC library is incompatible
with this version of Oracle
Oracle interface version information 2.4
cluster IPC library version information 2.3

So the 10.1.0.4 patchset has changed the IPC version that it is compatible with. At first I attempted to apply a Veritas patch that was designed for Oracle 9.2.0.6, which changes how stringent the IPC version checking is. This did work and enabled the database to start up, but when I started both nodes with the cluster_database parameter set to TRUE the following ORA-07445 errors appeared in the alert log:

ORA-07445: exception encountered: core dump [ksxpirqh()+4] [SIGSEGV]
[Address not mapped to object] [0x0000000BF] [] []

This was happening with monotonous frequency every 5 minutes. The only option is to upgrade to Storage Foundation for Oracle RAC 4.1.

Another interesting feature of the 10.1.0.4 patchset is that it breaks automatic statistics gathering with the following errors in the alert log:

GATHER_STATS_JOB encountered errors.  Check the trace file.
Wed Jul 27 22:44:27 2005
Errors in file /opt/oracle/admin/NOM/bdump/bdbold_j000_5513.trc:
ORA-00904: "T2"."SYS_DS_ALIAS_2": invalid identifier

Apparently this is fixed in 10.2. Basically the 10.1.0.4 patchset not only includes new functionality and features, it also does not appear to be a well tested set of product fixes.

Oracle 10g client on MacOS X Tiger

Posted by chris on Aug 7th, 2005

I’ve recently got my first ever Mac, a new PowerBook, and I’ve been busy setting things up on it. One thing I needed was database connectivity. At the moment, Oracle’s 10g Client officially only supports MacOS up to version 10.3.6. So to get it working under Tiger, you need to tweak some things. There’s a discussion here:

http://forums.oracle.com/forums/thread.jspa?forumID=134&tstart=0&messageID=&threadID=282377&trange=15984406

Basically it boils down to:

  1. Add an entry for yourself in your /etc/hosts file (mad, I know, but you can remove it afterwards)
  2. Do sudo gcc_select 3.3 before you run the install. This makes it use the right version of gcc.

I did all of this and the install ran without problem. I then had the very odd situation where tnsping worked, but SQL*Plus did not. It turned out to be an issue with line endings. My tnsnames file was still in DOS mode after copying it from my Windows box. I changed to UNIX line endings and it worked…
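
Any tool that converts DOS line endings will do for this. As a minimal sketch, assuming the file is small enough to read into memory and that the only problem is CRLF line endings, the conversion looks like this (the function names are invented for illustration):

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace DOS (CRLF) line endings with UNIX (LF) ones."""
    return data.replace(b"\r\n", b"\n")

def convert_file(path):
    """Rewrite the file in place; return True if anything changed."""
    with open(path, "rb") as fh:
        original = fh.read()
    converted = to_unix_line_endings(original)
    if converted != original:
        with open(path, "wb") as fh:
            fh.write(converted)
    return converted != original
```

Calling convert_file on the copied tnsnames file would rewrite it only if it actually contained DOS line endings.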

Incorrect HELO/EHLO information is widespread

Posted by jay on Aug 6th, 2005

About a year ago we ran a test for a couple of hours to see what incoming emails we could reject because they had incorrect HELO/EHLO commands. Our criteria for rejection were:

  1. If they specified a hostname (FQDN or not) but it did not resolve.
  2. If they specified a hostname which did resolve but the address it resolved to was different from the IP address of the connecting host.
  3. If they specified a hostname that did resolve but was different from the hostname obtained by doing a reverse lookup on the IP address of the connecting host (yes I know this was a bit harsh).
  4. If they specified an IP address instead of a hostname and this was different from the IP address of the connecting host.
  5. If they specified nothing or nothing intelligible.
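
The five criteria can be sketched as a single function. This is not the configuration we ran on the mail server, just an illustration of the logic: the forward and reverse DNS lookups are passed in as the hypothetical callables resolve and reverse so that the rules can be exercised without touching the network, and the addresses and names in the fake tables are documentation examples.

```python
import ipaddress
import re

HOSTNAME = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$")

def check_helo(helo, client_ip, resolve, reverse):
    """Return the number (1-5) of the criterion that rejects, or None to accept.

    resolve(name) returns a set of IP address strings (empty if the name
    does not resolve); reverse(ip) returns a hostname or None.
    """
    helo = (helo or "").strip().rstrip(".")
    if not helo:
        return 5  # nothing specified
    literal = helo[1:-1] if helo.startswith("[") and helo.endswith("]") else helo
    try:
        ip = ipaddress.ip_address(literal)
    except ValueError:
        ip = None
    if ip is not None:
        # Criterion 4: an IP address that is not the connecting address.
        return 4 if ip != ipaddress.ip_address(client_ip) else None
    if not HOSTNAME.match(helo):
        return 5  # nothing intelligible
    addresses = resolve(helo)
    if not addresses:
        return 1  # hostname does not resolve
    if client_ip not in addresses:
        return 2  # resolves, but not to the connecting address
    ptr = reverse(client_ip)
    if ptr is None or ptr.rstrip(".").lower() != helo.lower():
        return 3  # differs from the reverse lookup of the connecting address
    return None

# Fake DNS data for illustration only.
FAKE_DNS = {"mail.example.org": {"192.0.2.10"}, "www.example.org": {"192.0.2.10"}}
FAKE_PTR = {"192.0.2.10": "mail.example.org"}

def fake_resolve(name):
    return FAKE_DNS.get(name.lower(), set())

def fake_reverse(ip):
    return FAKE_PTR.get(ip)

print(check_helo("www.example.org", "192.0.2.10", fake_resolve, fake_reverse))
```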

To our surprise, in just the small test period we had hundreds of rejected emails. We did an analysis of the headers and contacted about fifty of the sources to come up with the following characterisations, in no significant order:

  • a hostname was specified which was our domain name. This was pretty frequent. This seems to be a clear indicator of spam. Doing a reverse lookup on the IP address invariably gives a hostname from an ISP DHCP pool.
  • a hostname was specified that was the hostname of our mail server. This was fairly rare. Again this seems to be a clear indicator of spam.
  • an IP address was specified that was the IP address of our mail server. This was pretty frequent and again this seems to be a clear indicator of spam.
  • a hostname was specified but this was a single label. Normally when the reverse lookup was done the hostname obtained bore no relation to the single label specified.
  • a hostname was specified that ended with ‘.local’. From examination these all seem to have been Windows Domain Controllers (judging by the rest of the hostname).
  • a hostname of ‘localhost’ was specified.
  • an IP address of ‘127.0.0.1’ was specified.
  • a hostname was specified but it was a domain name, with no hostname part. Again there was no correlation between this and the hostname returned by a reverse lookup.
  • a hostname was specified but it could not be resolved. In some cases the hostname given by the reverse lookup was very similar to that specified. This nearly always indicated that the hostname that had been presented was the internal name of the host, which is why it could not resolve.
  • an IP address from a private range (RFC 1918) was specified.
  • a hostname was specified that resolved but was different from the hostname obtained from a reverse lookup on the IP address of the connecting host. Interestingly very little of this was spam; it was nearly always genuine but the specified hostname was from a related domain. Take the example where a company uses .co.uk for the public names but .net.uk for the server names. The mail server specifies the .co.uk name but its address actually resolves on a reverse lookup to a .net.uk name.
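
Several of these characterisations can be recognised from the HELO argument alone, without any DNS lookups. The sketch below covers only those cases; the categories that depend on resolving names are left out. The parameter names and the example domain, mail server name and address are assumptions made for illustration.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def characterise(helo, our_domain, our_mx_name, our_mx_ip):
    """Classify a HELO/EHLO argument using only the argument itself."""
    value = helo.strip().rstrip(".").lower()
    try:
        ip = ipaddress.ip_address(value.strip("[]"))
    except ValueError:
        ip = None
    if ip is not None:
        if ip == ipaddress.ip_address(our_mx_ip):
            return "our mail server's IP address"
        if ip.is_loopback:
            return "loopback address"
        if any(ip in net for net in RFC1918):
            return "private (RFC 1918) address"
        return "other IP address"
    if value == our_domain.lower():
        return "our domain name"
    if value == our_mx_name.lower():
        return "our mail server's hostname"
    if value == "localhost":
        return "localhost"
    if "." not in value:
        return "single label"
    if value.endswith(".local"):
        return ".local name"
    return "other hostname"

# Hypothetical values for our own domain and mail server.
OURS = ("example.org", "mail.example.org", "192.0.2.1")

for sample in ("example.org", "192.0.2.1", "mailhost", "dc01.corp.local", "192.168.1.5"):
    print(sample, "->", characterise(sample, *OURS))
```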

Unfortunately we didn’t have the time to do any statistical analysis of these results but hopefully some time we will get a chance.

Oracle Logical Standby part III

Posted by jason on Aug 1st, 2005

After 1 month of work on the logical standby we were seeing the following in our Oracle alert logs:

04-DEC-04 11:40:59
ORA-16222: automatic Logical Standby retry of last action
04-DEC-04 11:40:59
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 49284 change 182154666 time
12/04/2004 10:20:34
ORA-00334: archived log: '/db5/oradata/NOM/standbyredo2_7.log'
04-DEC-04 11:41:03
ORA-16111: log mining and apply setting up
06-DEC-04 02:31:04
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 522832 change 182659579

This causes the logical standby to stop applying redo, so it lags behind the primary. Our goal for the logical standby is a maximum lag of 1 minute. When it is all working fine, I can see updates from the primary database being available to query in under 20 seconds. But this is no good if it will on occasion be hours behind.

From this error the first thought was some form of disk corruption, but we are running the logical standby on a fully redundant RAID array and no disk errors were being reported.

The next attempt at a fix by Oracle was attributing this problem to Bug:3230985 which apparently is fixed in 10.2 and 10.1.0.4. We requested a backport of this fix to 10.1.0.3, which is what we were running. Six days later however Oracle were claiming that actually Bug: 3230985 was in fact fixed in Oracle version 9.2.0.3.0 which meant that we were now looking at a whole new bug that development would have to find a fix to, rather than just backporting a previous fix.

A couple of weeks now pass with trace files being sent into Oracle support whenever they are asked for and corruption happening with alarming regularity. After a few more days wrangling, the severity is upped to a sev 1 case. Oracle are interested in comparing the redo log file on the primary compared to the standby redo logfile that is getting the corruption errors. The dd command is used to get segments of both redo logfiles around the corruption block that gets mentioned in the alert log.
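
What the dd step does can be sketched as follows. This is an illustration rather than the exact commands we ran: the 512-byte block size, the number of blocks taken either side, and the assumption that the block number reported in the alert log maps directly to a file offset of block number times block size are all assumptions.

```python
def extract_blocks(path, block, before=8, after=8, block_size=512):
    """Read the blocks surrounding the given block number from a file.

    Returns (first_block, data). Roughly equivalent to
    dd if=path bs=block_size skip=first_block count=n.
    """
    first = max(block - before, 0)
    count = block + after + 1 - first
    with open(path, "rb") as fh:
        fh.seek(first * block_size)
        return first, fh.read(count * block_size)
```

Running the same extraction against the primary redo log and the standby redo log gives two byte strings that can be compared directly.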

While investigating the redo corruption the following is observed in the alert log:

Fri Dec 24 22:08:23 2004
LSP0: apply server 2 blocked on server 1
LSP0: [latch free] address=3c481b948, number=3e, tries=0
LSP0: apply server 1 rolled back
LSP0: can't recover from rollback of multi-chunk txn, aborting..
LOGSTDBY Reader P003 pid=31 OS id=23880 stopped
LOGSTDBY Reader P003 pid=31 OS id=23880 stopped
LOGSTDBY Reader P005 pid=33 OS id=23884 stopped
LOGSTDBY Reader P006 pid=34 OS id=23886 stopped
LOGSTDBY Reader P007 pid=35 OS id=23888 stopped
LOGSTDBY Reader P008 pid=36 OS id=23890 stopped
LOGSTDBY Reader P004 pid=32 OS id=23882 stopped

This multi-chunk txn aborting… would also prove to be a residual pain. The standby processes recover from this, but it still causes a time delay between primary in Oxford and Standby database in London.

A few days after Christmas another log corruption episode occurs and the usual trace files and dd of redo logfiles are uploaded to Oracle. A patch was produced on December 30th for what was now Bug: 4002681, “Log apply terminates due to corrupt redo log block header”. Apparently this was a false corruption problem caused by a timing issue. The standby database was attempting to read further into the standby redo log than had actually been written, so it looked like corruption. Upon a restart everything would appear fine, as there was no genuine corruption.

So I thought it was problem solved and applying the patch would fix things. Except that straight away the following appeared in the alert log:

Tue Jan 4 10:06:47 2005
Errors in file /opt/oracle/admin/NOM/bdump/standby1_p000_6738.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 24414 change 204468730
time 01/04/2005 09:17:16
ORA-00334: archived log: '/db4/oradata/NOM/standbyredo1_7.log'
LOGSTDBY status: ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 24414 change 204468730
time 01/04/2005 09:17:16
ORA-00334: archived log: '/db4/oradata/NOM/standbyredo1_7.log'
Tue Jan 4 10:06:48 2005
Errors in file /opt/oracle/admin/NOM/bdump/standby1_lsp0_6731.trc:
ORA-12801: error signaled in parallel query server P000,
instance ld-1:STANDBY1 (1)
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 24414 change 204468730
time 01/04/2005 09:17:16
ORA-00334: archived log: '/db4/oradata/NOM/standbyredo1_7.log'
Tue Jan 4 10:06:48 2005
TLCR process death detected. Shutting down TLCR

The patch 4002681 did however increase the length of time we went between log corruption events. Several weeks now passed as Oracle demanded more of the same information that had already been passed to them, like more trace files and more dd’s of the corrupt redo logfiles. We had quite a few bugs now associated with this problem:

- 07-Dec-2004:
~~~~~~~~~~~~
First analysis showed Bug is absolutely identical to Bug 3230985
Bug label states: Fixed In Ver: 10.2.
Backport requested for 10.1.0.3, approved by BDE
Dev. confirms that this Bug is already included within 10.1.0.3

- 15-DEC-2004:
~~~~~~~~~~~~
A new Bug was opened (Bug 4069374).
This is the one we expect a final solution (if it does not turn
out that any other issue might be the root cause)

- 24-DEC-2004:
~~~~~~~~~~~~
While working on 4069374, BDE suspects that Bug 4002681 might be
a possible candidate for this problem.

- 24-DEC-2004:
~~~~~~~~~~~~
A Backport for 4002681 on top of 10.1.0.3 for Solaris 64bit had
to be requested!

A lot of speculation now occurred as to what exactly was going on:

  • when corruption is signalled in the standby redo logfile, recovery on the standby stops
  • the trigger to restart the recovery seems to be a redo log switch on the primary, at which point the complete redo log file appears to be sent over to the standby and processed
  • the problem is that there is no telling how long it will take before recovery restarts on the standby database, which is supposed to be in sync with the primary with a delay of less than 1 minute
  • the question to address for now (while the standby redo log corruption persists) is how to make sure that recovery on the standby restarts immediately after a corrupted standby redo log file is hit
  • initially the problem recurred within 5 days and currently it seems to recur within 10.5 days; the load on the primary database has not decreased, so that cannot explain the lower frequency
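
A frequency figure like the one in the last point can be derived from the timestamps in the alert log. As a small sketch, assuming the timestamps have already been pulled out of the log in the DD-MON-YY HH:MM:SS form shown at the top of this post and that month names are parsed under an English locale:

```python
from datetime import datetime

def mean_interval_days(timestamps, fmt="%d-%b-%y %H:%M:%S"):
    """Return the mean gap in days between consecutive events, or None."""
    times = sorted(datetime.strptime(t, fmt) for t in timestamps)
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# The two corruption timestamps from the first alert log excerpt.
print(mean_interval_days(["04-DEC-04 11:40:59", "06-DEC-04 02:31:04"]))
```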

It also looked like the redo corruption was occurring only when a buffer full condition on the primary occurred. The redo from the primary is actually buffered when using asynchronous redo shipping and if for some reason the primary fails to send this data for such a period of time that the buffer fills up, the primary database will only then try and send the redo data upon a log switch on the primary. The value of this memory buffer was increased quite substantially on the primary and it became apparent the corruption would also occur without a buffer full condition as well.

A lot of time seemed to pass while Oracle attempted to figure out a resolution to the problem. January turned to February and still no resolution was in sight. By the 25th February Oracle developers claimed to have reproduced the issue, and we were now working on bug: 4130275. Unfortunately though March turned into April before the fix was available. This was a full 3 months after the application of the first patch, when we knew there was still a problem.

However, a week after the 2nd patch was applied we had the following in the alert log:

LSP0: apply server 3 rolled back
LSP0: can't recover from rollback of multi-chunk txn, aborting..
LOGSTDBY Reader P003 pid=32 OS id=21429 stopped
LOGSTDBY Reader P003 pid=32 OS id=21429 stopped
LOGSTDBY Reader P004 pid=33 OS id=21432 stopped
LOGSTDBY Reader P005 pid=34 OS id=21436 stopped
LOGSTDBY Reader P006 pid=35 OS id=21439 stopped
LOGSTDBY Reader P007 pid=36 OS id=21442 stopped
LOGSTDBY Reader P008 pid=37 OS id=21447 stopped

Mon Apr 11 03:01:55 2005

LOGSTDBY status: ORA-16222: automatic Logical Standby
retry of last action

LOGSTDBY status: ORA-16111: log mining and apply setting up

This was obviously the multi-chunk issue we had seen several months previously. It was still causing us a delay, as all the apply processes would be stopped and, even though they restarted, it would take a considerable time to get back up to date again. The last patch did however seem to fix the corruption issues.

More months passed and April became May. The whole point of the standby was to try and run our DAC off it, so that no additional load would be added to the primary server used for the automaton. For this to be viable the standby had to be very much up-to-date, but we could not go more than 1 week without it freezing and lagging behind the primary. No real progress was made in diagnosing this issue: many leads were followed, many trace files and systemstate dumps were uploaded, but no real solution was found. May turned into June. By the end of June, we had been trying to get our logical standby fixed for some 8 months. It was still broken with no real resolution in sight. Oracle seemed to be pinning their hopes on an upgrade to 10.1.0.4. There were some tips as to tuning a logical standby and these did help a little:

DO NOT Change the _MAX_LOG_LOOKBACK parameter

Set Transaction Consistency to NONE

Have a minimum 20 parallel server processes

Set the MAX_SGA parameter to a much larger value than default

We played around a lot with _MAX_LOG_LOOKBACK to try and tune the time a recovery would take once a freeze had occurred, to try and minimise the impact of the problem, but this parameter when set to a very low value just exacerbated the freeze problem and made it occur much more frequently. Transaction consistency has a very large impact on the performance of a logical standby, with NONE being the most performant. The parallel server processes value also increases performance, as does increasing the MAX_SGA, which greatly reduces the chance of transactions being paged out.

Logical standby is a good idea in theory, and when it worked it was highly performant, but until the stability issues are resolved it is of little use in a production environment. I think the greatest disappointment was the length of time taken to diagnose the various issues.
