random technical thoughts from the Nominet technical team

Oracle Logical Standby part II

Posted by jason on Jul 13th, 2005

Previously I talked about the transition to 10g and the initial setting up of a logical standby. Getting up and running was NOT the hard part. On a logical standby database you have the following processes that take the shipped log information and apply it to the database:

  • coordinator
  • reader
  • builder
  • preparer
  • analyzer
  • applier

A good diagram of the whole process is available here

The first problem we encountered was that the apply processes kept dying, but the main problem seemed to be the same archive logs being resent over and over again from the primary:

Fri Nov 19 09:07:00 2004
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
Committing creation of archivelog '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1788519677362.arc'
Committing creation of archivelog '/var/opt/oracle/NOM/arch/arch1788519677362.arc'
Fri Nov 19 09:09:07 2004
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
Committing creation of archivelog '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1788519677362.arc'
Committing creation of archivelog '/var/opt/oracle/NOM/arch/arch1788519677362.arc'

This continued to happen for several days, but it was an easy fix in the end: the standby seemed convinced it did not have these logfiles, so the fix involved registering them with the standby. Of course, one of the big selling points of 10g Dataguard is that archive gaps all get sorted automatically by the FAL process.
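A resend loop like this is easy to miss in a busy alert log. The following is a minimal sketch, assuming the alert log text has already been read into a string and that RFS lines look like the excerpt above; the function name `repeated_archive_logs` is made up for illustration and is not part of any Oracle tooling.

```python
import re
from collections import Counter

# Matches the file name in lines such as
#   RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
# Both straight and typographic quotes are accepted, since copied logs vary.
ARCHIVED_LOG = re.compile(r"RFS\[\d+\]: Archived Log: ['‘’]([^'‘’]+)['‘’]")


def repeated_archive_logs(alert_log_text):
    """Return {archive log path: times received} for logs seen more than once."""
    counts = Counter(ARCHIVED_LOG.findall(alert_log_text))
    return {name: seen for name, seen in counts.items() if seen > 1}


sample = """\
Fri Nov 19 09:07:00 2004
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1788519677362.arc'
Fri Nov 19 09:09:07 2004
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1789519677362.arc'
RFS[7]: Archived Log: '/var/opt/oracle/NOM/arch/arch1788519677362.arc'
"""

for path, seen in repeated_archive_logs(sample).items():
    print(path, "received", seen, "times")
```

Any path reported here is a candidate for the kind of resend loop described above; it says nothing about why the standby thinks the log is missing.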

It did not take long for the next problem to arrive, and it would become a very familiar situation. Everything would appear to be working and all Oracle processes would be running as you would expect. Data from the primary would continue to be shipped and the standby would happily receive it, but nothing would be getting applied:

COORDINATOR 169791268 ORA-16116: no work available
READER 169711197 ORA-16127: stalled waiting for additional transactions to be applied
BUILDER 169710932 ORA-16127: stalled waiting for additional transactions to be applied
PREPARER 169710929 ORA-16127: stalled waiting for additional transactions to be applied
ANALYZER 169710932 ORA-16117: no work available
APPLIER ORA-16116: no work available
APPLIER ORA-16116: no work available
APPLIER ORA-16116: no work available
APPLIER ORA-16116: no work available
APPLIER ORA-16116: no work available

It would stall like this for hour upon hour, regardless of whether a log switch occurred or not.
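Spotting this state by eye gets tedious when it lasts for hours. Here is a minimal sketch, assuming a status listing in the same shape as the one above has been captured as text; `appliers_all_idle` is a hypothetical helper, and idle appliers only indicate a stall when redo is still arriving from the primary.

```python
import re

# ORA-16116 is the "no work available" status shown in the listing above.
NO_WORK = "ORA-16116"


def appliers_all_idle(status_text):
    """True if the listing has APPLIER rows and every one reports ORA-16116."""
    codes = []
    for line in status_text.splitlines():
        if line.strip().startswith("APPLIER"):
            match = re.search(r"ORA-\d{5}", line)
            codes.append(match.group(0) if match else None)
    # An empty listing is not treated as a stall: there is nothing to judge.
    return bool(codes) and all(code == NO_WORK for code in codes)
```

Polling this every few minutes and alerting when it stays true while new archive logs keep arriving would flag the condition without someone watching the listing.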

There are several escalation levels for an Oracle TAR (technical assistance request). Most non-critical issues, i.e. those not involving corruption or a database being down, are given level 2. In theory a level 1 TAR will be acted upon 24 hours a day, 7 days a week until a resolution is provided: the TAR moves from Europe to the USA and then to India as each region starts work for the day. However, Oracle managed to get this latest TAR stuck in the US time zone at severity 2, so updates would occur only once America woke up. I asked for it to be moved to Europe, so Oracle moved it to Europe after Europe had closed down for the evening, and days seemed to pass with the TAR being owned by a region not actually on shift.

Oracle now postulated several theories: a flush of the shared pool, a shutdown and restart of the database, and fiddling with eagersize, a hidden parameter affecting the number of rows that can be modified by a single transaction. This is apparently more of a problem in 9i; under 10g, touch it at your peril.

The focus shifted to the SQL statements that were being applied to the standby. Another theory to bite the dust was that the stall was due to Bug:2766894 (fixed in 10.1.0.4), with a workaround of creating a unique index on the problematic table that exactly matches the primary key constraint. Now, when you create a primary key an index is created, and a primary key by definition must be UNIQUE, yet Oracle were telling me to create another unique index on the same columns of this table. The following was what happened:

SQL> create unique index dataguardpk on domains(key,sld,instance)
tablespace hidraindx;
create unique index dataguardpk on domains(key,sld,instance)
tablespace
hidraindx
*
ERROR at line 1:
ORA-01408: such column list already indexed

What a surprise! This idea was quickly dropped.

At this point Oracle basically gave up trying to get the standby applying, and decided the best way to proceed was to skip the transaction we were stalled on and reinstantiate the table. I decided it was just as quick to recreate the whole standby from fresh. The standby was recreated on the 26th November 2004 and to begin with it seemed happy again.

However by the 29th November it was as shafted as the previous incarnation. More fiddling with eagersize was requested and then large transactions were blamed. Also a possible issue was identified in Metalink Note:249361.1

Another parameter that was modified was maxloglookback, which determines how far back in the redo log the standby will look. Again, this is NOT something I advise changing, though it did enable the standby to get going again for a short time.

By the 1st December Oracle were muttering darkly about a corrupted logminer dictionary and suggesting that we should again recreate the standby. At this point I kept telling them that I needed to know the underlying cause, and that having to recreate the logical standby every week was not good enough.

We did recreate the logical standby for a third time. This time I insisted on having a support engineer view the whole process via Oracle Collaborative Support so we could be certain that I was creating the logical standby properly, which they confirmed. This “oracle approved” logical standby was caught up in the same issues as previous incarnations, but then on the 2nd December someone had the idea of turning off the Resource Manager on the standby.

Tip #2 DO NOT run the logical standby with resource_manager_plan set

Finally this produced some success and we started applying for a while. At this point we became aware of a non-show-stopping bug, Bug:3764009, that caused many, many timeout warnings to appear in the alert log. But we were up and running and successfully applying redo for a couple of days.

However on the 4th December the following appeared in the logs:

04-DEC-04 11:40:59
ORA-16222: automatic Logical Standby retry of last action
04-DEC-04 11:40:59
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 49284 change 182154666 time 12/04/2004 10:20:34
ORA-00334: archived log: '/db5/oradata/NOM/standbyredo2_7.log'

The logical standby was once again frozen and not applying redo. This log corruption issue was to become a major headache.

IronPort and bare LF (linefeed) issues

Posted by jay on Jul 7th, 2005

Having moved over to our IronPort MTAs, we discovered that some people were no longer accepting delivery of some of our messages. The major symptom of this showed up in the logs as “[Errno 54] Connection reset by peer”.

After a bit of searching we found one log entry where there was an error message generated by qmail which pointed to a URL that actually redirects to http://cr.yp.to/docs/smtplf.html. This very useful page explained that some MTAs send messages with lines terminated by a bare LF instead of the required CR + LF pair. One MTA identified as doing that is the IronPort.

A search on the IronPort KB (the one with ridiculous password protection) throws up an article which states:

“IronPort believes that a messaging gateway appliance should be as transparent to the messaging flow as possible and does not reject or repair messages with bare <LF> characters. This means that the behavior of the final destination messaging system with regard to improperly formatted messages (such as those with bare <LF> in them) will override. In other words, if bare <LF> messages are allowed by the destination messaging system, then AsyncOS will not block them. If bare <LF> messages are not allowed, then these will be bounced back to the sender by the IronPort appliance.”

So that narrowed the problem down to some processes at our end that were sending emails with bare LF line terminators. These turned out to be mails generated using the Oracle mail package, possibly where the body of the mail was copied from a Unix text file.

Interestingly these problems never appeared when the MTA was Postfix. It turns out that Postfix was doing some silent fixing of the bare LFs and replacing them with CR + LF pairs.
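One way to avoid depending on the MTA is to normalise line endings before the message is handed over. The following is a minimal sketch, assuming the message is available as raw bytes before submission; the helper names are made up for illustration, and it only deals with bare LF, not bare CR.

```python
import re

# A bare LF is a line feed that is not immediately preceded by a carriage return.
BARE_LF = re.compile(rb"(?<!\r)\n")


def has_bare_lf(message: bytes) -> bool:
    """Report whether any line in the message ends with LF alone."""
    return BARE_LF.search(message) is not None


def fix_bare_lf(message: bytes) -> bytes:
    """Replace every bare LF with a CR + LF pair, leaving existing pairs alone."""
    return BARE_LF.sub(b"\r\n", message)


body = b"Dear registrant,\nYour domain is due for renewal.\r\nRegards\n"
print(has_bare_lf(body))
print(fix_bare_lf(body))
```

Running `has_bare_lf` over outgoing mail first is a cheap way to find which generating process is at fault, rather than silently repairing everything.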

Oracle Logical Standby part I

Posted by jason on Jul 5th, 2005

Databases are very important to us at Nominet. Our main database of domain names has been running Oracle RAC since March 2003; we like the idea of having no single point of failure.

We have been trying to use Dataguard technology, and in particular logical standby, to set up a replicated instance. Our big idea with this technology, and the BIG selling point of logical standby, is that you can have your replicated database open and available to query whilst you are updating it with the updates your primary database receives.

We thought this would be a great mechanism for creating a reporting database, physically separated from the main database that could be used for high volumes of queries without impacting our primary database. There were a few hurdles to overcome as originally our primary RAC cluster was running 9i which did not have support for as many datatypes as we needed. There was also a snazzy new feature in 10g, called immediate apply. This was the killer sell, it means any update done to the primary can immediately be applied to your logical standby as soon as it arrives at the standby destination (prior to 10g, you would define a lag and the primary would switch logfiles on this duration, only at this point could you then apply redo that had been generated).

We were hooked - here was a mechanism of having a real time but offline (as in not pointing at primary db) reporting system. So we took the plunge and upgraded to 10g. We are running Veritas on our cluster, so we had to wait for a patch to enable 10g to happily co-exist with Storage Foundation for Oracle RAC. After weeks of badgering it did appear.

Oh boy does clustering change from Oracle 9i -> 10g. Suffice to say it’s clear Oracle are trying to take business away from cluster vendors. In 9i you had to have vendor clusterware, but from 10g you can run your RAC cluster with only Oracle software, which means that if you already have vendor clusterware you will have 2 pieces of software checking for cluster membership.

With 10g you get a rather flaky piece of software to do the cluster checking called Cluster Ready Services. Woe betide you if you try and run the first release of 10g, 10.1.0.2, because you can find yourself in endless rebooting loops. You have to devote a clustered partition to CRS that contains voting disks (of course we already had voting disks thanks to Veritas).

So back to logical standby: you can now create this without any downtime on the primary. You first create a physical standby, and then run a series of commands to go from physical standby (database not available for reporting while applying redo) to logical standby. We even created standby redo logs just to have the recommended configuration. The final command starts the logical apply engine working:

SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

The first time we went through this things started running very, very slowly - and I mean crawling.

Tip #1 Do NOT run the standby database with FULL transaction consistency

Changing to READ_ONLY (though this is now deprecated) makes things much, much quicker!

This was only the beginning of the problems on our logical standby.

DNS Traffic Analysis

Posted by geoff on Jul 1st, 2005

Resolver Personalities

In the course of developing some DNS tools (which will be the subject
of a future post) we analysed the queries received at the authoritative
.uk name servers. We’ve known for years that many of the hosts sending large
numbers of queries were either misconfigured or attempting to harvest data. However
what caught our eye during this analysis was the various patterns
of behaviour of normal high-volume resolvers.

We observed that many resolvers from which our name servers received
large numbers of queries fell broadly into four categories, or
‘personalities’:

  1. Resolvers serving large communities of interactive users:
    • Lots of queries for A resource records (RRs) with a leading
      ‘www.’ label.
    • Peppered with occasional MX RR queries.
    • Fairly consistent ratio of RRs queried:
      Queries for A RRs: 96%
      Queries for MX RRs: 3%
      Queries for other RRs: 1%
  2. Resolvers serving busy mail exchangers:
    • Most queries for MX RRs, with the occasional query for A RRs.
  3. Resolvers serving large mailing lists:
    • Most queries for MX RRs, with the occasional query for A RRs.
    • Distinct from resolvers serving busy mail exchangers in that the
      queried names frequently appear to be ordered, e.g. alphabetical patterns.
  4. Combination of 1) and 2):
    • More even distribution of queries for A RRs and MX RRs.
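The four personalities above can be turned into a rough classifier. This is only a sketch under stated assumptions: the 0.8 and 0.9 thresholds are invented for illustration (the only ratio we measured is the 96% / 3% / 1% split for interactive users), and the function names are hypothetical.

```python
from collections import Counter


def ordered_fraction(names):
    """Fraction of adjacent query names that are in non-decreasing order."""
    if len(names) < 2:
        return 0.0
    in_order = sum(1 for a, b in zip(names, names[1:]) if a <= b)
    return in_order / (len(names) - 1)


def classify_resolver(qtypes, names=None, dominant=0.8, ordered=0.9):
    """Guess a personality from the query types (and optionally names) seen.

    qtypes is a list such as ["A", "MX", "A"]; names, if given, is the list of
    queried names in arrival order. Both thresholds are assumptions.
    """
    counts = Counter(qtypes)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["A"] / total >= dominant:
        return "interactive users"
    if counts["MX"] / total >= dominant:
        # Mailing lists differ from mail exchangers only by ordered lookups.
        if names and ordered_fraction(names) >= ordered:
            return "mailing list"
        return "mail exchanger"
    return "mixed"


print(classify_resolver(["A"] * 96 + ["MX"] * 3 + ["TXT"]))
```

A resolver that lands in none of these buckets over a reasonable sample is then worth a closer look, which is the point of having the categories at all.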

Even broken resolvers appeared to have distinct personalities:

  1. Resolvers which repeatedly look up the same name or set of names.
  2. Resolvers which appear to be attempting to resolve IP addresses but
    seem to be unintentionally appending “co.uk” to the end instead.
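The second kind of broken resolver is simple to recognise mechanically. A minimal sketch, assuming query names arrive as plain strings and that only IPv4 addresses are of interest; `looks_like_ip_with_suffix` is a made-up name.

```python
import ipaddress


def looks_like_ip_with_suffix(qname, suffix="co.uk"):
    """True if qname is an IPv4 address with the given suffix appended."""
    name = qname.rstrip(".").lower()
    tail = "." + suffix
    if not name.endswith(tail):
        return False
    candidate = name[: -len(tail)]
    try:
        # Raises ValueError unless candidate is a dotted-quad IPv4 address.
        ipaddress.IPv4Address(candidate)
    except ValueError:
        return False
    return True


for qname in ["192.0.2.1.co.uk", "www.example.co.uk"]:
    print(qname, looks_like_ip_with_suffix(qname))
```

Counting how many of a resolver's queries match gives a quick measure of whether it belongs in this category.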

Why is this interesting? It provides a possible basis for
distinguishing between normal and abnormal activity. This may be
useful for implementing tools for detecting patterns of abuse.

Open Resolvers

One other interesting result of our analysis: we looked at resolvers
sending between 1000 and 4000 queries per hour to ns1.nic.uk and found
that just over 50% of them are ‘open resolvers’; that is, they will
resolve recursive queries for
any host. While open resolvers are less of a menace than open mail
relays, which can be used to forward spam, they still pose a threat.
Notably:

  • They can be used to anonymise criminal activity.
  • They can be used to distribute and anonymise domain name harvesting
    activity.
  • They can be used for distributed denial-of-service (DDoS) attacks.

Historically the last of these uses hasn’t been attractive to attackers,
as typically DNS replies result in an amplification of only about
3× in the worst case. However, with the deployment of IPv6
and DNSSEC this amplification can reach as high as 20× to
30×. This could see open resolvers used as yet another weapon
in the escalating DDoS wars.
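To put those factors in perspective, the arithmetic is just a ratio and a multiplication. A small sketch follows; the byte counts and the 1 Mbit/s rate are illustrative numbers, not measurements.

```python
def amplification_factor(query_bytes, response_bytes):
    """How many bytes of reply are produced per byte of query."""
    return response_bytes / query_bytes


def reflected_traffic(attacker_rate, factor):
    """Traffic arriving at the victim for a given attacker sending rate."""
    return attacker_rate * factor


# Hypothetical sizes: a 50-byte query answered by a 150-byte reply.
print(amplification_factor(50, 150))

# An attacker sending 1 Mbit/s of spoofed queries, at 3x and at 30x.
for factor in (3, 30):
    print(factor, reflected_traffic(1.0, factor), "Mbit/s")
```

The jump from 3× to 30× means the same attacker bandwidth does ten times the damage, which is what makes open resolvers more interesting to attackers than they used to be.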
