random technical thoughts from the Nominet technical team

The cause of, and failure to detect, a web site outage

Posted by ian on Jan 18th, 2008

On 10 January 2008 our web site was unavailable for several hours. Both the cause of the outage and our failure to detect it threw up some interesting points.

We have always seen some abuse of our systems. Usually this takes the form of high volumes of requests sent to the whois server. We respond by throttling the traffic or, in more extreme cases, blocking the originating IP address from accessing the server. In early January we became aware that two IP addresses were responsible for 68% of the bandwidth used by our web site. Each address was pulling our list of tags once a second, 24 hours a day. This is the biggest page on the web site. Between them the two IP addresses were responsible for more than 20GB of traffic per month. Either one was using more than ten times the bandwidth of any other address that accessed the web site.

Our web site sits in the DMZ, outside the firewall. We have a Juniper router in front of the DMZ and use ACLs to limit access to the web site. Now, I like Juniper routers, particularly the CLI. Editing the config on a Juniper, especially if you are only used to Cisco, is a pleasure. However, you do have to be aware of the consequences of your actions. When the decision was made to block these IP addresses, a new term was created in the ACL to block access to the web site.

This term consisted of:

  1. Source addresses - the offending IPs
  2. Destination address - our web site
  3. Action to be taken - in this case, discard all packets received

Pretty soon after the block was imposed we were contacted by the owner of the IP addresses. It seems they intended to pull the tag list once a day and had misconfigured the script. I’m of the opinion that there is no need to pull the list like this, but we decided to remove the block anyway. The engineer who had imposed the block decided to leave the term in place, in case it needed to be re-applied. He chose to remove the source addresses only. In doing this we were left with a term that read:

  1. Destination address - the web site
  2. Discard all packets received

This term blocked all access to the web site from outside Nominet, because a term with no source addresses matches every source. This is the first interesting point. The decision to leave the term there was a reasonable one, and if only one IP address had been removed then we would have been fine. [There is an option to deactivate a whole term, but he was unaware of this.] It makes me realise that we need a proper firewall for the DMZ. Router ACLs are only really applicable at layer 3.
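To make the failure mode concrete, here is a toy model of term matching in Python. It is only a sketch, not the router's real configuration or evaluation logic: the `Term` class, the `evaluate` function, the default "accept" action and all the addresses (taken from the documentation ranges) are assumptions made up for illustration. The one behaviour it is meant to show is that an empty list of source addresses matches every source.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class Term:
    """Hypothetical stand-in for one ACL term."""
    action: str
    sources: list = field(default_factory=list)       # empty = any source
    destinations: list = field(default_factory=list)  # empty = any destination
    active: bool = True                                # False = term is skipped

    def matches(self, src: str, dst: str) -> bool:
        def hit(addr, nets):
            # A condition with no entries places no restriction at all.
            return not nets or any(ip_address(addr) in ip_network(n) for n in nets)
        return hit(src, self.sources) and hit(dst, self.destinations)


def evaluate(terms, src, dst, default="accept"):
    """Return the action of the first active term that matches."""
    for term in terms:
        if term.active and term.matches(src, dst):
            return term.action
    return default  # assumed default, for this sketch only


WEB = "192.0.2.80"                                # made-up web site address
OFFENDERS = ["198.51.100.10", "198.51.100.11"]    # made-up offending IPs
VISITOR = "203.0.113.5"                           # made-up ordinary visitor

# The term as first written: only the two offenders are discarded.
block = Term("discard", sources=OFFENDERS, destinations=[WEB])

# The term after both source addresses were removed: everyone is discarded.
emptied = Term("discard", destinations=[WEB])

# The safer alternative: keep the term intact but deactivate it.
parked = Term("discard", sources=OFFENDERS, destinations=[WEB], active=False)

for name, term in [("block", block), ("emptied", emptied), ("parked", parked)]:
    print(name, evaluate([term], VISITOR, WEB), evaluate([term], OFFENDERS[0], WEB))
```

In this model the `emptied` term discards the ordinary visitor as well as the offenders, while the `parked` term lets both through and can be switched back on without retyping anything.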

The next interesting point is that we were unaware that the web site was not visible to the outside world. The term was removed once we were made aware of the outage, but this information came from outside of Nominet. We have a sophisticated monitoring system, based around Nagios. This gives us a fully configurable and timely view of our systems, but only as seen from within Nominet. We already put our authoritative nameservers within other people’s ASes, so these would be candidate sites for monitoring stations. But one thing I want to do is make more use of things like the RIPE NCC DNSMON service. This gives us a global view of .uk authoritative nameserver availability. At present we use this on an ad hoc basis when diagnosing nameserver incidents. I want to incorporate the raw data (which we have access to) into our monitoring system to ensure we see events that our internal monitoring alone would not detect. That includes incidents which segment the network, for example, where we can still see the nameserver but half the internet cannot.
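The idea of combining the inside view with outside views can be sketched in a few lines of Python. This is a hypothetical illustration only: the `classify` function, the status labels and the vantage point names are invented here, and it does not reflect how Nagios or DNSMON actually represent their data. It simply assumes each check has already been reduced to "reachable" or "not reachable".

```python
def classify(internal_ok: bool, external_ok: dict) -> str:
    """Combine one internal check with checks from external vantage points.

    external_ok maps a (made-up) vantage point name to True/False.
    """
    if not external_ok:
        # No outside view at all: this is the blind spot described above.
        return "ok (internal view only)" if internal_ok else "down"

    up = sum(1 for ok in external_ok.values() if ok)

    if up == len(external_ok):
        return "ok" if internal_ok else "internal fault"
    if up == 0:
        return "unreachable externally" if internal_ok else "down"
    return "partial"  # some of the outside world can see us, some cannot


# The 10 January outage as it would have looked with outside probes:
print(classify(True, {"probe-a": False, "probe-b": False}))

# A segmented network, where only part of the internet loses sight of us:
print(classify(True, {"probe-a": True, "probe-b": False}))
```

With only the internal check, the first case would have reported "ok (internal view only)", which is exactly what our monitoring told us on the day.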

I would encourage anyone who can to sign up for a RIPE TTM box to increase the coverage that DNSMON has.
