Nominet’s nameservers are hosted by various Internet organisations (ISPs, IXPs, and others), but are managed by Nominet staff. The zones are updated dynamically by a process that responds to DNS-affecting Automaton requests. This process runs at least once a minute, and the larger zones can change as rapidly as that. All eleven authoritative nameservers are monitored to ensure that their zones are up to date, within a small margin to allow for changes to propagate to the various locations.
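This post does not describe how the monitoring works internally, so the following is only a rough sketch of one common approach: compare the SOA serial each nameserver reports against the serial on the primary, and flag any server that lags by more than a small margin. It assumes the serial increases by one per zone update. The function names, the `max_lag` margin, the serial values and every hostname other than ns7.nic.uk are made up for illustration.

```python
# Sketch only: assumes zone freshness is judged by SOA serial numbers and
# that the serial goes up by one per update. All names here are hypothetical.

SERIAL_MODULUS = 2 ** 32      # SOA serials are unsigned 32-bit values
HALF_RANGE = 2 ** 31


def serial_lag(primary_serial: int, secondary_serial: int) -> int:
    """Return how many increments the secondary is behind the primary.

    Uses wrap-around (RFC 1982 style) arithmetic, so a serial that has
    rolled over past 2**32 - 1 is still compared correctly. A negative
    result means the secondary is ahead of the primary.
    """
    diff = (primary_serial - secondary_serial) % SERIAL_MODULUS
    return diff if diff < HALF_RANGE else diff - SERIAL_MODULUS


def stale_nameservers(primary_serial: int,
                      observed: dict[str, int],
                      max_lag: int = 2) -> list[str]:
    """Return the nameservers whose serial lags by more than max_lag."""
    return sorted(
        name
        for name, serial in observed.items()
        if serial_lag(primary_serial, serial) > max_lag
    )


if __name__ == "__main__":
    # Made-up serials: one server is a single update behind (within the
    # margin), while ns7.nic.uk has stopped picking up changes.
    observed_serials = {
        "ns-a.example": 1000,
        "ns-b.example": 1001,
        "ns7.nic.uk": 990,
    }
    print(stale_nameservers(1001, observed_serials))
```

The margin exists for the reason given above: a change takes a moment to reach every location, so a server that is one or two updates behind is not necessarily broken, whereas one that keeps falling further behind is.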
On the evening of Wednesday 10 October our monitoring system determined that one of the nameservers, ns7.nic.uk, was falling behind. The on-call engineer brought this to my attention. His first concern was whether we should report it in our new Service Announcements RSS feed. I set that question aside for the moment and focused on stopping the nameserver from responding with potentially incorrect information. I feel that a DNS server which responds with an incorrect answer whilst claiming to be authoritative is worse than one that does not respond at all. Our resilience to nameserver failure is good enough that we can sustain an outage on one nameserver quite easily. My instruction was therefore to prevent this nameserver from responding to DNS requests.
Once the nameserver process had been stopped on ns7.nic.uk, we had to decide whether this was worthy of a Service Announcement. My decision was no, it wasn’t. The Service Announcement feed is for events that will affect, or are currently affecting, a large number of Nominet customers. DNS is a transparent service: while the other nameservers keep answering, the loss of one should go unnoticed by users. The lack of DNS, or serious degradation of the service, would be enough to warrant an announcement, but not the loss of a single node.
So, what caused the failure? The company providing hosting for ns7.nic.uk has several different transit providers. It appears that one of these started blocking outgoing traffic from port 53 over both TCP and UDP at around 21:00 UTC. As a result, the nameserver was still receiving NOTIFY messages from other nameservers, but its update requests were being blocked at the border of the hosting company’s network, so it never fetched the changes it was being told about. At the time of writing we are still not clear why this block was put in place, but we are chasing it. The hosting company tore down the BGP session with that transit provider and normal service was resumed. We are still in negotiations with all parties, so for now the session remains down.