Power outage and our response
On Friday 28 November 2008 we suffered a power outage at our Oxford HQ lasting for around 90 minutes. We have a generator and N+1 UPS, so we expected no issues if the power supply to the building went off. However, the generator failed to supply power, which meant that we were running off the UPS for the length of the outage.
We test the generator regularly and were sure it would kick in. Trying to switch it over by hand produced an error, as the generator could sense there was power coming from the grid. Meanwhile, to extend UPS runtime, we powered down all non-essential servers and left only essential services running. However, the server room air-conditioning is not powered by UPS. This was a conscious design decision: it should be powered by the generator during an outage and, unlike server hardware, does not need ‘always-on’ power. The time it takes for the air-con to restart once the generator kicks in would not noticeably affect the overall temperature in the server room. But without air-con we knew we would eventually hit a temperature at which the servers could not run. We ended up opening all available windows and doors to give us some more time.
During the outage we were able to offer all of our online services without interruption. We were monitoring the remaining UPS runtime and room temperature constantly. Every half hour we discussed whether we should move our services so that they were running from our backup datacentre, though power returned before this was necessary.
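The half-hourly decision described above can be written down as a simple rule. The following is only a sketch of that reasoning, not what we actually ran: the `Reading` type, the `should_fail_over` function and both threshold values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One snapshot of the two values we were watching during the outage."""
    ups_minutes_left: float  # remaining UPS runtime, in minutes
    room_temp_c: float       # server room temperature, in degrees Celsius


# Hypothetical thresholds, for illustration only. Real values would depend on
# how long a move to the backup datacentre takes and on hardware limits.
MIN_UPS_MINUTES = 20.0
MAX_ROOM_TEMP_C = 35.0


def should_fail_over(reading: Reading,
                     min_ups_minutes: float = MIN_UPS_MINUTES,
                     max_room_temp_c: float = MAX_ROOM_TEMP_C) -> bool:
    """Return True if either limit has been reached and services should move."""
    return (reading.ups_minutes_left <= min_ups_minutes
            or reading.room_temp_c >= max_room_temp_c)


if __name__ == "__main__":
    # Plenty of runtime and a cool room: stay put.
    print(should_fail_over(Reading(ups_minutes_left=60, room_temp_c=24)))
    # Runtime still fine, but the room is too hot: move services.
    print(should_fail_over(Reading(ups_minutes_left=60, room_temp_c=36)))
```

Either condition alone is enough to trigger the move, because running out of UPS power and overheating would each take the servers down on their own.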
Whilst services continued, we did have some problems with communications. One problem area was our phone system. Desktop handsets were not receiving power, even though our VoIP server was running, so no calls were getting through. It took some time to put up an auto-response message, and we have now changed the system configuration so that this is much easier to do.
We put a notice up on the website to tell anyone trying to contact us that we were having power-related problems. This notice stated that the problems were due to a ‘power outage in the Oxford area’. This was not true. The original problem had been widespread, but the issues we were seeing were local to our building. At first we in the technical department thought that this was deliberate obfuscation to absolve Nominet of responsibility for the problems we were seeing. I now believe this was an honest mistake caused by miscommunication during a stressful time. But I do think that, as a principle, we should be honest about what caused our localised problems. The event post mortem highlighted that we needed a better structure for internal communication during emergencies.
So, what went wrong? I designed the power resiliency for our Oxford HQ, and I believe that my design was right despite the failure we saw on that Friday. Any services which cannot suffer interruption are protected by redundant UPS. Those which can suffer a small amount of downtime will be powered by the on-site generator in the event of power loss to the building. Testing had shown that the generator worked when needed. We should not have seen any outage longer than a few minutes on these non-critical services. However, a momentary loss of one phase into the building had tripped the Motorised Mains Circuit Breaker (MMCB). This fault condition does not automatically start the generator; indeed, it prevented us from switching it on manually. Electrical engineering is not my forte, and I am still trying to understand why it was configured like this, though I have been assured it is standard practice.
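To make the failure mode easier to follow, here is a toy model of the behaviour as I understand it. It is a deliberate simplification and not the real control logic of the switchgear: the `PowerState` type, the three functions and the exact conditions are my own assumptions, pieced together from what we observed on the day.

```python
from dataclasses import dataclass


@dataclass
class PowerState:
    """Simplified view of the supply (hypothetical model, not real switchgear)."""
    mains_sensed: bool   # the generator controller can sense power from the grid
    mmcb_tripped: bool   # the Motorised Mains Circuit Breaker has tripped


def building_has_mains(state: PowerState) -> bool:
    """Grid power only reaches the building if the MMCB has not tripped."""
    return state.mains_sensed and not state.mmcb_tripped


def generator_auto_starts(state: PowerState) -> bool:
    """Assumed rule: auto-start only on a loss of grid power, not on a breaker fault."""
    return not state.mains_sensed and not state.mmcb_tripped


def manual_start_allowed(state: PowerState) -> bool:
    """Assumed rule: manual switch-over is refused while grid power is sensed."""
    return not state.mains_sensed


if __name__ == "__main__":
    # The situation on the day: the grid was back, but the MMCB had tripped.
    incident = PowerState(mains_sensed=True, mmcb_tripped=True)
    print(building_has_mains(incident))     # no power into the building
    print(generator_auto_starts(incident))  # generator does not start itself
    print(manual_start_allowed(incident))   # and cannot be started by hand
```

In this model the only way out of the incident state is to reset the breaker, which matches what eventually restored power.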
The solution turned out to be very simple. The MMCB was reset and power returned. The post mortem identified that there was a gap in our knowledge of how the power was provided to the building. We now have that knowledge, but we are also considering retro-fitting a Building Monitoring System (BMS). This was considered during the building design stage but rejected on cost-benefit grounds.
I would like to thank Andy from A1 Electrical who, despite not being bound by a support contract, arrived promptly and diagnosed and fixed the problem.

