random technical thoughts from the Nominet technical team

Power outage and our response

Posted by ian on Dec 24th, 2008

On Friday 28 November 2008 we suffered a power outage at our Oxford HQ lasting for around 90 minutes. We have a generator and N+1 UPS, so we expected no issues if the power supply to the building went off. However, the generator failed to supply power, which meant that we were running off the UPS for the length of the outage.

We test the generator regularly and were sure it would kick in. Trying to switch it over by hand produced an error, as the generator could sense there was power coming from the grid. Meanwhile, to extend UPS runtime, we powered down all non-essential servers and left only essential services running. However, the server room air-conditioning is not powered by UPS. This was a conscious design decision: it should be powered by the generator during an outage, and does not need ‘always-on’ power, unlike server hardware. The time it takes for the air-con to restart once the generator kicks in would not noticeably affect the overall temperature in the server room. But without air-con we knew we would eventually hit a temperature at which the servers could not run. We ended up opening all available windows and doors to give us some more time.

During the outage we were able to offer all of our online services without interruption. We were monitoring the remaining UPS runtime and room temperature constantly. Every half hour we discussed whether we should move our services so that they were running from our backup datacentre, though power returned before this was necessary.

Whilst services continued we did have some issues around communications. One problem area was our phone system. Desktop handsets were not receiving power, even though our VoIP server was running, so no calls were getting through. It did take some time to put up an auto-response message, and we have now changed the system configuration so that this is much easier to do.

We put a notice up on the website to notify anyone trying to contact us that we were having power-related problems. This notice stated that the problems were due to a ‘power outage in the Oxford area’. This was not true. The original problem had been widespread, but the issues we were seeing were local to our building. At first we in the technical department thought that this was deliberate obfuscation to absolve Nominet of responsibility for the problems we were seeing. I now believe this was an honest mistake caused by miscommunication during a stressful time. But I do think that, as a principle, we should be honest about what caused our localised problems. The event post mortem highlighted that we needed a better structure for internal communication during emergencies.

So, what went wrong? I designed the power resiliency for our Oxford HQ, and I believe that my design was right despite the failure we saw on that Friday. Any services which cannot suffer interruption are protected by redundant UPS. Those which can suffer a small amount of downtime will be powered by the on-site generator in the event of power loss to the building. Testing had shown that the generator worked when needed. We should not have seen any outage longer than a few minutes on these non-critical services. However, a momentary loss of one phase into the building had tripped the Motorised Mains Circuit Breaker (MMCB). This fault condition does not automatically start the generator; indeed, it prevented us from switching it on manually. Electrical engineering is not my forte, and I am still trying to understand why it was configured like this, though I have been assured it is standard practice.

The solution turned out to be very simple. The MMCB was reset and power returned. The post mortem identified that there was a gap in our knowledge of how the power was provided to the building. We now have that knowledge, but we are also considering retro-fitting a Building Monitoring System (BMS). This was considered during the building design stage but rejected on cost-benefit grounds.

I would like to thank Andy from A1 Electrical who, despite not being bound by a support contract, arrived promptly, diagnosed and fixed the problem.

Apple and libnet

Posted by alexd on Dec 16th, 2008

A lot of people I know get very excited by Apples. In the interests of spreading my bets, one of my machines is an Apple. Maybe it’s just me, but I just don’t get the usability benefits that everyone raves about.

For example, I found this great little library called libnet. It allows you to do raw socket manipulation in a platform-independent way, hiding a load of gory details. I was having some trouble testing libnet code, so I thought I’d try everything on my own network (to make sure that firewalls weren’t getting in the way). “Great!”, I thought, “I’ll try my Mac”.

Unfortunately, although libnet compiles and installs on Mac OS X, you can’t actually use it to write to raw sockets:

“Write error: libnet_write_raw_ipv4(): -1 bytes written (Invalid argument)”

All I can find is this terse response from Apple.

The solution? Boot up the Linux VM! :0)

You could reasonably point out that there is simply no support for the latest OS X in a library which was last released several years ago - but the fact remains, it is unusable on a Mac! I have had similar issues with Java and Ruby code; it seems like I am tending to do more work in VMs, and less work on the Mac itself.

Maybe it’s just me…

Dnsruby now compatible with Ruby 1.9.1

Posted by alexd on Dec 5th, 2008

When Ruby version 1.9 was released at Christmas, I was awfully excited. I had a couple of days to make a release of dnsruby before I headed off to New Zealand, and I was very keen for the release to be compatible with Ruby 1.9.

I spent a few very stressful days, and released dnsruby with no support for 1.9. Everything just seemed broken, and I still had a long way to go! Happily for me, it turned out I was not alone. As Dave Thomas notes, 1.9.1 was “less than stable”.

So I was pleased to be able to spend some time with Ruby 1.9.1 recently. Again, nothing worked at first, but there wasn’t too much to getting dnsruby working with Ruby 1.9.1. The main issues I had were to do with Strings and binary data (presumably owing to the internationalisation work which has gone into Ruby 1.9). At least, that’s what they turned out to be once I’d debugged some very odd symptoms!
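
To give a flavour of the String changes, here is a minimal sketch. It is not code from dnsruby, and the helper names `byte_at` and `as_binary` are made up for illustration. In Ruby 1.8, indexing a String with an integer returned a byte value; in 1.9 it returns a one-character String, and every String carries an encoding. Both matter when the "string" is really part of a DNS packet.

```ruby
# Two 16-bit big-endian fields packed into raw bytes: BE EF 01 00.
header = [0xBEEF, 0x0100].pack("nn")

# Ruby 1.8:  header[0] is the Integer 190 (0xBE).
# Ruby 1.9+: header[0] is a one-character String, so code that does
#            arithmetic on header[0] no longer behaves as it did.
first = header[0]

# Hypothetical helper: read the byte at +index+ as an Integer.
# String#unpack behaves the same way on 1.8 and 1.9.
def byte_at(str, index)
  str.unpack("C*")[index]
end

# Hypothetical helper: treat a String as raw bytes where encodings
# exist (1.9+); on 1.8 there is nothing to do.
def as_binary(str)
  if str.respond_to?(:force_encoding)
    str.dup.force_encoding("BINARY")
  else
    str
  end
end

byte_at(header, 0)               # => 190
as_binary("\xC3\xA9").length     # => 2 (counted in bytes, not characters)
```

Going through `unpack` (or `String#bytes` on 1.8.7 and later) keeps byte-level code independent of how a given Ruby version indexes Strings.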

I’m happy to be able to announce that the latest version of dnsruby (1.23) is now fully compatible with Ruby versions 1.8.7 and 1.9.1. There are a few more platform checks than I’d have liked (dnsruby now checks whether it is running on Windows or Java, and what version of Ruby it is running). I can only see these proliferating if Dave Thomas’ recent advice to fork Ruby is followed.
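
For anyone wondering what such checks look like, the following is a rough sketch built on the standard `RUBY_PLATFORM` and `RUBY_VERSION` constants. The method names are hypothetical and the checks dnsruby actually performs may well differ.

```ruby
# Hypothetical helper: JRuby reports "java" as its platform.
def jruby?(platform = RUBY_PLATFORM)
  platform =~ /java/ ? true : false
end

# Hypothetical helper: native Windows builds report mswin or mingw.
# Note that "darwin" (Mac OS X) must not match, so a bare /win/ is wrong.
def windows?(platform = RUBY_PLATFORM)
  platform =~ /mswin|mingw|cygwin/ ? true : false
end

# Hypothetical helper: compare the version numerically, not as a String.
def ruby_19_or_later?(version = RUBY_VERSION)
  major, minor = version.split(".").map { |part| part.to_i }
  major > 1 || (major == 1 && minor >= 9)
end

windows?("i386-mswin32")      # => true
windows?("x86_64-darwin20")   # => false
ruby_19_or_later?("1.8.7")    # => false
ruby_19_or_later?("1.9.1")    # => true
```

One caveat with this approach: under JRuby the platform string is "java" whatever the underlying operating system, so the Windows check above would need another source of information there.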

Other improvements to the library include the DLV record, SHA-2 support in DS record processing, and various security features and fixes.
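
As background on the SHA-2 item: a DS digest is a hash over the DNSKEY owner name (in canonical wire format) followed by the DNSKEY RDATA, and SHA-256 is the SHA-2 variant defined for DS records. The sketch below shows that calculation using only Ruby's standard `Digest` library. It is an illustration, not dnsruby's API; `name_to_wire` and `ds_sha256` are hypothetical names, and the key bytes are placeholder data rather than a real key.

```ruby
require "digest"

# Hypothetical helper: encode a domain name as length-prefixed labels
# ending in the zero-length root label, lower-cased for canonical form.
def name_to_wire(name)
  labels = name.downcase.split(".")
  labels.map { |label| [label.length].pack("C") + label }.join + "\0"
end

# Hypothetical helper: digest = SHA-256(owner name | DNSKEY RDATA),
# where DNSKEY RDATA = flags (16 bits) | protocol | algorithm | public key.
def ds_sha256(owner, flags, protocol, algorithm, public_key)
  rdata = [flags, protocol, algorithm].pack("nCC") + public_key
  Digest::SHA256.hexdigest(name_to_wire(owner) + rdata)
end

# Placeholder key material, for illustration only.
fake_key = [1, 2, 3, 4].pack("C*")
digest = ds_sha256("example.com.", 257, 3, 8, fake_key)
digest.length   # => 64 (hex characters for a 32-byte digest)
```

Because the owner name is lower-cased before hashing, `EXAMPLE.com` and `example.com.` give the same digest.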

As always, I’m keen to get any feedback on the release.
