Some days you’re the bear and other days the bear gets you …

There are times when those of us who work in IT consulting have the same kinds of experiences that customers have when it all goes horribly pear-shaped. So, this is a cautionary tale that proves even the “gods of IT” (hmmm, maybe I’ll trademark that) are mere mortals.

I shouldn’t have to explain the need for one or more UPSes in a server environment but, for those who may not know, a UPS provides clean, filtered power to computers and peripherals and also provides backup power from a battery in case of a power failure. In most cases, the UPS will run on battery for a set amount of time and will then signal servers and devices to shut down cleanly if the battery runs down past a certain level. The idea is that the UPS provides “bridge power” for a short outage, say 15 minutes or so, and then provides the mechanism to cleanly shut down servers and devices if the outage runs longer. There are some gotchas waiting out there in UPS land, as not all UPSes are created equal; you may have super fast servers that are super sensitive to power fluctuations and require fast-switching UPSes, whereas other devices may be more tolerant of a “slow-switching” UPS. Suffice it to say, you do need to do your homework when selecting and purchasing a UPS, but you can save yourself a lot of hassle by following one simple rule: don’t cut corners on the UPS!
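To make that shutdown mechanism a little more concrete, here is a minimal sketch of the kind of logic a monitoring agent runs on a protected server. It assumes Network UPS Tools (NUT) and its upsc utility are installed; the device name “rackups@localhost”, the 50% threshold, and the polling interval are placeholders for illustration, not details from our actual rack.

```python
#!/usr/bin/env python3
"""Sketch: shut a host down cleanly once the UPS battery runs low.

Assumes Network UPS Tools (NUT) is installed and the UPS is defined
as "rackups" in ups.conf -- the device name and threshold below are
placeholders, not values from any real configuration.
"""
import subprocess
import time

UPS_NAME = "rackups@localhost"   # hypothetical NUT device name
MIN_CHARGE = 50                  # shut down once charge drops below 50%
POLL_SECONDS = 30

def read_ups_variable(variable: str) -> str:
    """Ask NUT's upsc utility for a single UPS variable."""
    out = subprocess.run(
        ["upsc", UPS_NAME, variable],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

while True:
    status = read_ups_variable("ups.status")        # e.g. "OL" or "OB DISCHRG"
    charge = int(float(read_ups_variable("battery.charge")))
    if "OB" in status and charge < MIN_CHARGE:
        # On battery and below the threshold: trigger a clean shutdown.
        subprocess.run(["shutdown", "-h", "+1", "UPS battery low"], check=False)
        break
    time.sleep(POLL_SECONDS)
```

In practice you would let the UPS vendor’s agent or NUT’s own upsmon do this job; the point of the sketch is simply that the “clean shutdown” only happens if the UPS is alive to report its status.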

Anyway, we followed our own best practice when we built out our server rack and specified multiple UPSes to support our servers (currently 5 in the rack) and our SAN. We are lucky in that we are located in the Save-On-Foods Memorial Centre in Victoria; because we share infrastructure facilities with the building, our server rack sits in an area that is supplied with backup power by a diesel generator. We are supposed to have generator power within about 15 seconds of a power failure, so our UPSes should really only have to provide bridge power for that long. No big deal, right? Well, here’s where the cautionary tale comes in …

An infrastructure is only as good as the planning that goes into it and the execution that puts it in place. If you miss something, you are going to pay. We planned our UPS installation properly and we thought we had executed correctly, but sometimes things just don’t work as planned. The first inkling that something might not be quite right came when we suffered a sudden failure of one of our big VMware servers: we lost all of the VMs on it at once. There was no indication of any sort of problem prior to the outage, so we were caught totally by surprise. When I went down to the server room to check things out I discovered our rack had been moved by one of the building engineers and the power plug for one of the UPSes had been pulled out of its socket. Obviously the UPS had gone on battery, the batteries had flattened, and the UPS had shut down. The server in question had BOTH its power supplies plugged into that one UPS (NOT best practice), so there was the reason for the server crash.

The flattened UPS would not power up, which made sense as the unit was configured to NOT turn on until the batteries had recharged to at least 50% capacity. I plugged the server into the other UPS, started it up, and made a mental note to check the UPS config once the flattened unit restarted, as I wanted to know why we weren’t notified of the power failure.

After the UPS came back online (a few hours later) I discovered that while we had configured email notification for alerts from the UPS, the “enable” tick box hadn’t been checked. DOH! I fixed that and thought we were good at that point. I went back down to the server room and moved the power connection from one of the power supplies on the crashed server back to the UPS in question. That should have left us in good shape, right? Well, no, and here’s why. One of our UPSes (the one that had flattened) is a fast-switching type and the other is the more traditional slow-switching type. We have at least one server that needs the fast-switching type, so we try to split the power supplies on the servers between the two UPSes, the idea being that the fast-switching UPS will keep the boxes running immediately following the power failure (the fast switch) and then both UPSes will supply power for the prescribed period. This is an okay plan IF both UPSes work properly; if the fast-switching UPS has an issue, there is a better than even chance that the servers requiring it will crash during a power failure. I’m sure you know where this is going …
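Since the built-in email notification was the thing that silently failed us, a belt-and-braces option is to run a second, independent alert path that doesn’t depend on the vendor’s tick box. Here is a minimal sketch of that idea, again assuming NUT’s upsc utility; the device name, SMTP relay, and addresses are placeholders rather than our real configuration.

```python
#!/usr/bin/env python3
"""Sketch: e-mail an alert the moment the UPS reports it is on battery,
independent of the UPS's own (possibly mis-configured) notifications.
All names below are placeholders."""
import smtplib
import subprocess
from email.message import EmailMessage

UPS_NAME = "rackups@localhost"     # hypothetical NUT device name
SMTP_HOST = "mail.example.com"     # placeholder SMTP relay
ALERT_TO = "oncall@example.com"    # placeholder recipient

def ups_status() -> str:
    """Return the UPS status flags (e.g. "OL", "OB DISCHRG") via upsc."""
    out = subprocess.run(
        ["upsc", UPS_NAME, "ups.status"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

status = ups_status()
if "OB" in status:                 # "OB" = on battery in NUT status flags
    msg = EmailMessage()
    msg["Subject"] = f"UPS {UPS_NAME} is on battery ({status})"
    msg["From"] = "ups-monitor@example.com"
    msg["To"] = ALERT_TO
    msg.set_content("The UPS has switched to battery power. Check the server room.")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)
```

Run something like this from cron every minute or two and you will hear about a pulled plug long before the batteries flatten, even if the UPS’s own notification config is wrong.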

A few days after the original UPS incident the building experienced a real power failure (I had already secured the power cables so they couldn’t be pulled out of the sockets again) and we lost servers again even though the generator had kicked in! No alerts, just crashed servers. This made no sense, so down to the server room I went. Lo and behold, the UPS that had been the problem earlier was dead as a doornail even though the server room had power! No amount of fiddling would wake it up; it appeared to have failed completely. I shuffled power cables around, even put some cables directly onto the mains, and brought all of the crashed servers back online. Time to investigate what had happened.

To make a long story short, it appeared that our fast-switching UPS was actually faulty. We discovered this once the UPS finally came back online and we ran a calibration test on the unit. As soon as we kicked off the calibration (a simulated power failure) the UPS crashed horribly and went dead as a doornail again. This explained why the servers crashed: power would have been cut off immediately (no battery power) AND the UPS could not signal the servers to shut down, because the whole unit failed at once.

The UPS had become faulty sometime between the last time we ran tests and calibrations on the unit and the power failure in which it actually failed. All of our planning and design were compromised at that point because we had designed around both a fast-switching and a slow-switching UPS without taking into account the fact that a UPS *could* fail totally. To be fair, none of us had ever come across a situation like this one. Normally we see batteries start to fail (and the UPS tells you) or there is a wiring fault (and the UPS tells you), and that can be easily fixed. A total failure of the UPS just seemed out of the question until this happened.

Moral of the story: plan for the worst and design accordingly. In our case we have already ordered an additional fast-switching UPS (bigger and better, too!) and our UPS vendor is replacing the failed unit under warranty. And, yes, we will test more frequently; you should, too! A UPS infrastructure is ONLY as good as the UPSes in it: are they healthy? Do they alert properly? Do they shut your systems down cleanly? Don’t wait for the inevitable power failure to find out, as you may not be as lucky as we were. We didn’t suffer any permanent damage to our systems (although egos may have been slightly bruised), but that was more fluke than anything else. Learn from this cautionary tale; we certainly have!!
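If you want to automate the “test more frequently” part, something like the following scheduled health check is a reasonable starting point. It is only a sketch, again assuming NUT’s upsc utility; the device name and thresholds are placeholders you would tune for your own gear, and it complements (rather than replaces) the periodic calibration/runtime tests the UPS itself supports.

```python
#!/usr/bin/env python3
"""Sketch: periodic UPS health check (e.g. run daily from cron).
Fails loudly if the UPS is off-line, the battery is weak, or the
estimated runtime is too short. Names and limits are placeholders."""
import subprocess
import sys

UPS_NAME = "rackups@localhost"    # hypothetical NUT device name
MIN_CHARGE = 90                   # a healthy, idle UPS should be near full charge
MIN_RUNTIME_SECONDS = 900         # want at least ~15 minutes of estimated runtime

def read_ups_variable(variable: str) -> str:
    """Fetch one UPS variable via NUT's upsc utility."""
    out = subprocess.run(
        ["upsc", UPS_NAME, variable],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

problems = []
status = read_ups_variable("ups.status")
if "OL" not in status:
    problems.append(f"UPS is not on-line (status: {status})")
if float(read_ups_variable("battery.charge")) < MIN_CHARGE:
    problems.append("battery charge below expected resting level")
if float(read_ups_variable("battery.runtime")) < MIN_RUNTIME_SECONDS:
    problems.append("estimated runtime too short")

if problems:
    print("UPS health check FAILED: " + "; ".join(problems))
    sys.exit(1)   # non-zero exit lets cron/monitoring raise the alarm
print("UPS health check passed")
```

A check like this wouldn’t have predicted our unit dying outright, but it does answer the “is it healthy and will it alert?” questions on a schedule instead of waiting for the next outage to tell you.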