Know when to hold ’em and know when to fold ’em

All of us in “the IT biz” have war stories about things that have gone horribly wrong at one time or another, and we all have the battle scars to show off as proof of the battles we have fought (and hopefully won). And all of us have learned (or will learn, it’s inevitable) some hard truths. One hard truth is that there can come a time when things (servers, networks, domains, whatever) are just too broken to fix, and the best thing you can do is nuke them from high orbit and start fresh. The trick is learning to tell when things are too far gone to fix and then making the decision to call in the strike force. This is a lousy place to be when it’s your own kit that is a mess. It’s a much harder place to be when the kit is someone else’s and you have been called in to fix it.

I have been working over the past few weeks with a new customer that ended up in precisely this position, and we just had to make the decision to nuke the current Windows domain and start fresh. The customer’s sole IT admin had been tasked with migrating them from SBS2003 to SBS2011, and he did all the “right things”: he read up on the migration procedure, he purchased a new, well-spec’d server (or so he thought) and he attempted to follow the Microsoft migration path step by step. So how could it all end up so wrong? The answer is that a bunch of things added together to create a “perfect storm” of problems.

Some of the major things were as follows:

Problem 1 was that he was trying to migrate out of a very “customised” SBS2003 installation, and he didn’t have enough exposure to SBS to know that the installation was very “non-standard”. SBS2003 had been installed by a third party that broke a lot of SBS rules, so things were already in a bad state from an SBS migration point of view. SBS is a wonderful package, but you DO have to resist the urge to break out of the SBS way of doing things or all sorts of issues can crop up to bite you. This is especially true when it comes time to migrate to a newer version.

Problem 2 was a DFS installation that spanned three servers in two separate physical locations (across a VPN). It was corrupted by the improper removal of the original SBS2003 server, as well as by the earlier improper removal of a member server on the domain. DFS was absolutely broken, yet all of its AD objects were still intact. AD on the new SBS2011 server was flooding the server logs with errors about the broken DFS, and we found that we could NOT remove references and links to long-dead servers because of the DFS references.

Problem 3 was the new server hardware itself. The admin had been forced into buying a whitebox server that was supposed to offer hardware RAID5 as well as a few other features. What the box actually offered was “mock RAID” via Intel’s hideous ICH10 on-board chipset. Performance off the RAIDed SATA disks was awful, and SBS was initially loaded while the box was missing half its required RAM and a second processor (adding the missing RAM and the second processor made little difference to performance in the end).

There were a lot of other problems, but those are the major ones. By the time we got called in there was not a lot that we could do. We tried all sorts of things to get around the issues, including nuking Symantec a/v on the server (for whatever reason it took the SBS box into the basement) and all sorts of band-aid fixes to try to make the server perform. Nothing worked, and we really weren’t sure whether we were fighting just Windows domain and configuration issues or whether there was also a fundamental hardware problem at play. In the end it was decided that the server would be replaced with a “properly” configured Dell T320 (kick-ass hardware RAID 10, SAS drives, the whole enchilada) and that we would migrate SBS2011 to it and clean things up during the migration.

What is it they say about the “best laid plans of mice and men”?

The SBS migration to the new Dell flat-out failed: Exchange would not install, SharePoint would not install, IIS would not install and there were many, many other errors. That’s when the lights came on and we decided to just replace the Windows domain with a net-new domain, using a clean SBS2011 install on the new server. And that brings us back to the original point of this post, which is to learn to recognize when things are just too broken to continue. If your gut is telling you that things are a total mess then they probably are. My gut was screaming at me after my first hour on site with the customer, but they had to learn, the hard way, that things were just too far gone to cleanly recover. Initially, there was no way they were going to look at replacing the whitebox server and there was no way they would look at chucking their current Windows domain. And that’s the trap that most of us fall into: we want to believe that anything we do with our technology is “reversible” and “fixable” and, frankly, it’s not. A lot of the things we do truly are one-way streets, so we had better know what we are doing before we hit the buttons.

Was I as culpable in this case as my customer? Oh yes, without any shadow of a doubt. I ignored what my gut was telling me (and the evidence too, for that matter) and tried to work with what was obviously a horrendously broken domain and server, rather than laying out the arguments up front for nuking, paving over and starting fresh. We wasted a lot of time and effort that should have gone into migrating to the new domain.

The takeaway is simple: stick with what the evidence and your gut are telling you when you have to make the go/no-go decision, and strive to keep other factors (the emotional ones) out. If you are faced with a hard IT decision it is best to lay out the possible alternatives and then evaluate them as coldly as possible. There is going to be a lot of work involved no matter what, so you want to ensure that the work will result in the best possible outcome EVEN if it means you have to completely change track.
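For what it’s worth, here is one way to force yourself to evaluate the alternatives “coldly”: a tiny weighted scoring sheet in Python. This is only a sketch, not what we actually did on site. The criteria, the weights and every score below are made-up assumptions for illustration, so plug in your own.

```python
# A minimal decision-matrix sketch. All criteria, weights and scores are
# hypothetical illustrations, not figures from the engagement described above.

# How much each criterion matters (weights sum to 1.0).
WEIGHTS = {"effort": 0.3, "risk": 0.4, "long_term_health": 0.3}

# Scores run from 1 (worst) to 5 (best) for each option and criterion.
OPTIONS = {
    "repair existing domain": {"effort": 2, "risk": 1, "long_term_health": 2},
    "migrate to new server": {"effort": 3, "risk": 2, "long_term_health": 3},
    "nuke and start fresh": {"effort": 3, "risk": 4, "long_term_health": 5},
}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted sum of one option's scores."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank_options(options, weights=WEIGHTS):
    """Return option names sorted from highest to lowest weighted score."""
    return sorted(
        options,
        key=lambda name: weighted_score(options[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_options(OPTIONS):
        print(f"{name}: {weighted_score(OPTIONS[name]):.2f}")
```

The numbers themselves matter less than the exercise: writing the scores down makes it a lot harder to keep pretending that the option you are emotionally attached to is the best one.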

Always take the long view when analysing the possible results of your actions, and resist the urge to “band-aid” things (this is true of your initial planning as well as your reaction to the go/no-go question). Wherever possible, have your big IT plans validated by a dispassionate third-party resource, just to be sure that you haven’t missed the obvious and, maybe, not-so-obvious flaws in the plan. This is especially true when you are dealing with things outside of your experience (like a migration to a new O/S)! And, finally, realize that you truly do only get what you pay for. There is an old saying in IT that is oh so true: “you can have good, fast and cheap – pick any two”. If your budget is such that you really DO have to cut corners then you need to be extremely aware of what the impact is going to be on your end results. You can actually save money overall if you spend a bit more up front, rather than shooting yourself in the foot and then spending a lot of time and money cleaning up after the fact.