This is a cautionary tale and one I urge you to take to heart.
When you purchase a server, or any other piece of critical IT kit, you generally get a chance to purchase warranty coverage or extended warranty coverage. In many cases the vendor will provide a basic Next Business Day (NBD) warranty that probably ensures you’ll have some sort of response by the next business day (not necessarily the actual next day; if your problem occurs on a Friday you probably won’t have a response until Monday). What many of the vendors may not tell you is that the parts supply might not be all that great for NBD coverage; in other words, an NBD warranty does not necessarily guarantee that a smiling service tech will show up the next day with the needed part in hand. All of that depends on parts supply and how the parts are “tagged” for warranty coverage.
Case in point: we (itgroove) are situated in Victoria, the provincial capital of British Columbia. We are a “government town” and most of the Tier 1 vendors (Dell, IBM, HP, Lenovo) have hefty parts stocks locally to support government installations. I had a situation with a customer where one of their “critical” Tier 1 servers blew a motherboard, and the server was covered only by an NBD contract. The server could NOT be down for any length of time, so I chatted with the Tier 1 support guys about options. It turned out that there were full spares kits for the server type on hand in Victoria, but they were tagged for “4 hour critical response” contracts only. Parts for NBD contracts would have to be ordered in from wherever, even though the actual required parts were sitting in a warehouse about 2.5 km from my customer’s location. I was told that it could take up to 3 days for parts to show up, depending on what was required. Worse, when I inquired about buying parts (as in, I’ll just buy the motherboard; it can’t be that expensive a part) the lead time was 5 to 7 days, and that was only if the part was available for sale!
As the customer needed to be back up right now, we ended up buying an uplift on their warranty to 4 hour critical support coverage, and the nice tech arrived with the full spares kit within an hour and a half of the support call being placed.
The bottom line here is that you really, really, really have to think about what your allowable downtime window is for a failed server. If you can live without the services provided by a particular server for 24 to 48 hours, then an NBD contract might be okay. If you have a way to “work around” the failed server for a similar period of time, then that is okay as well. But if you are like my customer and you absolutely cannot live for more than a couple of hours without those services, then you have to factor the cost of the warranty uplift into your cost calculations whenever you are looking at purchasing a critical piece of kit for your infrastructure. And if you really, really, really cannot afford any downtime, then you need to be looking at ways to mitigate the problem before it even happens. That might mean having clustered or replicated servers, it might mean using Cloud services instead of on-premises services, or it might mean having a sophisticated backup/recovery system in place that can get you going again in a very short period of time. In short, it means you have to invest dollars up front to ensure you don’t lose many more dollars because of a system failure in the future.
In my customer’s case the cost for the uplift was $2600; the potential cost to them in lost business from being down for 24 or more hours was many times that amount, so it only made sense to cut the PO. But this is a decision that should have been made at purchase time and not in the heat of the moment. Knowing that they could not afford the downtime, they should have planned for this situation from the start.
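If you want to put rough numbers on that decision before the box fails, here is a minimal Python sketch of the arithmetic. The $2600 uplift comes from the story above; the 24-hour and 4-hour outage figures are illustrative assumptions about how long each contract might leave you down, and the function names are my own inventions, not anything a vendor provides.

```python
def breakeven_hourly_loss(uplift_cost, slow_hours, fast_hours):
    """Hourly downtime cost above which the faster contract pays for itself."""
    hours_saved = slow_hours - fast_hours
    if hours_saved <= 0:
        raise ValueError("the faster contract must restore service sooner")
    return uplift_cost / hours_saved


def uplift_is_worth_it(uplift_cost, hourly_loss, slow_hours, fast_hours):
    """True if the downtime avoided costs more than the warranty uplift."""
    return hourly_loss * (slow_hours - fast_hours) > uplift_cost


UPLIFT_COST = 2600.0    # from the story above
NBD_HOURS = 24.0        # assumption: optimistic outage length under NBD
CRITICAL_HOURS = 4.0    # assumption: outage length under 4 hour coverage

print(breakeven_hourly_loss(UPLIFT_COST, NBD_HOURS, CRITICAL_HOURS))  # 130.0
```

Under those assumptions, the uplift pays for itself once an hour of downtime costs you more than $130. Plug in your own outage estimates and hourly loss; if parts really do take 3 days to arrive under NBD, the break-even figure drops even further.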
And that is the takeaway from this short discussion. You need to plan for the possibility and have your mitigation process in place up front. If you can’t afford downtime then you have to spend up front to put the “insurance” in place so that you are covered when something actually does go “boom”. This is true whether you are a multi-national or a small, single-server shop; if your business has a critical reliance on the box then you had better have a plan and coverage in place before it fails. Trying to pull it all together after the box has failed is just asking for trouble.