Establishing a server triage policy

When a crisis or full-fledged disaster happens, every department will argue that it needs to be brought back online first. How do you choose? Here are some guidelines for helping you make the best decision.

The word "triage" is usually associated with hospitals and battlefields. It refers to a disastrous situation in which many people need medical attention, but there aren't enough doctors or medical supplies to go around. The victims are therefore sorted according to whose injuries are the most serious or according to who could benefit most from immediate treatment.

Obviously, this isn't a situation that you want to find yourself in. However, triage is also sometimes necessary in information technology. Imagine that the building that you work in was destroyed by fire, flood, tornado, runaway bulldozer, bratty kid, or some other destructive force. If you're prepared, you have a backup data center that has instantly taken over and the company is still online. If not, it's up to you to rebuild the network in an alternate location.

The problem is that rebuilding a network from scratch is a time- and resource-intensive process. It's impossible to just plug in a few computers and resume operations as though nothing ever happened. Instead, you need to prioritize the rebuilding process in terms of which systems are the most critical to the business.

Plan before you have to act

Before I explain how to make that determination, I want to talk about why this is so important. I own several different businesses and have had disasters occur on two separate occasions. In one instance, my business's e-mail server caught fire. This meant that I was not able to receive or answer customer questions, and I was unable to receive notices of the orders that came in from my Web site.

The other disaster that occurred involved my domain registration expiring. While this was not a physical disaster, there were problems with reestablishing the domain name. My site was down for almost a week.

In both instances, my business was either crippled or completely shut down for a few days. During this time, I was losing a considerable amount of money each day because customers were unable to order products from my Web site or I was unable to respond to customer questions. Since this was a small business with a slim profit margin, a few more days of being offline could have caused the business to go bankrupt.

Even when everything was repaired and back to normal, there was long-term damage to the business. The customers who were lost during that time will never come back. I also lost my position within the search engine rankings because the search engine spiders could not locate my site. It took about three months after getting everything back online for business to return to normal.

My point is that unless you respond quickly and effectively, a disaster can cause your company to close its doors forever. It is therefore critically important to make good decisions as to what steps will be the most effective in getting your company back online.

Prepare the plan

Regardless of the type of company you work for, step number one should be finding a new place to set up shop. If your building isn't physically damaged, you might be able to skip this step. Otherwise, consider using another property that your company already owns or leases, such as a warehouse or a branch office. This will save money and time because you won't have to search for a new piece of property or waste time signing a lease.

Once you have a location to do business in, have the phone company reroute your telephone and Internet service to the new location. It has been my experience that this can be done within a few hours if the phone company understands that it's an emergency.

When connectivity has been established, it's time to start setting up some servers. This is where things get tricky. If your old servers have been destroyed, then you won't have a choice but to buy new hardware. However, it can take weeks to get a check from the insurance company, so you will likely be limited to the amount of cash that you have on hand, which probably won't be enough money to replace everything.

For example, if you paid $30,000 each for fifteen servers, then it would cost you $450,000 to replace them all. If you don't have that kind of cash, think about which servers are the most critical and determine the minimum amount of computing power that could keep their services running at a minimal level of functionality until you can get real servers.

You might decide that five of those fifteen servers are really critical to keeping the business's doors open, and although the other ten are important, they don't necessarily have to be available today. You might also discover that while those servers run best on quad-processor boxes, you can run the critical services on a single-processor box in a pinch (with obviously decreased performance).

After performing this assessment, you might determine that rather than spending $450,000 to replace all of your server hardware, you can spend $15,000 on five high-end PCs and configure them to act as temporary servers until you can buy replacement hardware.
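The arithmetic behind this assessment can be sketched in a few lines of Python. The dollar figures and server counts come from the example above; the variable and function names are made up for illustration, and the per-PC figure is simply the $15,000 budget divided across the five critical servers.

```python
# Figures from the example in the text; all names here are illustrative only.
SERVER_COUNT = 15            # servers lost in the disaster
COST_PER_SERVER = 30_000     # original price of each server, in dollars
CRITICAL_SERVER_COUNT = 5    # servers judged critical to staying open
TEMP_PC_BUDGET = 15_000      # total spend on high-end PCs as stand-ins


def full_replacement_cost(count, unit_cost):
    """Cost of replacing every server with equivalent hardware."""
    return count * unit_cost


def per_unit_budget(total_budget, count):
    """Amount available for each temporary machine."""
    return total_budget / count


full_cost = full_replacement_cost(SERVER_COUNT, COST_PER_SERVER)
pc_budget = per_unit_budget(TEMP_PC_BUDGET, CRITICAL_SERVER_COUNT)
deferred = full_cost - TEMP_PC_BUDGET  # spending that can wait for the insurance check

print(f"Replace everything now: ${full_cost:,}")
print(f"Temporary PCs:          ${TEMP_PC_BUDGET:,} (${pc_budget:,.0f} per machine)")
print(f"Spending deferred:      ${deferred:,}")
```

Plugging in your own counts and prices gives a quick sense of how much cash you need on day one versus how much can wait.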

Getting back online quickly with minimal functionality is important, but it is equally important to make sure that you bring the appropriate systems back online first. The million-dollar question is: How do you determine which systems should be brought back online first when everyone is screaming at you because they think that their systems are the most important?

Before you can bring anything business-related back online, you will need to get some infrastructure in place. Therefore, your top priority is to get at least one domain controller, a DNS server, and possibly a DHCP server back online. Beyond this, the decision-making process isn't quite so clear-cut.

I recommend planning ahead of time and getting upper management to give you a list of which systems take the highest priority in times of disaster. If a disaster has already happened, though, you won't have that luxury, and it will be up to you to make that decision.

What constitutes a critical system varies widely from company to company. However, if you want some general guidelines, I would bring a mail server online first so that you can communicate with your customers and employees and let everyone know that you are still in business. After doing so, I would bring online the systems that produce the most immediate income. By doing so, you keep the cash flowing and reduce the chances of the business closing its doors as a result of the disaster.

You can also narrow the decision-making process down a bit by deciding which systems are relatively unimportant. For example, it has been my experience that departments such as Human Resources and Marketing will often scream the loudest about needing to be brought back online, but often, their needs can be considered secondary to the immediate requirements of the business.
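One way to write such a priority list down ahead of time is as a simple sorted inventory. The following is a minimal sketch, assuming the general ordering suggested above (infrastructure, then mail, then income-producing systems, then everything else). The system names, tier constants, and daily revenue estimates are all hypothetical; your own list should come from upper management.

```python
from dataclasses import dataclass

# Tiers follow the general guidelines in the text; lower number = restore sooner.
TIER_INFRASTRUCTURE = 0  # domain controller, DNS, DHCP
TIER_COMMUNICATION = 1   # mail server
TIER_REVENUE = 2         # systems that produce immediate income
TIER_SECONDARY = 3       # important, but can wait


@dataclass(frozen=True)
class System:
    name: str
    tier: int
    daily_revenue: float = 0.0  # hypothetical estimate, used only to break ties


def restore_order(systems):
    """Sort by tier, then by highest estimated daily revenue within a tier."""
    return sorted(systems, key=lambda s: (s.tier, -s.daily_revenue))


# Hypothetical inventory, listed in no particular order.
inventory = [
    System("hr-portal", TIER_SECONDARY),
    System("order-processing", TIER_REVENUE, 20_000),
    System("mail", TIER_COMMUNICATION),
    System("dns", TIER_INFRASTRUCTURE),
    System("web-store", TIER_REVENUE, 50_000),
    System("domain-controller", TIER_INFRASTRUCTURE),
    System("marketing-cms", TIER_SECONDARY),
]

for position, system in enumerate(restore_order(inventory), start=1):
    print(f"{position}. {system.name}")
```

Keeping the list in a form like this makes it easy to print and store offsite, so the order is already settled before anyone starts arguing for their own department.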

If you do decide to use high-end PCs in place of real servers to quickly bring the most critical systems online with minimal functionality, I recommend going to a mom-and-pop computer store rather than to a large retail chain. You will likely get a better price and faster service, and you will be able to completely determine the specs for the machines that you are buying. Sure, you can pick up the phone and custom order a machine from Dell or Gateway, but you won't get same-day service. On the other hand, a small, independently owned computer shop will likely jump at the opportunity to get a $15,000 order and will probably bend over backwards to get you the hardware that day and to help you any way that they can.

Once you have the replacement hardware, it's just a matter of installing operating systems and restoring backups. You might also have to do a little reconfiguring to compensate for differences in disk structure or hardware capabilities. In addition, I recommend disabling any services that aren't absolutely critical so that you can reserve the temporary system's limited resources for your most critical applications.

When your most critical systems are up and running (even if they're at limited capacity), you can begin the process of rebuilding everything else. This means coordinating efforts with your normal hardware vendor, your insurance company, and whoever is repairing the damage to the old facility.
