Lock IT Down: Five not-so-common tips for disaster recovery

Several uncommon tips can supplement the formal disaster recovery measures already in place


To recover successfully from a disaster, you must perform regular backups, use a fault-tolerant hard disk array, and have a backup power supply. But there’s a lot more to being ready for a disaster than that. Here, I’ll explain some commonly forgotten techniques you can use to prepare.

Spare hardware
Although server hardware is more reliable than it’s ever been, a hardware failure somewhere down the road is inevitable. Servers are constantly working, they’re constantly subjected to the heat given off by the processors, and the hard disk is constantly grinding away. Parts eventually wear out or burn out. Because you never know when a machine is going to die, or which component will cause the failure, it’s a good idea to be prepared for a variety of hardware failures.

Network card failures
One of the components that most commonly fails is a network card. Normally, replacing a network card is no big deal. But doing so requires taking the server down, removing the cover, replacing the card, loading the drivers, and configuring TCP/IP for the new card. This process can take 15 to 20 minutes to complete after you’ve diagnosed the problem, assuming that you have a spare network card on hand. During this time, your users can’t access the server, and that could cause a variety of problems. However, you can do a few things to significantly reduce the server’s downtime.

If you have a free slot in the server, I recommend installing a spare network card. Configure the card (all but the IP address) and then disable it. Your server will run as if it only contains one network card. If that card ever does fail, all you’ll have to do is move the cable to the new card, enable the new card, disable the old card, and set the IP address. You could have the server back online within four or five minutes. This procedure saves a considerable amount of time because you don’t have to remove the server’s cover or interact directly with the hardware in any way other than plugging in a network cable.

If you don’t have room for a second network card in your system, you can still reduce the time it takes to recover from a network card failure. Make sure that you have a spare network card on hand, but not just any network card. Make sure that the spare card is identical (same brand and model) to the network card that’s currently in the server. Then, during a network card failure, you can simply swap cards and be done. You won’t have to worry about loading drivers or reconfiguring any other settings.

Power supply failures
Another component that commonly fails is the power supply. The power supply is a metal box inside the computer that converts raw AC power into several different low-voltage DC outputs that the computer can use. Although it’s never a bad idea to keep a spare power supply on hand, there’s a better way of dealing with a power supply failure. Several companies make computer cases that contain a second power supply. These cases are designed so that you can plug both power supplies into the wall. If one of the power supplies fails, the other immediately takes over.

Clustering
Perhaps the best way of dealing with a hardware failure is through the use of clustering. You can employ different types of clustering, such as fault-tolerant clustering or load-balancing clustering. The clustering model I’m talking about, though, actually involves having two servers that are connected to each other through a dedicated network link. This link is used by each of the two servers to monitor the status of the other server. If one server fails, the other server instantly takes over. What makes this possible is that both servers share a common hard disk array; the two machines aren’t working with copies of a set of data but are actually sharing the exact same data.
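To make the monitoring idea concrete, here is a minimal sketch of the heartbeat logic a standby node might run. It is an illustration only, not how Windows 2000 Advanced Server implements clustering. The class name, the method names, and the three-second timeout are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks heartbeats received from the peer server over the dedicated link."""

    timeout_seconds: float       # hypothetical threshold, not from the article
    last_heartbeat: float = 0.0  # time at which the peer was last heard from
    active: bool = False         # True once this node has taken over

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a heartbeat arrives from the peer."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Take over if the peer has been silent longer than the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout_seconds:
            self.active = True
        return self.active


# Times are plain numbers (seconds) so the example is deterministic.
monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat(now=10.0)
still_standby = monitor.check(now=12.0)  # peer heard from 2 seconds ago
took_over = monitor.check(now=14.5)      # peer silent for 4.5 seconds
```

In this sketch the standby stays passive while heartbeats keep arriving and promotes itself once the silence exceeds the timeout. A real cluster also has to make sure both nodes never write to the shared array at the same time, which this sketch does not attempt.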

Clustering is wonderful. As long as you use a fault-tolerant hard disk array and uninterruptible power supplies, a clustered server comes close to guaranteeing 100 percent uptime, because it can survive almost any single failure. Clustering’s primary downside is cost. Clustering takes a lot of hardware, and hardware costs money. This type of clustering also requires you to use Windows 2000 Advanced Server, which is more expensive than the standard version of Windows 2000 Server.

A spare server
If a full-blown cluster isn’t in the budget, you’re not totally out of luck. You can still minimize your network’s downtime. As you saw earlier, keeping spare parts on hand is very effective. It’s smart to keep spare memory, a spare video card, etc., on hand in case a component goes bad. A company I once worked for took this idea a step further with an unusual approach to server failures.

This company’s IT staff initially dealt with a server failure just as anyone else would: by quickly trying to determine the cause of the failure. If the failure was caused by something obvious, they would swap out the part and get the server back online. However, if they couldn’t determine the cause of the problem within five minutes, they turned to a spare server that they kept on hand. This server had exactly the same hardware as the other servers in the organization. Therefore, they could simply perform a quick hard-drive transplant from one server to the other. This would get the server back online within a matter of minutes. The support staff was then free to work with the failed server in a more leisurely manner to determine the cause of the problem. Because the network was up and running, they could take their time and do the repair correctly, rather than slapping a Band-Aid on the problem just to get the network back up.

If this sounds like a neat trick to you, remember that to pull the trick off, the hardware must be virtually identical. The two servers can have different amounts of memory, but the other components must be the same because when you transplant the hard drives from the old system to the new system, Windows has no idea that it’s running on a different computer. If the hardware is different, Windows will attempt to load incorrect drivers for the hardware. Sure, you could always load alternate drivers for the new hardware, but doing so takes time. Besides, you’ll probably want to move the hard drives back into your original system when it becomes available again, and it would be a shame to have to reconfigure the drivers twice.
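One way to check ahead of time that a spare really matches is to compare hardware inventories for the two machines. The sketch below assumes you have already recorded each server’s components in a simple dictionary. The component names and model strings are hypothetical, and memory is skipped because the amount of memory is allowed to differ.

```python
def hardware_mismatches(primary: dict, spare: dict, ignore=("memory_mb",)) -> dict:
    """Return {component: (primary_value, spare_value)} for every difference.

    Components listed in `ignore` are skipped, since the two servers
    may have different amounts of memory.
    """
    diffs = {}
    for component in sorted(set(primary) | set(spare)):
        if component in ignore:
            continue
        if primary.get(component) != spare.get(component):
            diffs[component] = (primary.get(component), spare.get(component))
    return diffs


# Hypothetical inventories; the model names are invented for illustration.
primary = {
    "motherboard": "ModelA",
    "nic": "NIC-100",
    "raid_controller": "RC-1",
    "memory_mb": 1024,
}
spare = {
    "motherboard": "ModelA",
    "nic": "NIC-200",
    "raid_controller": "RC-1",
    "memory_mb": 512,
}

mismatches = hardware_mismatches(primary, spare)
```

Here the only reported difference is the network card, which is exactly the kind of mismatch that would leave Windows loading the wrong driver after a drive transplant.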

As you can see, the idea behind these techniques is that you can’t prevent a failure from happening. The best thing you can do is anticipate a failure and be ready to get the server back online quickly and repair the problem later.

Documentation
A critical failure may occur that you’re simply not prepared to deal with. Even if you have lots of spare hardware and a hard drive with a good configuration on hand, something can still go wrong. A hub could lose power, a router’s routing tables could get messed up, or your data could get trashed.

In such situations, it’s important to have good network documentation available. It would be easy to write an entire book on documenting a network, but space doesn’t permit me to go into that much detail. Let’s talk basics: What’s important in network documentation?

For starters, your network documentation should be easily accessible. I have a friend whose network documentation includes a million file folders scattered all over her office. If a failure occurred and she wasn’t available to deal with it, the others in her office would have a tough time finding the necessary documentation. I recommend storing all of the documentation in a single binder.

Make sure you have contact information for your entire technical support staff, the building maintenance staff (in case you have to get into the attic or under the crawl space or something), and all of your hardware and software vendors. Believe me, it’s very empowering to have all of this information in one place.

Now, it’s time to talk computers. I recommend having a network diagram. Your diagram doesn’t have to show individual PCs, but it should show the servers. I also recommend showing hubs, routers, subnets, etc. You might also consider diagramming sites and dial-up connections. The more detail your diagrams have, the more they will help you during a disaster.
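If you also want a machine-readable copy of the diagram to print alongside the paper one, a small data structure is enough. The sketch below is one possible layout, not a standard format. Every device name, address, and helper function in it is made up for illustration.

```python
# Hypothetical inventory; every name and address below is invented.
network = {
    "devices": {
        "router1": {"type": "router", "subnet": "10.0.0.0/24"},
        "hub1": {"type": "hub", "subnet": "10.0.0.0/24"},
        "fileserver": {"type": "server", "subnet": "10.0.0.0/24"},
        "mailserver": {"type": "server", "subnet": "10.0.0.0/24"},
    },
    # Each link is a pair of device names joined by a cable.
    "links": [
        ("router1", "hub1"),
        ("hub1", "fileserver"),
        ("hub1", "mailserver"),
    ],
}


def attached_to(net: dict, device: str) -> list:
    """Return the names of devices cabled directly to the given device."""
    neighbors = set()
    for a, b in net["links"]:
        if a == device:
            neighbors.add(b)
        elif b == device:
            neighbors.add(a)
    return sorted(neighbors)


def undocumented(net: dict) -> list:
    """Return device names that appear in a link but have no device entry."""
    known = set(net["devices"])
    return sorted({name for link in net["links"] for name in link} - known)


hub_neighbors = attached_to(network, "hub1")  # what to check if hub1 loses power
missing = undocumented(network)               # gaps in the documentation
```

Asking what is attached to a failed hub is the same question you would answer by tracing lines on the paper diagram. The second helper flags devices that were cabled in but never written up.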

Conclusion
Preparing for a disaster can often be a disaster in and of itself. The planning, the documenting, the coordination: It’s all too often a massive guessing game. With a little forethought, however, you can avoid this mess. The most important thing to remember when you’re justifying the time spent on this preparation is that a good disaster recovery plan will have your operation up and running far faster should a disaster strike. And when the bottom line is key, production must continue.
