No one likes to think about the possibility of some cataclysmic disaster occurring on their server. However, servers are nothing more than machines and, like all machines, are prone to mechanical failures. The only difference is that when a server breaks down, the stakes tend to be a bit higher. After all, you’re not going to lose millions of dollars’ worth of data if your car breaks down or if your refrigerator goes out. But that’s not necessarily the case with servers. If a server fails, at the very least, your company’s electronic assets could be unavailable for a while. At worst, those assets could be gone forever. Since there’s really no way to prevent individual server components from failing, the trick is to build a level of fault tolerance into the server. This will make it much easier to recover from a crash. In this Daily Drill Down, I’ve compiled a summary of some of the disaster prevention techniques that have worked best for me over the years. As you’ll see, some of these techniques are brand-new, while others have stood the test of time.

Hard disk arrays
When many people think of server fault tolerance, they naturally think of hard disk arrays. That’s because hard disk arrays are perhaps the most highly publicized fault-tolerance method in Windows NT and Windows 2000 environments, possibly because the Windows server operating systems contain built-in support for arrays. Not all hard disk arrays are created equal, though. There are several different fault-tolerant implementations.

You’re probably already familiar with the basic hard disk array implementations, such as striping, striping with parity, and mirroring. However, there’s a newer type of RAID array called RAID 10, which combines striping with mirroring (in the standard definition, it uses no parity). To understand the benefits of this technology, you must first understand the advantages of the three basic implementations I mentioned.

Striping offers no fault tolerance. The only advantage to striping a drive is that data can be read and written much more quickly because each hard drive involved in the stripe set only contains a fraction of each file. Striping with parity offers high speed similar to that achieved with striping but has a degree of fault tolerance built in. If any one hard disk in the array were to fail, the other drives contain enough information to fill in the missing pieces. The array will continue to function until the failed drive is replaced, at which time the contents of the old hard disk will be rebuilt onto the new drive from the data and parity on the surviving drives.
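To make the "fill in the missing pieces" idea concrete, here is a minimal sketch of how XOR parity can rebuild a lost block. It assumes three data drives and a single parity block; the block contents and the function name are invented for illustration. Real striping with parity implementations rotate parity across the drives and work on much larger blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, one per data drive (hypothetical contents).
data_blocks = [b"ACCT", b"DATA", b"2001"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block
# from the two surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # prints b'DATA'
```

Because XOR is its own inverse, the same routine that computes the parity also recovers any single missing block. It cannot recover two missing blocks, which is why this kind of array tolerates only one failed drive at a time.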

Mirroring works differently from striping. In a mirroring environment, any data that’s written to one hard disk is also written to the other. If a hard disk goes bad in a mirror set, the good hard disk still contains a copy of all of the data and can pick up the slack until the bad hard disk is replaced.

Now that you’re familiar with the more common types of hard disk array implementations, let’s take a closer look at RAID 10, which combines striping with mirroring. The motivation is this: when a hard disk goes bad in a striping with parity environment, the other hard drives contain enough data for the server to continue functioning, but every read that touches the missing disk must be reconstructed from parity, so the server’s performance slows to a crawl compared to normal. In a mirror environment, if a hard drive fails, the other drive picks up the slack and there is little noticeable decrease in performance. The catch is that a simple mirror involves only a single pair of hard drives, so it gains none of the speed of striping.

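In a RAID 10 environment, the drives are grouped into mirrored pairs, and data is striped across those pairs. The server gets the speed of striping, and every block of data exists on two drives. If any one hard disk in the array goes bad, its mirror partner keeps serving the data, with no parity to recalculate. This means that in a single-disk failure, no data is lost and performance barely suffers. The array can even survive several failures as long as no mirrored pair loses both of its drives. The trade-off is capacity: half of the raw disk space goes to the mirror copies.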
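As a rough illustration of which failures such an array can ride out, the sketch below assumes the standard RAID 10 layout (data striped across mirrored pairs). The disk names and the helper function are hypothetical; a real controller tracks this in firmware.

```python
def raid10_survives(mirror_pairs, failed_disks):
    """Return True if every mirrored pair still has at least one working disk."""
    failed = set(failed_disks)
    return all(any(disk not in failed for disk in pair) for pair in mirror_pairs)

# Hypothetical four-disk array: data is striped across two mirrored pairs.
pairs = [("disk0", "disk1"), ("disk2", "disk3")]

print(raid10_survives(pairs, ["disk1"]))           # True: one disk lost
print(raid10_survives(pairs, ["disk1", "disk2"]))  # True: one disk lost from each pair
print(raid10_survives(pairs, ["disk2", "disk3"]))  # False: a whole pair is gone
```

The last case shows the limit of the scheme: the number of failed disks matters less than whether both copies of the same data are lost.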

Traditionally, RAID arrays have been expensive because of the number of hard drives involved and because of the cost of the necessary RAID controller. However, at the fall COMDEX in Las Vegas, I saw a demonstration of a RAID 10 controller that was also designed to reduce costs. The controller itself hadn’t actually gone on the market yet but was intended to sell for around $500. The other benefit was that the controller was designed to use IDE hard drives, which are much cheaper than the SCSI drives that are usually used for RAID implementations. In spite of using IDE hard drives, the machine’s performance was similar to that of a machine using SCSI drives.

Clustering
If you’re working with a Windows 2000 Advanced Server environment, another way that you can protect your servers is to implement clustering. You could easily write a book on clustering, and space doesn’t permit me to go into great detail. I can tell you that there are two types of clustering that are supported by Windows 2000 Advanced Server.

One type of clustering is called Network Load Balancing. This type of clustering involves multiple machines running a common application, which makes it perfect for Web servers. When a client tries to run the clustered application (or access the Web site), the Network Load Balancing service spreads incoming requests across the servers in the cluster so that no single server carries the whole workload. If a server in the cluster happens to crash, the remaining servers detect the failure, and clients are routed to them until the failed server becomes available again.
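The idea of routing clients around a failed server can be sketched in a few lines. This is an illustration only, not the algorithm the Network Load Balancing service uses: the hash-based selection, the server names, and the function name are all assumptions chosen to keep the example short.

```python
import hashlib

def pick_server(client_ip, servers, healthy):
    """Map a client to one of the currently healthy servers using a stable hash."""
    candidates = [s for s in servers if s in healthy]
    if not candidates:
        raise RuntimeError("no healthy servers left in the cluster")
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return candidates[int.from_bytes(digest[:4], "big") % len(candidates)]

servers = ["web1", "web2", "web3"]  # hypothetical cluster members
healthy = set(servers)
client = "192.0.2.10"               # documentation-range address

first_choice = pick_server(client, servers, healthy)

# Simulate a crash of the server that was handling this client.
healthy.discard(first_choice)
second_choice = pick_server(client, servers, healthy)

print(first_choice, "->", second_choice)
```

When the crashed server returns, adding it back to the healthy set makes it eligible for clients again, which mirrors the recovery behavior described above.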

The other type of clustering is the type you’d use to protect mission critical data servers. This type uses two identical servers that function as a single server. In this implementation, if either server were to fail, the other server would keep working. The only things the two servers share are a network link that they use to communicate with each other and a common hard disk (usually a RAID array that supports striping with parity). The downside to this type of clustering is that it can be very expensive because it requires very specific types of hardware. Not just any server will do. In both types of clustering, the clustering service has some overhead, so the machines won’t perform quite as quickly as they would if they didn’t have this extra overhead.
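The two-server arrangement can be pictured with a small failover sketch. It assumes the servers exchange periodic "heartbeat" messages over their shared network link and that the standby takes over the shared disk when the primary goes quiet. The node names, the timeout, and the function are hypothetical and are not taken from the Windows cluster service.

```python
def active_node(last_heartbeat, now, timeout, primary="node_a", standby="node_b"):
    """Return the node that should own the shared disk at time `now` (seconds)."""
    if now - last_heartbeat[primary] <= timeout:
        return primary
    if now - last_heartbeat[standby] <= timeout:
        return standby
    raise RuntimeError("both nodes have stopped sending heartbeats")

# Both nodes were heard from recently, so the primary keeps the workload.
print(active_node({"node_a": 100.0, "node_b": 101.0}, now=102.0, timeout=5.0))  # node_a

# The primary has been silent for 10 seconds, so the standby takes over.
print(active_node({"node_a": 100.0, "node_b": 109.0}, now=110.0, timeout=5.0))  # node_b
```

Because both servers attach to the same disk, the standby can pick up exactly where the failed server left off, which is also why the shared disk itself is usually a fault-tolerant array.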

OS backups
One of my personal favorite disaster recovery techniques works very well for offices that are on a budget. As you’ve no doubt figured out, hard disk arrays can get a bit pricey, and clustering can be very expensive. But there’s still a way for smaller businesses to implement a degree of fault tolerance.

One of the best ways of implementing low-budget fault tolerance is to create an image file of the operating system portion of each of your servers and of each application installed on it. The best method that I’ve found is to use Symantec’s Norton Ghost 2001. Ghost 2001 can create an image file based on an entire hard disk or an individual partition.

In the network that my husband and I use for our business, we installed a CD-RW drive in each of our servers. We then organized our servers so that the network operating system (Windows 2000 or Windows NT) and all of the applications are stored on a single hard disk or partition. Any data that might be stored on the server is stored on a separate hard disk or partition.

As with most networks, we run a nightly backup of our data. We’ve found that although nightly backups work well for data, they aren’t so good for operating systems and applications. That’s because although you can back up an operating system, the operating system itself must be running during the backup process. This means that some system files will inevitably be open during the backup and will therefore not be backed up.

Even if you could back up every system file, if you ever have to restore the backup, the restore process could take forever. That’s because before you can even think about restoring the backup, you must install Windows on the failed server. Once Windows is installed, you’ll have to install things like the device driver for the tape drive and the backup program. By the time that’s all said and done, it could take all night to get your server back online. I personally don’t like the idea of wasting a perfectly good night on a restore process, especially since I need our servers to be functioning 24/7.

The reason Ghost 2001 works so well is that it runs from outside the Windows operating system. At the time you install Ghost 2001, it creates a boot disk, which contains a wide variety of CD-RW drivers and the Ghost program. The disk is designed to be universally compatible, so you can use the same disk on each server. We had to do some manual tweaking on the disk to change the DOS version from PC DOS to the version that ships with Windows 98 because of some memory problems, but after doing so, the disk worked on all of our servers even though they were all running different operating systems, file systems, and had different brands of CD-RW drives.

Once we had a functional boot disk, we booted each server from it and created an image file of the disk or partition containing the Windows operating system and our applications. We had Ghost place the image on CD-R disks. Although the servers contained too many files to fit on a single CD-R, the Ghost program was smart enough to span the disk images across multiple CDs. On each server, it took between two and five CDs to get the job done. Without compression, it would have required more CDs, but the Ghost program offers several different levels of data compression.
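If you want a rough idea of how many CDs an image will need before you start, the arithmetic is simple. The sketch below assumes 650-MB CD-R media and uses made-up image sizes and compression ratios; the ratio you actually get depends on your data and on the compression level you choose.

```python
import math

def discs_needed(image_mb, compression_ratio, disc_capacity_mb=650):
    """Estimate how many discs a spanned image needs.

    compression_ratio is compressed size divided by original size
    (1.0 means no compression). All inputs here are hypothetical.
    """
    compressed_mb = image_mb * compression_ratio
    return math.ceil(compressed_mb / disc_capacity_mb)

print(discs_needed(3000, 1.0))  # 5 discs with no compression
print(discs_needed(3000, 0.6))  # 3 discs if the image shrinks to 60% of its size
```

This ignores the small amount of space the bootable portion of the first CD uses, so treat the result as a lower bound.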

The restore process is even easier than the backup process. During the backup process, Ghost asks you if you want to make the CD bootable. If you tell it Yes and you leave your Ghost boot disk in the drive, then the first CD in the set that you’re creating will be bootable. You can then simply insert the CD into the drive and boot from it. Your server will automatically load Ghost from the CD, and the first file from the backup set will also be right there on the CD to get you started. Because the restore process works from outside of Windows, you don’t have to go through the trouble of installing an operating system and all of the stuff that goes with it. What’s even better is that when the restore process completes, not only will you have a fully functional operating system, but all your device drivers and applications will also be installed. You’ll have a fully restored server in a matter of minutes. My husband and I attempted a test restore on a Windows 2000 server with about six common applications installed. The restore process took about half an hour. That’s pretty good, considering that when we initially installed the Windows 2000 operating system and the applications, the process took about four hours.

For our organization, Ghost 2001 has been terrific, but before you get too excited, keep its limitations in mind. Although Ghost 2001 supports file systems and operating systems that are traditionally used for servers, it’s basically designed for workstation-type environments. This means that it doesn’t support disk arrays. If your server doesn’t use a RAID array, Ghost 2001 will probably work fine for you, but if you do use RAID arrays, you’re better off sticking to one of the other protection methods I discussed.

I should also mention that when I had to change the DOS version because of memory problems, I found that doing so is a documented procedure. The Symantec Web site explains the type of errors this corrects and describes the procedure.

In this Daily Drill Down, I’ve reviewed some of my personal favorite disaster-prevention and recovery techniques. While some of these techniques have been around for a while, others are brand-new and should be tested before being implemented in a production environment.