
Learn these lessons for effective disaster recovery

Know what steps to take in a disaster recovery effort.


Are you prepared for a full system recovery? When we think about disaster recovery, most of us make a lot of assumptions. We cover the big picture, but often we don’t give much consideration to the details. We back up regularly and send tapes off-site. If something ever happens to the server, we just plan on pulling out the tapes and restoring it.

In a recent test, I was asked to recover a Compaq 1500 series server running Windows NT 4.0. The rules of the game were simple. The entire system had to be recovered from tape with the original hardware no longer available. In fact, the new target server was an HP machine. “Piece of cake,” I thought. Load the OS on the new server, load the backup software on the new server, and restore the server from tape. At least that was the original plan. Here are some of the lessons I learned.

Lesson 1: Know where the OS is located
The first lesson sounds so simple as to be laughable. But as I soon learned, the original server had the OS installed on a partition other than drive 0 partition 1. When the restore operation tried to recover the registry, it recovered it to the current system partition instead of the original system partition. The mismatch between the registry and the system files produced a quick Blue Screen of Death.

The second reason it’s important to know where the OS is located is Boot.ini. When you perform a full system recovery, the Boot.ini file is restored to its original state. That means it expects to find the OS on the same disk and partition as on the original server. As it turns out, most of the other partitions on a system don’t matter to the OS, but it’s critical that the system and boot partitions match those of the original server.
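To make the Boot.ini dependency concrete, here is a minimal Python sketch that reads the default= line of a Boot.ini file and pulls the disk, partition, and install directory out of its ARC path. The sample file contents and the function names (parse_arc_path, default_os_location) are illustrative assumptions, not taken from the server in the test, and the pattern handles only the common multi()/scsi() path forms.

```python
import re

# Matches ARC paths such as multi(0)disk(0)rdisk(0)partition(2)\WINNT.
# Only the multi() and scsi() forms are handled in this sketch.
ARC_PATTERN = re.compile(
    r'(?P<adapter_type>multi|scsi)\((?P<adapter>\d+)\)'
    r'disk\((?P<disk>\d+)\)'
    r'rdisk\((?P<rdisk>\d+)\)'
    r'partition\((?P<partition>\d+)\)'
    r'(?P<directory>\\[^="/\s]*)?',
    re.IGNORECASE,
)


def parse_arc_path(path):
    """Split one ARC path into its numeric fields and install directory."""
    match = ARC_PATTERN.match(path.strip())
    if match is None:
        raise ValueError(f"not a recognized ARC path: {path!r}")
    fields = match.groupdict()
    return {
        "adapter_type": fields["adapter_type"].lower(),
        "adapter": int(fields["adapter"]),
        "disk": int(fields["disk"]),
        "rdisk": int(fields["rdisk"]),
        "partition": int(fields["partition"]),
        "directory": fields["directory"] or "\\",
    }


def default_os_location(boot_ini_text):
    """Return the parsed ARC path from the default= line of Boot.ini."""
    for line in boot_ini_text.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip().lower() == "default":
            return parse_arc_path(value)
    raise ValueError("no default= entry found in Boot.ini text")


# Hypothetical Boot.ini contents, used only for illustration.
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(2)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(2)\WINNT="Windows NT Server Version 4.00"
"""
```

Calling default_os_location(SAMPLE_BOOT_INI) on this sample reports partition 2 and the \WINNT directory, which is exactly the kind of non-default location that tripped up the restore described above.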

Lesson 2: The server won’t boot if it doesn’t have the right drivers
What’s the first thing you see after you restore a Compaq backup to an HP server and reboot? You may see a very ugly blue screen complaining about your SCSI drivers or a message indicating that you don’t have any disk drives. It’s a bit of a paradox. You have to restore the system, but the new hardware usually won’t boot with the old drivers unless the hardware is identical. So how do you restore a system AND load new drivers if the system won’t boot?

After a great deal of research, I found a Microsoft TechNet article that discussed replacing a failed RAID controller. The article suggested replacing the old driver with the new one by renaming files. For example, you might rename cpqarray.sys to cpqarray.old and copy hparray.sys to cpqarray.sys. It’s sort of like pulling the tablecloth out from under the dinner plates. The problem with this technique is that not all controllers share the same registry information. In fact, unless you’re working with the same hardware vendor, they’re not likely to have any registry entries in common.

The solution to this dilemma turned out to be quite simple. Instead of trying to boot the system to hardware you know won’t work, boot to the NT setup. When prompted, give setup the correct driver for your SCSI controller and perform an UPGRADE installation. The upgrade will run NTDETECT and allow all of the proper drivers to load, including video and NIC drivers (which are likely to be different). Once you’re finished with the upgrade, you should be able to boot the restored system.

Lesson 3: Know your service packs
The final step in rebuilding the OS is to reload your service pack and hot fixes at the same levels the original server was running. Remember, we “upgraded” the OS and replaced perfectly good files with files from the original CD. Reloading your service packs and hot fixes should clear up any remaining OS anomalies.

Lesson 4: Don’t wait until it’s too late
This is the point in the recovery where you find out just how reliable your backups were. Were files open during the backup? If so, they probably weren’t backed up, unless you had an open file agent with your backup software. Did you use incremental backups or differential backups? Differential backups are much easier to recover because you need only the last full backup and the most recent differential. Because each differential repeats every change made since the full backup, they also reduce the risk of data loss by providing multiple copies of your data. The best way to know how good your backups are is to do a test run like the one we did. I highly recommend this. If nothing else, it will make things much smoother and faster if you ever have a real disaster.
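The difference between incremental and differential restores is easier to see when you list the tapes each one requires. The following Python sketch is a simplified model, not the logic of any particular backup product: the function name restore_chain and the tape labels are made up for illustration, and it assumes a schedule uses either incrementals or differentials after each full backup, never both.

```python
def restore_chain(history):
    """Return the tape labels needed for a full restore, in restore order.

    history is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup in history; nothing to restore from")

    last_full = full_positions[-1]
    chain = [history[last_full][0]]
    later = history[last_full + 1:]

    incrementals = [label for label, kind in later if kind == "incremental"]
    differentials = [label for label, kind in later if kind == "differential"]
    if incrementals and differentials:
        # Mixed schedules are outside the scope of this simplified model.
        raise ValueError("this sketch assumes one partial-backup type per cycle")

    if differentials:
        # A differential holds every change since the full backup,
        # so only the newest one is needed.
        chain.append(differentials[-1])
    # Each incremental holds only the changes since the previous backup,
    # so every one of them is needed, in order.
    chain.extend(incrementals)
    return chain


# Hypothetical one-week schedules for comparison.
INCREMENTAL_WEEK = [("Fri-full", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
DIFFERENTIAL_WEEK = [("Fri-full", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]
```

With these sample schedules, the incremental week needs all four tapes to be readable, while the differential week needs only the full backup and Wednesday's tape. If Wednesday's differential turns out to be bad, Tuesday's still gets you most of the way back.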

Disaster recovery tips
Here’s a summary of the valuable tips I learned during this process:
  • Tip #1: Just before you restore from tape, create a bootable floppy disk and an emergency repair disk. You never know when they will come in handy.
  • Tip #2: If you’re not sure where the system and boot partitions were located on the original server, recover just the Boot.ini file from tape to a temporary directory. It will tell you what disk and partition the OS is located on as well as the directory name where the OS is installed. Additionally, if more than one instance of NT is installed, it will indicate the default OS.
  • Tip #3: Store a print screen of Disk Administrator and a printout of Boot.ini off-site with your tapes. If possible, when you start to recover a server, duplicate the entire disk configuration. The target server should have the same number of logical drives. Partitions should be located on the same disks as the original system and be assigned the same drive letters. Whenever you change the disk configuration, make sure you update this information.
  • Tip #4: If you can, log off the server during the restore. This will reduce the number of files in use at the time of the restore. If you don’t (or can’t), you may receive errors when the current user profile is restored.
  • Tip #5: It’s a good idea to have your system and boot partitions on a standard SCSI controller without hardware RAID. You can use software RAID to provide fault tolerance. Using standard (natively supported) drivers for your boot and system drives makes recovery much easier.
  • Tip #6: Store the account name and password for the local administrator account off-site with your tapes. You’ll need to have a local administrator account on the server. If the server was a member of a domain and the domain controller is not available, having a local account is the only way you will be able to log on. If your password has changed since the backups were created, you will need to know the old password.
  • Tip #7: Document your current service pack level and all hot fixes applied to the server. Store this information off-site with your tapes. Applications can be fickle about the service pack they run. Disaster recovery is not the time to be testing an application with a new service pack.
  • Tip #8: Make a custom CD with your OS, service packs, hot fixes, and backup software on it. Store it with the tapes and the rest of your systems documentation off-site. Don’t forget to update it as your system changes.
  • Tip #9: Always maintain at least three complete sets of backup tapes off-site. When you need the tapes, it’s too late to find out that your backup software “missed” a file or two.
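The disk-layout advice in Tip #3 can be turned into a quick pre-restore check. The Python sketch below compares a documented layout from the original server against the layout you have built on the target. The data format (a drive letter mapped to a disk number and partition number) and the function name layout_mismatches are assumptions for illustration; you would fill in the values by hand from your Disk Administrator printout.

```python
def layout_mismatches(original, target):
    """Compare two disk layouts and describe every difference.

    Each layout maps a drive letter to a (disk, partition) tuple.
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    for letter in sorted(set(original) | set(target)):
        if letter not in target:
            problems.append(f"{letter}: missing on target server")
        elif letter not in original:
            problems.append(f"{letter}: not present on original server")
        elif original[letter] != target[letter]:
            old_disk, old_part = original[letter]
            new_disk, new_part = target[letter]
            problems.append(
                f"{letter}: original on disk {old_disk} partition {old_part}, "
                f"target on disk {new_disk} partition {new_part}"
            )
    return problems


# Hypothetical layouts transcribed from documentation.
ORIGINAL_LAYOUT = {"C": (0, 2), "D": (0, 3), "E": (1, 1)}
TARGET_LAYOUT = {"C": (0, 1), "D": (0, 3)}
```

Running layout_mismatches(ORIGINAL_LAYOUT, TARGET_LAYOUT) on these samples flags drive C as sitting on the wrong partition and drive E as missing, both of which are worth fixing before the first tape goes in.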

Summing up
The servers we attend to day in and day out are as individual as we are. Each one has its own quirks and whims. A good rule of thumb is that until you’ve recovered and tested the recovered server four times, you probably haven’t discovered all of the problems you’re likely to run into.


 
