What would you do if I announced that I intended to format the C partition on your production Microsoft Exchange server to run a quick test? Your first reaction might well be to grab the nearest blunt object and hit me on the head. Unfortunately, viruses, worms, incompatible patches, and other damaging events can have much the same effect on a server, often leading to the painstaking process of recovering Exchange from backups.
In this article, I'll show you how to protect your Exchange server with a method that recovers the system in 30 minutes! You may be thinking, “But I back up my Exchange database. Why would I even care about this?” An Exchange database is useless without the original Exchange server in the original domain environment, as you may be painfully aware. If the OS or the Exchange application were damaged to the point that the server was not recoverable, the data would be worthless because you couldn't just mount that database on any old Exchange server. Essentially, you'd have to create an exact replica of the original server and its environment to recover the Exchange database.
This complexity makes Exchange recovery an extremely tricky task. This article is meant to turn the tables on that process by showing you an alternative method of backup and recovery.
The official Microsoft method for Exchange recovery
If you’ve ever hired a Microsoft-certified consultant or worked in a large corporation that runs Exchange, you're probably familiar with, or already using, the official Microsoft method for Exchange disaster recovery. Unfortunately, this official method, based on maintaining a mirrored parallel universe of your Windows domain and Exchange infrastructure, comes with some sizable hurdles.
You must build a backup domain controller on your existing network, and then put that server in an isolated LAN and promote it to be the Primary Domain Controller. Then you must carefully and meticulously build an Exchange server from scratch on that isolated LAN (without making a single mistake in spelling or syntax) using the identical settings of your production environment. Only then can this parallel Exchange box be used in the event of a catastrophic Exchange production failure.
This procedure is complex, prone to error, time-intensive, and expensive. I’ve personally seen a highly paid consultant spend two weeks implementing and documenting this disaster recovery plan for my company. In the end, that same consultant could not replicate the environment using the documentation he wrote himself and finally gave up after a day. Without getting further into the details of this methodology, you don’t want to go down this road, because there is a better way.
The fast way for Exchange recovery
My personal experiences with Exchange disaster recovery made me refuse to accept that this is what I must live with to be able to recover Exchange. Believe me, I got plenty of flak for it from my colleagues and from the very same consultant who couldn’t read his own documentation (which he had expected us to use in case of a disaster). They insisted that Microsoft's method is the official way and is how all the big corporations do it.
At the time, in 2000, I was just beginning to use system imaging to start a large deployment of new workstations running Windows 2000 and all new applications. I soon began to wonder, why not use system imaging for servers, and Exchange in particular? It seemed a daunting challenge because these were high-end servers running complex SCSI RAID 5 configurations.
As it turns out, since hardware RAIDs are completely transparent to the OS and applications, Norton Ghost or PowerQuest Drive Image worked just as well on servers as they did on workstations. Once I managed to get this working, I built a test domain, complete with an Exchange server, and then took an image of the server's OS and application partition. After formatting the C drive (the data resided elsewhere), I managed to restore the server in less than 30 minutes.
I was confident that this approach would work to restore the basic Exchange server, but I wondered if the reimaged server would recognize a more recent database if additional e-mails were sent after the system image was taken. I tried this in the lab and found that the image-restored Exchange server would indeed mount a newer database with more recent data. So, even if I had made a month’s worth of updates to the database from everyday use, the Exchange server would come up and access all the old and new data. I realized I had come across an excellent disaster recovery procedure for these servers, and all that was needed was some fine-tuning to the process.
The following refinements are what I came up with:
- Exchange server design
- Disk-imaging methodology
- Backup methodology
- Recovery procedure
Let's take a closer look at each of these issues.
Exchange server design
To make this recovery method feasible, you must follow a fundamental design on the Exchange server. Data must reside on a separate partition (preferably a physically separate partition, but logically separate is okay) from the OS and the application partition. You don't want to have to image a gigabyte of OS and applications plus 100 GB of data on a single partition. In that scenario, you lose the granularity and convenience of being able to recover the OS and applications without affecting the data.
For existing servers that already have data mixed in with the OS and apps, you can add an additional storage device and move the data store and log files to separate partitions on the newly added device. Ideally, you should follow these guidelines for maximum safety, scalability, manageability, and performance.
OS and applications: Put these on the C partition.
Transaction logs: Put these on the D partition. Do not put transaction logs on the same physical device as the Exchange data store databases. If that device fails and you lose the database, you'll also lose the ability to recover data that was created after the tape backup from the Exchange transaction logs. I’ve seen teams that used this risky design pay dearly when they lost a whole day’s worth of data after a storage device failed. Their tape backups managed to restore everything up to the previous night, but losing a day’s worth of e-mails in a large corporation is potentially a job-threatening mistake.
Data store: Put the data store on its own physical drive or block-level storage device, such as a SAN running on Fibre Channel or iSCSI. Better yet, use Exchange 2000 and break the data store up into multiple chunks along departmental lines that you put on separate physical RAID partitions. The more separate physical partitions you use, the faster Exchange performs. A single RAID partition can seek only one item at a time, so serving several requests at once means jumping around, at a heavy cost to performance. Two physical devices can seek two items at once without jumping back and forth between tasks, and every physical partition you add gives you another simultaneous data seeker with a linear performance gain.
In practice, this data separation can be accomplished easily and cheaply on typical midsize servers with hardware RAID and six drives. Three pairs of drives can be configured for RAID 1 mirroring to create three physical partitions. The first physical partition can be broken down into logical C and D partitions for OS/apps and log files, respectively; the other two physical partitions can be set as the E and F partitions for housing multiple data stores. Higher-end solutions can use external SAN-based storage with a similar approach to RAID setup, which also opens the possibility of clustering your Exchange 2000 server. Do not just lump all six drives into one massive RAID 5 array and chop it up into multiple logical partitions.
While the single RAID 5 array is cheaper because only one drive is lost to redundancy, it comes at a heavy cost in seek performance: three independent mirrored pairs can service three seeks at once, whereas all six drives in a single array must act in unison and service only one, roughly a three-fold loss. Hardware and disks are so cheap now that the savings are nominal. In general, database workloads are best described as death by a thousand tiny requests, so simultaneous seek capability matters far more than the improved sequential transfer rates that RAID 5 offers.
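The layout guidelines above can be written down as a small check. The following is only a sketch: the `Partition` class, the array names, and the role labels are illustrative assumptions, not part of any Exchange or RAID tool. It flags a layout in which transaction logs share a physical array with a data store, or in which logs or data sit on the OS/application partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    letter: str  # drive letter as seen by Windows
    array: str   # hypothetical name of the physical RAID array behind it
    role: str    # "os_apps", "logs", or "data" (labels assumed for this sketch)


def check_layout(partitions):
    """Return a list of guideline violations; an empty list means the layout passes."""
    problems = []
    log_arrays = {p.array for p in partitions if p.role == "logs"}
    data_arrays = {p.array for p in partitions if p.role == "data"}
    os_letters = {p.letter for p in partitions if p.role == "os_apps"}

    # Logs and data stores must not share a physical device.
    for array in sorted(log_arrays & data_arrays):
        problems.append(f"logs and data stores share physical array {array}")

    # Logs and data must live on a different partition from the OS and applications.
    for p in partitions:
        if p.role in ("logs", "data") and p.letter in os_letters:
            problems.append(f"{p.role} placed on OS/application partition {p.letter}")
    return problems


# The six-drive example: three RAID 1 pairs, C and D carved from the first pair.
six_drive = [
    Partition("C", "mirror1", "os_apps"),
    Partition("D", "mirror1", "logs"),
    Partition("E", "mirror2", "data"),
    Partition("F", "mirror3", "data"),
]

# The layout to avoid: one big RAID 5 array chopped into logical partitions.
lumped = [
    Partition("C", "raid5", "os_apps"),
    Partition("D", "raid5", "logs"),
    Partition("E", "raid5", "data"),
    Partition("F", "raid5", "data"),
]
```

Calling `check_layout(six_drive)` returns an empty list, while `check_layout(lumped)` reports that the logs and data stores share the same array.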
For Exchange 2000, put three data stores on the E partition and three data stores on the F partition. This approach makes recoverability and maintenance a snap. If you put one large data store on a single partition, you can't safely use more than half the space on that partition. If you need to compact a database during maintenance, or if you need to repair a database because one of your data stores gets corrupted, the minimum free space on that partition must be equal to the size of the data store you're compacting or repairing.
With three equally sized data stores on one partition, you need to reserve only 25 percent of the partition for maintenance or repair jobs, because you compact or repair one store at a time and the free space only has to match that one store. Database corruptions are common, so having six relatively smaller data stores makes repairs six times faster and easier. Additionally, the five remaining stores are not affected when the sixth data store is being compacted or repaired, allowing zero downtime for most of the company. I highly recommend Exchange 2000 or later because of this feature, especially for large companies and companies that require 24/7 availability.
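The free-space arithmetic can be checked with a few lines of Python. This is a sketch under two assumptions that the article implies but does not state: all stores on a partition are the same size, and only one store is compacted or repaired at a time. The function names are made up for illustration.

```python
def reserve_fraction(store_count):
    """Fraction of a partition to keep free so one store can be compacted or repaired.

    Assumes equally sized stores and one maintenance job at a time: the free
    space must equal one store, so the partition holds store_count + 1 shares.
    """
    if store_count < 1:
        raise ValueError("store_count must be at least 1")
    return 1 / (store_count + 1)


def max_store_size_gb(partition_gb, store_count):
    """Largest size each store can reach while still leaving room for maintenance."""
    return partition_gb * reserve_fraction(store_count)


print(reserve_fraction(1))        # one large store: 0.5 of the partition stays free
print(reserve_fraction(3))        # three stores: 0.25 of the partition stays free
print(max_store_size_gb(100, 3))  # 100 GB partition, three stores: 25.0 GB each
```

One store per partition gives the 50 percent figure, and three stores per partition gives the 25 percent figure quoted above.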
Disk-imaging methodology
Before you start, be sure your Exchange server is in full operational order with all database stores, antivirus, backup agents, and any other add-ons installed. To image a system, you generally need to dump your image onto a separate physical partition from DOS mode, because you can't image a boot partition while that partition is loaded.
The easiest way to do this is to dump an image to a network file share. To avoid writing a 10-page chapter on how to create a TCP/IP network boot disk with SMB client capabilities, I’m going to recommend that you simply go to bootdisk.com. That site is a one-stop shop where you can freely download premade images to make boot disks that pretty much work with all common network adapters.
Then, you only need to make a few minor modifications to the drive-mapping batch file to mount your network drives onto a drive letter. I recommend that you make a bootable CD image of the modified floppy disk to vastly improve boot times.
From there, you simply boot up the CD with the network drivers and automatically map the network share, which should also contain a recent copy of Norton Ghost or PowerQuest Drive Image (PQDI). Next, you simply run your imaging software and dump the C partition onto the network share. I'd also create an image backup of all the log file and data partitions, with just the database structure and no data.
Although imaging the data partitions is not mandatory, it will save you a lot of trouble by recreating the entire partition and database directory structures when you're doing a complete restore that starts from scratch on identical or wiped hardware. The bare-bones database structure partition images will be very small because they contain almost no data and compress extremely well.
Newer versions of Ghost or PQDI support a hot backup feature that tracks changes to the OS and applications while the system is operational. Note that you must first create an initial image of the C partition in DOS mode. From then on, you track any changes to a production server, even while the system is running, by backing up all deltas to the OS and applications. This is a valuable imaging feature because it's not feasible to take your server down just to do a cold image backup every week or so.
Backup methodology
Up to this point, our discussion has been focused on backing up the OS/application partition using disk-imaging software. Backing up Exchange data is equally important. Anyone serious about performing hot backups of an Exchange server must use a reliable third-party enterprise solution that hooks into Exchange with a backup and restore agent. One of the better solutions I've seen is from Legato. Some other solutions I've used were utter nightmares and never worked consistently, which usually meant someone’s head was going to roll.
For large enterprises that can’t afford to lose even an hour’s worth of data, I recommend that you go the extra mile of continuously making copies of the log files onto a separate physical device via some automated hourly batch process. Tape backup covers you up to the previous night, and log files can cover you to the last few minutes just before a database disaster.
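As one way to picture that automated copy job, here is a minimal Python sketch. The directory paths, the `*.log` pattern, and the function name are assumptions for illustration, and the hourly scheduling is left to whatever task scheduler you already use. The sketch copies any log file that is missing from, or newer than, its copy on the second device, and it skips files it cannot read (for example, a log that Exchange currently has locked).

```python
import shutil
from pathlib import Path


def copy_new_logs(log_dir, dest_dir, pattern="*.log"):
    """Copy log files that are missing or out of date in dest_dir.

    Returns the names of the files copied on this run.
    """
    log_dir, dest_dir = Path(log_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(log_dir.glob(pattern)):
        if not src.is_file():
            continue
        dst = dest_dir / src.name
        if dst.exists():
            s, d = src.stat(), dst.stat()
            # Same size and not newer (with a 2-second allowance for file
            # systems with coarse timestamps): treat as already copied.
            if s.st_size == d.st_size and s.st_mtime <= d.st_mtime + 2:
                continue
        try:
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
        except OSError:
            continue  # unreadable or locked file: try again on the next run
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    # Hypothetical paths: D: holds the transaction logs, \\backup01\exlogs is
    # a share on a separate physical server.
    print(copy_new_logs(r"D:\exchsrvr\logs", r"\\backup01\exlogs"))
```

Because the copy preserves timestamps, running the job again without new log activity copies nothing, which keeps an hourly schedule cheap.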
Recovery procedure
Several types of disasters can hit an Exchange server, including database corruption or server corruption of the OS or Exchange application. In some cases, you may be able to recover from database corruption by running the repair operation on the affected data store or calling Microsoft and finding a way to fix a severe OS or application error on the Exchange server. However, if a system completely dies, following the guidelines I've put together can rescue your Exchange server.
If the Exchange server can’t boot, or the Exchange services refuse to load normally, proceed with the following steps.
- Boot the system with a DOS boot disk with network support (it can be the same one you used to make an image of the C drive).
- Load the C image with the image of the good Exchange server (from the network share where you keep all server image backups) onto the corrupted C partition and reboot when complete.
- If you were maintaining hot update backups with PowerQuest Delta Deploy or Ghost, apply those updates to your OS and applications. Reboot if necessary.
If you haven't lost the data partition or corrupted the database, then everything should be okay. The Exchange server should recognize and mount the database automatically. Up to this point, the whole process can be done in 30 minutes.
If you've lost data, or you're doing a bare metal restore, you need to reimage the data partitions with the bare data structure images. If you didn’t bother to make those images, you'll need to create the identical partitions and directory structure manually, which may be very difficult. (Remember, if you had put the data on a separate physical device, the odds of losing it at the same time as an OS/app failure are very slim.) Reboot if you reimaged the log and database partitions.
Once rebooted, your Exchange server will be fully operational with an empty database. Invoke the data recovery application and proceed to recover data from tape. If you were wise enough to have copied the log files to another physical server/device before the disaster, now would be the time to copy those files back and apply them to recover data up to the last couple of minutes before the Exchange server crashed.
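The whole procedure can be summarized as a small decision flow. The Python sketch below only restates the steps above as a checklist generator; the function name and its flags are invented for illustration and do not drive any imaging or backup tool.

```python
def recovery_steps(hot_updates=False, data_lost=False,
                   structure_images=False, log_copies=False):
    """Return the ordered recovery checklist for the situation described by the flags."""
    steps = [
        "Boot from the DOS boot disk with network support",
        "Restore the C partition image from the network share and reboot",
    ]
    if hot_updates:
        steps.append("Apply the hot update backups to the OS and applications; "
                     "reboot if necessary")
    if not data_lost:
        # Data partition intact: Exchange should mount the database on its own.
        steps.append("Confirm that Exchange mounts the existing database")
        return steps
    if structure_images:
        steps.append("Reimage the log and data partitions from the bare "
                     "structure images and reboot")
    else:
        steps.append("Recreate the identical partitions and directory "
                     "structure manually")
    steps.append("Restore the data from tape with the backup agent")
    if log_copies:
        steps.append("Copy the saved log files back and apply them")
    return steps


# Worst case: bare metal restore with structure images and saved log copies.
for number, step in enumerate(
        recovery_steps(hot_updates=True, data_lost=True,
                       structure_images=True, log_copies=True), start=1):
    print(f"{number}. {step}")
```

With every flag left at its default, the list stops after the image restore and the database check, which is the 30-minute case.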
You rarely would need to do a bare metal recovery of an Exchange server, but following the above procedure ensures that you have maximum recoverability of your business-critical Exchange servers. As a final precaution, you should keep an off-site tape backup of your images and database. This also goes for any other critical server or application that you back up via an image.
Save time and your sanity
The above procedure is the easiest and most reliable way of recovering from an Exchange disaster. Of course, this methodology can basically apply to any server or application. Some administrators may ask, “Microsoft doesn’t support this, does it?” or note that Microsoft does not support disk imaging. The truth is, I use disk imaging on all my servers, and I've never been turned down for support from Microsoft, nor have I been asked if I used disk imaging at all.
I even maintain a library of generic system images for every type of system configuration I have so that I can deploy or redeploy any new server rapidly. I've never been turned down for support of this configuration. Microsoft support is actually one of the more reasonable solutions out there. Where else can you get unlimited attention until an issue is resolved for $250? Additionally, Microsoft has already announced its own image deployment strategy with a new API called ADS (Automated Deployment Services), which has already won the support of major players. The bottom line is that disk imaging makes a lot of sense, especially as an alternative to the convoluted method of Exchange recovery that Microsoft recommends.