Testing failover systems and backups has always been a touchy subject that most administrators would rather sweep under the rug. The whole idea behind a failover system is that if the primary component fails, the secondary system should automatically engage. Theoretically, this means you should be able to just pull the plug on the primary system and watch the secondary system instantly take over. Even so, we all know that things don't always work the way they're supposed to, and it takes a lot of nerve to pull the plug on a healthy production server in hopes that the failover system will work correctly.
If you're an adrenaline junkie, the pull-the-plug-and-pray technique might be just what the doctor ordered. For the rest of us, I recommend taking a more controlled approach to testing failover systems. Unfortunately, I can't give you an exact technique because everyone's network configuration is different. But I'll share some recommendations that should help most people.
Backup and restore capabilities
When it comes to testing failover systems, I recommend starting out by testing your backup and restore capabilities. After all, can you imagine what would happen if you accidentally trashed a server during your testing, and then found out that your backup was invalid and therefore couldn't be restored? The idea sounds farfetched, but I've seen it happen on more than one occasion.
The easiest way to test your backups is to get your hands on a spare server that isn't connected to the network. Install a tape drive in that server, and you can experiment with restoring backups without worrying about putting duplicate server names or duplicate IP addresses onto the network.
As you experiment with restoring backups, there are a few things you need to keep in mind. First, if you're restoring the operating system as part of your test, Windows may not boot if the test server's hardware differs from that of the corresponding production server. You might have to restore the backup and then reinstall Windows over the restored copy (a repair installation) just to get the server to boot.
Another thing to consider: A restore that completes on a test server doesn't necessarily mean that the backup is good. The true test of a restore operation is whether the restored data is valid and accessible. For example, if you're restoring an Exchange server, you should make sure that all of the Exchange-related services are able to start, and that you're able to mount the various databases. You might also do a spot check to make sure the appropriate permissions still apply. Only then can you rest assured that you have a good backup.
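For plain file data, one way to spot-check a restore is to compare checksums. The Python sketch below is only an illustration, not a complete validation tool: it assumes you can build a manifest of SHA-256 hashes from the source files at backup time and another from the restored copy. The function names are my own, and the approach only suits files that don't change between the backup and the comparison. It doesn't replace starting the services and mounting the databases.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(expected: dict, restored: dict) -> dict:
    """Report files that are missing, altered, or unexpected after a restore."""
    common = expected.keys() & restored.keys()
    return {
        "missing": sorted(expected.keys() - restored.keys()),
        "changed": sorted(k for k in common if expected[k] != restored[k]),
        "extra": sorted(restored.keys() - expected.keys()),
    }
```

If all three lists come back empty, the restored files match the originals byte for byte. That says nothing about permissions or about whether an application can actually open the data, so the manual checks above still apply.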
Failover equipment
Once you're confident in your backups, it's time to begin testing failover equipment. Start small by testing your routers. If you have redundant routers, you should be able to pull the plug on one router and have data packets automatically take an alternate path to their destination. I recommend that you start with routers because it's a simple test you can do without jeopardizing your servers.
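If you'd like a number to go with the router test, you could probe a host on the far side of the redundant routers while you pull the plug, and then measure how long connectivity was lost. The following Python sketch assumes that a TCP connection to some host and port is a fair stand-in for "packets are getting through." The host, port, and function names are placeholders of my own.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(host: str, port: int, duration: float, interval: float = 1.0) -> list:
    """Probe repeatedly for `duration` seconds; return (timestamp, ok) samples."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(host, port)))
        time.sleep(interval)
    return samples


def longest_outage(samples: list) -> float:
    """Return the longest run of failed probes, in seconds.

    An outage is measured from its first failed probe to the next successful
    one, or to the last sample if connectivity never came back.
    """
    longest = 0.0
    outage_start = None
    last_time = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
        last_time = timestamp
    if outage_start is not None:
        longest = max(longest, last_time - outage_start)
    return longest
```

You would start monitor() just before pulling the plug and pass its result to longest_outage() afterward. The result is only as precise as the probe interval and timeout allow, so treat it as a rough figure.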
Before you conduct any other types of tests, take steps to minimize the damage that could occur should something go horribly wrong. First, schedule the tests for a time slot when they will be the least disruptive, such as on a weekend, holiday, or late at night. If you decide to do some late-night testing, don't forget to schedule the tests in a way that doesn't interfere with your nightly backup. You should also send out an announcement a week or two ahead of time that the network will be unavailable during your tests, and remind people of this just prior to the testing.
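Checking a proposed test window against the nightly backup is simple arithmetic, but windows that cross midnight are easy to get wrong. As a rough illustration, the Python sketch below assumes the backup starts at the same time every night and runs for less than 24 hours. The function name and the times in the usage example are hypothetical.

```python
from datetime import datetime, time, timedelta


def conflicts_with_backup(test_start: datetime, test_end: datetime,
                          backup_start: time, backup_length: timedelta) -> bool:
    """Return True if the test window overlaps any nightly backup run.

    Checks every backup that begins from the day before the test starts
    through the day the test ends, so runs that cross midnight are covered.
    """
    day = test_start.date() - timedelta(days=1)
    while day <= test_end.date():
        begin = datetime.combine(day, backup_start)
        end = begin + backup_length
        # Two intervals overlap when each one starts before the other ends.
        if begin < test_end and test_start < end:
            return True
        day += timedelta(days=1)
    return False


if __name__ == "__main__":
    # Hypothetical schedule: backup runs nightly from 11 p.m. for four hours.
    clash = conflicts_with_backup(
        datetime(2024, 1, 6, 1, 0), datetime(2024, 1, 6, 2, 0),
        time(23, 0), timedelta(hours=4),
    )
    print("Conflict" if clash else "No conflict")
```

A window that ends exactly when the backup begins is treated as no conflict. In practice you'd probably want to leave some margin in case either job runs long.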
When the time comes to test the failover systems, start by shutting down any services that aren't absolutely essential. Basically, this means that Windows should be running, but none of your server-level applications should be. Now you can test your failover systems without fear of trashing your data. Test things such as your uninterruptible power supplies and cluster failover support. If everything appears to work correctly, and you're feeling brave, restart all the services that you shut down earlier and try the tests again.
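Keeping a written list of what you shut down makes it much easier to bring everything back afterward. One possible way to script that on a Windows server is sketched below in Python, using the standard net stop and net start commands. The service names are placeholders, the helper names are my own, and I'm assuming you list dependent services before the services they rely on. By default the script only prints the commands.

```python
import subprocess


def stop_commands(services: list) -> list:
    """Build one `net stop` command per service, in the order given."""
    return [["net", "stop", name] for name in services]


def start_commands(services: list) -> list:
    """Build `net start` commands in reverse order, so the service that was
    stopped last is started first."""
    return [["net", "start", name] for name in reversed(services)]


def run_all(commands: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Hypothetical service names; replace with the nonessential services on your server.
SERVICES = ["ExampleWebApp", "ExampleDatabase"]

if __name__ == "__main__":
    run_all(stop_commands(SERVICES))    # dry run: prints the commands only
    run_all(start_commands(SERVICES))
```

Review the dry-run output before passing dry_run=False. With check=True the script stops at the first command that fails, so you can investigate rather than carry on with a half-stopped server.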
Once failover has occurred, check to make sure that all data continues to be accessible and that the systems are functioning normally (from a user's perspective). Hopefully, your tests have gone smoothly; if not, don't worry about it too much. After all, that's what tests are for. It's better for you to find out about a problem during testing than during an actual disaster.