Testing failover systems and
backups has always been a touchy subject that most administrators would rather
sweep under the rug. The whole idea behind a failover system is that if the
primary component fails, the secondary system should automatically engage.
Theoretically, this means you should be able to just pull the plug on the
primary system and watch the secondary system instantly take over. Even so, we
all know that things don’t always work the way they’re supposed to, and it
takes a lot of nerve to pull the plug on a healthy production server in hopes
that the failover system will work correctly.

If you’re an adrenaline junkie, the pull-the-plug-and-pray
technique might be just what the doctor ordered. For the rest of us, I
recommend taking a more controlled approach to testing failover systems.
Unfortunately, I can’t give you an exact technique because everyone’s network
configuration is different. But I’ll share some recommendations that should
help most people.

Backup and restore capabilities

When it comes to testing failover systems, I recommend
starting out by testing your backup and restore capabilities. After all, can
you imagine what would happen if you accidentally trashed a server during your
testing, and then found out that your backup was invalid and therefore couldn’t
be restored? The idea sounds farfetched, but I’ve seen it happen on more than
one occasion.

The easiest way to test your backups is to get your hands on
a spare server that isn’t connected to the network. Install a tape drive in
the server, and you can experiment with restoring backups without
worrying about putting duplicate server names or duplicate IP addresses
onto the production network.

As you experiment with restoring backups, there are a few
things you need to keep in mind. First, if you’re restoring the operating
system as part of your test, Windows may not boot if the hardware on your test
server is different from the hardware on the corresponding production server.
It might be necessary to restore the backup and then manually reinstall Windows
over the restored copy just to get the server to boot.

Another thing to consider: A restore to a test server that completes without errors
doesn’t necessarily mean that the backup is good. The true test of a restore
operation is whether the restored data is valid and accessible. For example,
if you’re restoring an Exchange server, you should make sure that all of the
Exchange-related services are able to start, and that you’re able to mount the
various databases. You might also do a spot check to make sure the appropriate
permissions still apply. Only then can you rest assured that you have a good
backup.
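
One way to spot-check that restored files match the originals is to compare checksums. The Python sketch below is only an illustration: the function names and the idea of keeping a checksum manifest taken at backup time are my assumptions, not a feature of any particular backup product. It doesn’t replace the application-level checks described above, such as starting services and mounting databases.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MB chunks so large files don't exhaust memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_manifests(expected, actual):
    """Report files that are missing from, or differ in, the restored copy."""
    missing = sorted(set(expected) - set(actual))
    changed = sorted(
        name for name in expected
        if name in actual and expected[name] != actual[name]
    )
    return {"missing": missing, "changed": changed}
```

In this sketch, you would call build_manifest on the source data when the backup is taken, call it again on the restored copy on the test server, and pass both results to compare_manifests. An empty report means the file contents match. It says nothing about permissions or whether an application can actually open the data.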

Failover testing

Once you’re confident in your backups, it’s time to begin
testing failover equipment. Start small by testing your routers.
If you have redundant routers, you should be able to pull the plug on one
router and have data packets automatically take an alternate path to their
destination. I recommend that you start with routers because it’s a simple test
you can do without jeopardizing your servers.
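
If you want something more precise than watching a ping window during the router test, a small probe script can record how long connectivity was actually lost. The following Python sketch is an assumption-laden example: the function names are hypothetical, and it uses a TCP connection attempt as the reachability test, which only works if the target host accepts connections on the port you choose.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def record_outages(probe, attempts, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly and return a list of (down_at, up_at) pairs.

    sleep and clock are parameters so the loop can be exercised
    without waiting in real time.
    """
    outages = []
    down_since = None
    for _ in range(attempts):
        now = clock()
        if probe():
            if down_since is not None:
                # Connectivity came back: close the open outage.
                outages.append((down_since, now))
                down_since = None
        elif down_since is None:
            # First failed probe of a new outage.
            down_since = now
        sleep(interval)
    if down_since is not None:
        # Still down when the run ended.
        outages.append((down_since, clock()))
    return outages
```

As a usage example, record_outages(lambda: tcp_probe("192.0.2.10", 443), attempts=120) would probe a placeholder address once a second for about two minutes while you pull the plug on one router. Each pair in the result marks when the path went down and when it came back, so the difference is a rough measure of how long the failover took.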

Before you conduct any other types of tests, take steps
to minimize the damage that could occur should something go horribly wrong.
First, schedule the tests for a time slot when they will be the least disruptive,
such as on a weekend, holiday, or late at night. If you decide to test late at night,
schedule the tests so that they don’t overlap
with your nightly backup window. You should also send out an announcement a week or
two ahead of time that the network will be unavailable during your tests, and
remind people of this just prior to the testing.
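
Checking a proposed test slot against the backup window is simple arithmetic, but it’s easy to get wrong when the backup runs past midnight. Here is a minimal Python sketch of that check. The function names and the example times are made up for illustration, and it assumes you know when your backup job starts and finishes.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, end_a, start_b, end_b):
    """Return True if the two time windows share any time at all."""
    return start_a < end_b and start_b < end_a


def first_clear_start(candidates, duration, backup_start, backup_end):
    """Return the first candidate start whose test window avoids the backup."""
    for start in candidates:
        if not windows_overlap(start, start + duration,
                               backup_start, backup_end):
            return start
    return None


# Hypothetical schedule: the backup runs 11 PM Saturday to 2 AM Sunday.
backup_start = datetime(2024, 1, 6, 23, 0)
backup_end = datetime(2024, 1, 7, 2, 0)
candidates = [datetime(2024, 1, 6, 22, 0), datetime(2024, 1, 7, 2, 0)]
chosen = first_clear_start(candidates, timedelta(hours=2),
                           backup_start, backup_end)
```

With these example values, the 10 PM slot is rejected because a two-hour test would run into the backup, and the 2 AM slot is chosen. Using full dates and times rather than bare clock times is what keeps the comparison correct across midnight.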

When the time comes to test the failover systems, start by
shutting down any services that aren’t absolutely essential. In practice, this
means that Windows should be running, but none of your server-level applications
should be. With the applications stopped, you can test your failover systems with far less risk of
trashing your data. Test things such as your uninterruptible power supplies and
cluster failover support. If everything appears to work correctly, and you’re
feeling brave, restart all the services that you shut down earlier and try the
tests again.

Once failover has occurred, check to make sure that all data continues
to be accessible and that the systems are functioning normally (from a user’s perspective).
Hopefully, your tests have gone smoothly; if not, don’t worry about it too
much. After all, that’s what tests are for. It’s better for you to find out
about a problem during testing than during an actual disaster.
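
To make the post-failover checks repeatable, you could collect them into a small script and run the same list after every test. The Python sketch below shows one possible shape for that. The run_checks function and the check names are hypothetical, and the real checks would be whatever proves, from a user’s perspective, that your systems are working.

```python
def run_checks(checks):
    """Run each named check and return a dict of name -> True or False.

    A check passes only if it returns a truthy value; a check that
    raises an exception is recorded as a failure instead of stopping
    the remaining checks.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failed(results):
    """Return the sorted names of the checks that did not pass."""
    return sorted(name for name, ok in results.items() if not ok)
```

Each check is just a function that returns True or False, for example one that opens a file on a shared folder or connects to the mail server. After a failover, an empty list from failed(run_checks(...)) means every check passed.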