By Mike Talon
When it comes to disaster recovery planning, we all fear the worst-case scenario. You plan for everything you can think of, and you consult colleagues about issues you may have overlooked. But despite your best efforts, something still slips through the cracks and fails during an emergency.
Services are down, data is lost, and everyone is looking directly at you. If this unfortunate event happens to you, how do you handle it?
In this situation, the first thing you need to do is figure out the extent of the damage. For example, maybe the high availability systems failed, but the replicated data is still intact. If so, the systems will be down longer than anticipated, but all the data is perfectly safe.
If both the high availability systems and the replicated data failed, check whether the tape backups are intact. Some data is better than none at all, so you can at least restore to the last good copy.
If nothing worked, you still need to assess the damage. Use that investigation to determine exactly which systems and how much data were lost.
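As a rough illustration, the damage assessment above can be written as a short decision procedure. This is only a sketch: the names DamageReport, RecoveryPath, and choose_recovery_path are hypothetical, and it assumes the high availability systems have already failed, as in the scenario described here.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryPath(Enum):
    """Hypothetical labels for the three outcomes of the assessment."""
    RESTORE_FROM_REPLICA = "longer outage, but all data is safe"
    RESTORE_FROM_TAPE = "restore to the last good copy"
    INVENTORY_LOSSES = "determine which systems and how much data were lost"


@dataclass
class DamageReport:
    """What the assessment found (assumes high availability already failed)."""
    replica_intact: bool
    tape_intact: bool


def choose_recovery_path(report: DamageReport) -> RecoveryPath:
    # Prefer the replicated data: it is the most complete copy.
    if report.replica_intact:
        return RecoveryPath.RESTORE_FROM_REPLICA
    # Fall back to tape: some data is better than none at all.
    if report.tape_intact:
        return RecoveryPath.RESTORE_FROM_TAPE
    # Nothing survived, so the job is to measure the loss.
    return RecoveryPath.INVENTORY_LOSSES
```

For example, choose_recovery_path(DamageReport(replica_intact=False, tape_intact=True)) selects the tape restore, matching the second case above.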
Your next step is to get the systems up and running as quickly as possible using whatever data you still have. While this step may sound obvious, keep in mind that you'll need to do this while other data systems are coming back online—and while management is breathing down your neck demanding answers.
In this situation, it's vital to set priorities and work through them in order. It's also important to set aside nonpriority issues as much as possible.
Do your best to defer all witch-hunts until after the emergency is over. If management insists on immediate disciplinary action, remind them that this will only prolong the recovery.
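One way to picture working by priority while deferring everything else is a simple sorted task list. The sketch below is illustrative only: the Task class, the plan_recovery function, and the sample task names are assumptions, not part of any particular DR tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Task:
    name: str
    # 1 is the most urgent; None means the task is not part of the recovery.
    priority: Optional[int]


def plan_recovery(tasks: List[Task]) -> Tuple[List[Task], List[Task]]:
    """Split tasks into an ordered recovery queue and a deferred list."""
    urgent = sorted(
        (t for t in tasks if t.priority is not None),
        key=lambda t: t.priority,
    )
    # Nonpriority items (blame, disciplinary reviews) wait until it is over.
    deferred = [t for t in tasks if t.priority is None]
    return urgent, deferred


# Hypothetical example tasks.
tasks = [
    Task("restore file shares", 3),
    Task("disciplinary review", None),
    Task("restore order database", 1),
    Task("restore email", 2),
]
urgent, deferred = plan_recovery(tasks)
print([t.name for t in urgent])
print([t.name for t in deferred])
```

Keeping the deferred list visible, rather than discarding it, gives you something concrete to show management: the issue has been noted and will be handled once the systems are back.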
After you've brought the systems back online with as much data as possible, call a staff meeting and perform a post-mortem to figure out what went wrong and why. Gather as much evidence as you can, and prepare an explanation for the managers who are about to demand answers.
If you have the time, compile everything into a report. If this isn't feasible, at least pull together a few key facts, using as many nontechnical terms as you can. Take this opportunity to explain why the system didn't work, and discuss how you can prevent the failure from recurring.
The failure of a DR system during an emergency is never a pleasant experience. Knowing how you'll handle the situation and proactively preparing for how you'll proceed will allow you to adapt on the fly while keeping your business up and running.
Mike Talon is an IT consultant and freelance journalist who has worked for both traditional businesses and dot-com startups.