So, you've made backup tapes, used replication tools, or otherwise made sure that the data is safe. When a disaster hits, you have the tools at your disposal to bring your business back. Or do you? Disaster Recovery (DR) involves more than just having the data: recovering from a disaster takes a coordinated plan and quick action before you attempt to restore anything. The natural reaction to a disaster is to dig right into restoring data and services, but it's important to be methodical and think through the problem before taking action. Doing so usually saves time in the end, because you won't waste effort on actions that turn out to be unneeded or that make the original problem worse.
Your first steps will be determined by the nature of the disaster—was it a virus, a physical mishap, or hacking that caused a failure? There may be no reason to take further failover or restoration action if the fault was caused by a bad stick of RAM; you could simply replace it if you have a spare, reboot the machine, and move on. However, you're not always going to be that lucky. A failure of a processor, disk, or other component that can't simply be swapped out could cause much longer downtimes for recovery. Keep in mind that the initial problem that you discover may be masking other faults within the system, so don't stop your investigation prematurely. You should have a checklist of possible culprits, and methodically rule out each one, so that you know the most direct way to proceed with recovery.
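As a rough illustration of working through such a checklist, the Python sketch below runs every check rather than stopping at the first fault, since the first problem found may be masking others. The `Check` class, the `run_checklist` function, and the probes are hypothetical names invented for this example; real probes would query hardware logs, virus scanners, and so on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One entry on the triage checklist (all names here are illustrative)."""
    name: str
    probe: Callable[[], bool]  # returns True when this component looks faulty


def run_checklist(checks: List[Check]) -> List[str]:
    """Run every probe and return the names of all suspected culprits.

    The loop deliberately does not stop at the first fault, because the
    first problem found may be masking others.
    """
    culprits = []
    for check in checks:
        if check.probe():
            culprits.append(check.name)
    return culprits


# Hypothetical probes standing in for real diagnostics.
checklist = [
    Check("memory", lambda: True),
    Check("disk", lambda: False),
    Check("malware", lambda: True),
]
suspects = run_checklist(checklist)
```

Here `suspects` collects every component that failed its probe, so a bad stick of RAM does not hide a simultaneous malware finding.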
Next, you need to survey the full extent of the damage and decide whether you'll be able to restore to the original equipment or whether new servers will need to be prepared. The damage can range from a hardware issue with a quick fix to the need to replace entire data systems. Your timeframe for recovery will depend on what went wrong, and on whether you'll be using existing disks or restoring from tape.
If you cannot restore to the original server, then recovering to new servers could significantly increase downtime. Only enterprise-level facilities are likely to have spare systems prepared for this contingency. You will have to estimate your downtime and see what you need to do to meet your service level agreement (SLA), which stipulates the agreed-upon length of time within which services must be restored to end users. Because of this responsibility, your immediate goal is to determine quickly whether the old systems are salvageable; otherwise, you could easily waste even more time trying to recover to a server that is a lost cause.
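The downtime estimate can be approximated with simple arithmetic. The sketch below is only a back-of-the-envelope model: it assumes downtime is preparation time plus data volume divided by restore throughput, and every number and function name in it is made up for illustration rather than taken from any real environment.

```python
def estimate_downtime_hours(data_gb, restore_gb_per_hour, prep_hours):
    """Estimate downtime as preparation time plus raw data-transfer time."""
    if restore_gb_per_hour <= 0:
        raise ValueError("restore throughput must be positive")
    return prep_hours + data_gb / restore_gb_per_hour


def meets_sla(estimated_hours, sla_hours):
    """True when the estimated downtime fits within the SLA window."""
    return estimated_hours <= sla_hours


# Illustrative figures only: 2000 GB restored at 400 GB/hour.
SLA_HOURS = 12
same_server = estimate_downtime_hours(2000, 400, prep_hours=3)
# Recovering to a new server adds assumed procurement and setup time.
new_server = estimate_downtime_hours(2000, 400, prep_hours=27)

same_server_ok = meets_sla(same_server, SLA_HOURS)
new_server_ok = meets_sla(new_server, SLA_HOURS)
```

With these assumed figures, restoring to the original server fits the 12-hour window while recovering to a new server does not, which is why deciding early whether the old hardware is salvageable matters.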
If your disaster is due to a virus attack, the data on the machines is immediately suspect and may be unusable, even if it is technically intact. Even if you can restore to the original servers, you may have to restore from an older copy of the data, one that predates the virus attack. Presumably this will be from tape, but you'll have to estimate when the virus struck to ensure that you don't end up restoring more corrupt data. If you've kept point-in-time copies of the data on some other media, such as disk-based backup, you can use these, but keep in mind that a virus can also propagate through replication tools and into snapshots if they're attached to a live disk.
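Choosing a copy that predates the infection can be sketched as picking the newest backup older than your best guess of when the virus struck, minus a safety margin to allow for the guess being wrong. The function name, the margin, and the dates below are assumptions made for this example.

```python
from datetime import datetime, timedelta


def pick_restore_point(backup_times, estimated_infection,
                       safety_margin=timedelta(0)):
    """Return the newest backup taken before the infection estimate.

    The safety margin pushes the cutoff earlier, since the infection time
    is only an estimate. Returns None when no backup is old enough.
    """
    cutoff = estimated_infection - safety_margin
    candidates = [t for t in backup_times if t < cutoff]
    return max(candidates) if candidates else None


# Illustrative nightly backups at 02:00 from 1 to 5 March.
backups = [datetime(2024, 3, day, 2, 0) for day in range(1, 6)]
infection_guess = datetime(2024, 3, 4, 10, 0)

# With a one-day margin, the cutoff is 3 March 10:00.
restore_point = pick_restore_point(backups, infection_guess,
                                   safety_margin=timedelta(days=1))
```

A wider margin trades more lost data for a lower chance of restoring an already infected copy.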
Once you've determined what happened and how to fix it without causing further damage, you can begin the process of restoration. Replacing hardware and restoring data can get you back up and running with minimal disruption to other systems and your end users. Restoring data to the same or new systems is often much less expensive than failing over to secondary systems, but it limits you to recovering from isolated disasters only. Next time, I'll look specifically at what happens when recovering to the original servers is not an option.