So, you’ve made backup tapes, used replication tools, or
otherwise made sure that the data is safe. When a disaster hits, you have
the tools at your disposal to bring your business back. Or do you? Disaster
Recovery (DR) involves more than just having the data: recovering from a
disaster takes a coordinated plan and quick action before you attempt to
restore anything. The natural reaction to a disaster is to dig right into
restoring data and services, but it’s important to be methodical and think
through the problem before taking action. Doing so usually saves time in the
end, because you won’t waste effort on actions that turn out to be unneeded or
that make the original problem worse.

Your first steps will be determined by the nature of the
disaster: was it a virus, a physical mishap, or hacking that caused a
failure? There may be no reason to take further failover or restoration action
if the fault was caused by a bad stick of RAM; you could simply replace it if
you have a spare, reboot the machine, and move on. However, you’re not always
going to be that lucky. A failure of a processor, disk, or other component that
can’t simply be swapped out could cause much longer downtimes for recovery.
Keep in mind that the initial problem that you discover may be masking other
faults within the system, so don’t stop your investigation prematurely. You
should have a checklist of possible culprits and methodically rule out each
one, so that you know the most direct way to proceed with recovery.
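
As a rough illustration of working through such a checklist, here is a minimal Python sketch. The check names and the placeholder probes are hypothetical and stand in for whatever diagnostics you actually run. The point it shows is that every item gets checked, even after the first fault turns up, since that fault may be masking others.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One entry on the triage checklist (all names here are hypothetical)."""
    name: str
    is_faulty: Callable[[], bool]  # returns True if this culprit is confirmed


def run_checklist(checks: List[Check]) -> List[str]:
    """Run every check and return the names of all confirmed culprits.

    Every check runs even after the first hit, because the first fault
    found may be masking others.
    """
    confirmed = []
    for check in checks:
        if check.is_faulty():
            confirmed.append(check.name)
    return confirmed


# Placeholder probes; a real checklist would call actual diagnostics here.
checklist = [
    Check("bad RAM", lambda: True),
    Check("failed disk", lambda: False),
    Check("virus infection", lambda: False),
    Check("intrusion", lambda: True),
]

culprits = run_checklist(checklist)
print(culprits)
```

With these placeholder probes, two culprits are reported rather than one, which is exactly the case where stopping at the first finding would have led to an incomplete recovery.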

Next, you need to survey the full extent of the damage and
decide if you’ll be able to restore to the original equipment, or if new
servers will need to be prepared. The damage can range from a hardware issue
with a quick fix to the need to replace entire data systems. Your
timeframe for recovery will depend on what went wrong, and whether you’ll be
using existing disks or restoring from tape.

If you cannot restore to the original server, then
recovering to new servers could significantly increase downtime. Only
enterprise-level facilities are likely to have spare systems prepared for this
contingency. You will have to estimate your downtime and see what you need to
do to meet your service level agreement (SLA), which stipulates the maximum
length of time allowed before services are restored to end users. Because
of this responsibility, quickly determining whether the old systems are
salvageable is your immediate goal; otherwise you could easily waste even more
time trying to recover to a server that is a lost cause.
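
To make the downtime estimate concrete, the following Python sketch compares two recovery paths against an SLA limit. All of the step names, durations, and the eight-hour SLA are invented for illustration; substitute your own figures.

```python
from datetime import timedelta


def estimate_downtime(steps):
    """Sum the estimated duration of each recovery step."""
    return sum(steps.values(), timedelta())


def meets_sla(steps, elapsed, sla_limit):
    """True if time already lost plus the estimated work fits within the SLA."""
    return elapsed + estimate_downtime(steps) <= sla_limit


# Hypothetical figures: adjust to your own environment.
sla_limit = timedelta(hours=8)
elapsed = timedelta(hours=1)  # time spent diagnosing so far

repair_original = {
    "replace failed hardware": timedelta(hours=1),
    "restore data": timedelta(hours=3),
}
build_new_server = {
    "provision hardware": timedelta(hours=6),
    "install operating system": timedelta(hours=2),
    "restore data": timedelta(hours=3),
}

repair_ok = meets_sla(repair_original, elapsed, sla_limit)
rebuild_ok = meets_sla(build_new_server, elapsed, sla_limit)
print("repair original meets SLA:", repair_ok)
print("build new server meets SLA:", rebuild_ok)
```

Under these assumed numbers, repairing the original server fits within the SLA and building a new one does not, which is why deciding early whether the old system is salvageable matters so much.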

If your disaster is due to a virus attack, the data on the machines
is immediately suspect and may be unusable, even if it is technically intact.
Even if you can restore to the original servers, you may have to restore from
an older copy of the data, one that predates the virus attack itself.
Presumably, this will be from tape, but you’ll have to estimate when the virus
struck to ensure that you don’t end up restoring data that is already
infected. If you’ve kept point-in-time copies of the data on some other media,
such as disk-based backup, you can use these, but keep in mind that a virus
can propagate through replication tools and across snapshot boundaries as
well, if those copies are attached to a live disk.
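
One way to reason about which copy to restore is sketched below in Python. It assumes you have a list of backup timestamps and a best guess at when the infection began; the dates and the 24-hour safety margin are hypothetical, and the margin exists only because the infection time is itself an estimate.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def latest_clean_backup(
    backups: List[datetime],
    infected_at: datetime,
    margin: timedelta,
) -> Optional[datetime]:
    """Return the newest backup taken before (infected_at - margin).

    Returns None if no backup is old enough to be trusted.
    """
    cutoff = infected_at - margin
    candidates = [b for b in backups if b < cutoff]
    return max(candidates, default=None)


# Hypothetical nightly backups at 02:00 and an estimated infection time.
backups = [
    datetime(2024, 3, 1, 2, 0),
    datetime(2024, 3, 2, 2, 0),
    datetime(2024, 3, 3, 2, 0),
    datetime(2024, 3, 4, 2, 0),
]
infected_at = datetime(2024, 3, 3, 14, 0)
margin = timedelta(hours=24)

chosen = latest_clean_backup(backups, infected_at, margin)
print(chosen)
```

Here the March 3 backup predates the estimated infection but falls inside the safety margin, so the sketch falls back to the March 2 copy. A wider margin costs more lost data but lowers the chance of restoring an infected copy.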

Once you’ve determined what happened and how to fix it
without causing further damage, you can then begin the process of restoration.
Replacing hardware and restoring data can get you back up and running with
minimal disruption to other systems and your end users. Of course, restoring
data to the same or new systems is often much less expensive than failing
over to secondary systems, but it limits you to recovering from isolated
disasters only. Next time, I’ll look specifically at what happens when
recovering to the original servers is not an option.
