In previous columns, I laid out a classification system for the most common types of disaster recovery (DR) situations, and last time I focused primarily on what to do when there is proof of a network intruder. Now let’s deal with what happens in more traditional DR situations:

  • Level 3 – You lose minor amounts of data or a non-critical system fails
  • Level 4 – You lose a large amount of non-critical data or a critical system fails

Level 3 disasters involve minor data loss, perhaps due to an
incomplete restore from backup tape, or the loss of non-critical systems. When
this type of disaster occurs, speed is usually less of an issue: end users can
continue to do their jobs without this data and/or these systems. That doesn’t
mean your staff is off the hook, though; they still have to get everything back
up and running and find out what was lost. First, figure out what went wrong
and make sure the damage is contained. This may mean verifying the backups of
your other data systems, performing test restorations of controlled,
previously backed-up data, and determining what caused the failures. Your goal
here is to make sure that you will not lose more data or suffer the long-term
loss of a critical system. Once you have contained the problem, you can begin
to address it: rebuild the affected systems as quickly as possible, restore
all known-good data, run anti-virus and other security tools to clean the
systems and data, and take whatever other measures are needed to bring your
systems back.
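
To make the test-restoration step concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that you keep a manifest of SHA-256 checksums for the controlled data set, one "hash  relative/path" pair per line; the file names and layout are hypothetical, not a prescription:

    import hashlib
    import sys
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Stream a file through SHA-256 and return the hex digest."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_restore(restore_dir: Path, manifest: Path) -> bool:
        """Check restored files against 'hash  relative/path' manifest lines."""
        ok = True
        for line in manifest.read_text().splitlines():
            if not line.strip():
                continue
            expected, rel = line.split(None, 1)
            target = restore_dir / rel
            if not target.is_file():
                print(f"MISSING  {rel}")
                ok = False
            elif sha256(target) != expected:
                print(f"CORRUPT  {rel}")
                ok = False
        return ok

    if __name__ == "__main__":
        # Usage: python verify_restore.py /mnt/test-restore manifest.sha256
        sys.exit(0 if verify_restore(Path(sys.argv[1]), Path(sys.argv[2])) else 1)

Run it against the directory produced by a test restore; a non-zero exit code flags missing or corrupted files, which is exactly the early warning you want before trusting the same backups in a real recovery.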

Level 4 disasters are more time sensitive. This is an
instance in which large-scale data loss is discovered, or one or more
critical systems are taken offline. In these cases, you don’t have time to move
methodically, but you must still proceed with as much care as the situation
allows. Failure to do so could trigger a recurrence of whatever caused the
disaster in the first place, leading to even more downtime. You will have
to immediately restore any and all data that you can verify is not corrupt, and,
if you have some form of high-availability solution, allow your critical
data systems to fail over and resume operation. Initially, you will be acting
fast to restore as much of your data and services as you can, so
that end users can resume working with those systems while you find out what
went wrong. In Level 4 disasters, you do not carry out a complete investigation
until after the restoration of service.

That being said, you must be as careful as possible while
restoring services. Moving too fast could not only cause a recurrence of the
disaster because your staff missed some critical fault; it could actually
compound the problem, since rushing invites misconfigurations and accidents
that do even more damage. Move quickly, but stay in control of the situation
at all times, no matter how loudly the executives are screaming to get
everything back up immediately. If you have failover systems, perform a quick
check to ensure that you have a stable platform at your DR site, and then
restore operations. If the platform isn’t stable, make the changes necessary
to begin the data-restoration process before returning to service. Either way,
this emergency calls for an acute awareness of your systems’ health as you
move forward.
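
That "quick check" is worth scripting ahead of time, so it actually gets run under pressure. Below is a minimal pre-failover sketch in Python; the hostnames, ports, mount point, and free-space threshold are all hypothetical placeholders for whatever defines "stable" at your own DR site:

    import shutil
    import socket

    # Hypothetical DR-site endpoints; substitute your own hosts and ports.
    SERVICES = [
        ("dr-db.example.com", 5432),   # standby database
        ("dr-app.example.com", 443),   # application front end
    ]
    RESTORE_VOLUME = "/var/restore"    # hypothetical restore mount point
    MIN_FREE_BYTES = 50 * 1024**3      # require at least 50 GB free

    def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def platform_stable() -> bool:
        """Run the quick pre-failover checks and print each result."""
        stable = True
        for host, port in SERVICES:
            up = port_open(host, port)
            print(f"{host}:{port}  {'UP' if up else 'DOWN'}")
            stable = stable and up
        free = shutil.disk_usage(RESTORE_VOLUME).free
        print(f"{RESTORE_VOLUME}: {free // 1024**3} GB free")
        return stable and free >= MIN_FREE_BYTES

    if __name__ == "__main__":
        raise SystemExit(0 if platform_stable() else 1)

Because it only probes TCP reachability and free disk space, treat it as a floor, not proof of health; the point is that the checklist exists and runs in seconds.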

For both Level 3 and Level 4 disasters, after you deal with the
initial emergency, you will have to determine exactly how much data was lost, so
that end users can begin the job of manual recovery where possible. This may
mean re-entering data from hard copy, alerting clients to the loss, and
preparing the proper regulatory reports. None of that can happen, however,
until you figure out what was lost and what is still recoverable.
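
One way to scope the loss is a simple set difference between the records you had and the records that came back. The sketch below assumes, purely for illustration, that you can export one record ID per line from the last known-good state (a report, a transaction-log extract) and from the restored system; both file names are hypothetical:

    def load_ids(path):
        """Read a file of record IDs, one per line, into a set."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    known_good = load_ids("ids_last_known_good.txt")  # last known-good state
    restored = load_ids("ids_after_restore.txt")      # what the restore brought back

    missing = sorted(known_good - restored)
    print(f"{len(missing)} records lost; flag these for manual recovery:")
    for record_id in missing:
        print(record_id)

The resulting list is what the business side needs: a concrete inventory of what must be re-keyed from hard copy or disclosed to clients and regulators.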

Data-loss disasters are never easy to deal with. The urgency
pressed upon IT staff during such outages can lead to more mistakes, let
intruders back into the network, and generally open the door to new disasters.
Working quickly and methodically while everything around you goes haywire may
sound like the toughest job in the world, but it is what gets your systems
back up and running, and what ensures you restore as much data and service as
possible in the end.