This article is from TechRepublic's Disaster Recovery e-newsletter.
In disaster recovery planning, it's a good idea to review definitions and disaster levels from time to time. By developing such a baseline, you can make sure everyone in your organization is on the same page when it comes to classifying a disaster and determining a solution.
Let's examine disaster levels and define common terms that apply to DR solutions. Keep in mind that these are only suggestions based on industry information; your company can adapt these guidelines and definitions as necessary.
Define the disaster
Not every disaster takes out your entire data center. Many disasters are on a smaller scale and impact only one or two systems, if any. While there's no official industry standard, I've devised the following scale for rating the level of disasters based on a British military code for defining the levels of engagement.
Threat impact and analysis: In such a scenario, someone claims to know of a back door into your systems or to be ready to release a virus. This situation requires tightening security and intercepting the attacker. However, the organization has not yet incurred damage, and no breach is currently in progress.
Minimal damage event: This situation has no impact on the availability of data systems, but it's still an issue you must deal with immediately. For example, a security breach may have allowed an intruder to obtain sensitive information while the data systems themselves keep running.
Single-system failure: A single data system goes offline for more than a few minutes (or any length of time, depending on the system's criticality). This situation necessitates immediate failover to local backup systems if possible; otherwise, you must restore from tape to backup hardware. In general, this scenario doesn't substantially impact business, but you must address it ASAP.
Single critical failure or multiple noncritical failures: In this scenario, an immediate threat to business operations has occurred, but your data center is still up and running. Recovery to alternate hardware and/or local failover are still options, but response time is now vital. This is the level that wide-scale virus attacks may fall into, so containment and infection recovery may also be necessary.
Imminent or actual data center failure or a larger failure: Power failures, espionage, terrorism, and natural disasters fall into this category. Remote location failover or rebuilding of data centers using tape-based backup data are your only options; this level assumes that the production facility will be unusable for a reasonably long period of time.
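To make the scale easier to apply consistently, the five levels above can be sketched as a small classification routine. This is only an illustration: the numeric values, the `Incident` fields, and the rule that two or more noncritical failures count as "multiple" are assumptions made for the example, not part of the scale itself.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class DisasterLevel(IntEnum):
    # Numbering is an assumption; the levels are listed in this order
    # in the article but are not numbered there.
    THREAT = 1
    MINIMAL_DAMAGE = 2
    SINGLE_SYSTEM_FAILURE = 3
    CRITICAL_OR_MULTIPLE_FAILURE = 4
    DATA_CENTER_FAILURE = 5


@dataclass
class Incident:
    # Hypothetical inputs chosen for this sketch.
    threat_reported: bool = False
    breach_occurred: bool = False
    critical_systems_down: int = 0
    noncritical_systems_down: int = 0
    data_center_at_risk: bool = False


def classify(incident: Incident) -> Optional[DisasterLevel]:
    """Return the most severe level that matches, or None if nothing applies."""
    if incident.data_center_at_risk:
        return DisasterLevel.DATA_CENTER_FAILURE
    if incident.critical_systems_down >= 1 or incident.noncritical_systems_down >= 2:
        return DisasterLevel.CRITICAL_OR_MULTIPLE_FAILURE
    if incident.noncritical_systems_down == 1:
        return DisasterLevel.SINGLE_SYSTEM_FAILURE
    if incident.breach_occurred:
        return DisasterLevel.MINIMAL_DAMAGE
    if incident.threat_reported:
        return DisasterLevel.THREAT
    return None
```

For example, `classify(Incident(critical_systems_down=1))` returns `DisasterLevel.CRITICAL_OR_MULTIPLE_FAILURE`. Checking the most severe condition first means an incident that matches several descriptions is always rated at the highest applicable level.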
Know the terms
It's important to be able to classify a disaster, but you must also be familiar with the terminology for solutions so you can best determine a course of action. Decision makers must understand the following three terms when discussing protection and restoration of data and data systems.
Disaster recovery (DR): This is the process of moving data to another location for eventual restoration. This term does not include any methodologies for immediate availability; rather, it refers to solutions such as off-site storage of tape backup and replication to a data vault in another location.
High availability (HA): This refers to failing over one or more data systems to an immediately available, same-site hardware resource. For example, if one database server fails, you can immediately bring another physical machine online in the same data center.
Because these solutions require underlying technologies for data replication, you employ DR technologies as part of these solutions in almost all cases. This is also the place where nearly all clustering technologies come into play.
The third term takes the concept of high availability to the next level: it refers to failing over data systems to a mirrored environment in a different physical location. In general, this includes performing routing changes via DNS, WINS, etc., so client computers can connect to these resources on different subnets and in different physical locations.
This type of solution typically takes longer to fail over due to these routing concerns, but solutions of this type are nearly always a great deal faster than any form of restoration technology.
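The relationship between these solution types and the state of the data center can be sketched as a simple decision function. This is a minimal illustration under stated assumptions: the `Response` names and the three boolean inputs are hypothetical, and a real plan would weigh many more factors, such as system criticality and acceptable downtime.

```python
from enum import Enum


class Response(Enum):
    # Labels are illustrative; they paraphrase the options described in this article.
    LOCAL_FAILOVER = "fail over to same-site standby hardware"
    RESTORE_TO_BACKUP_HARDWARE = "restore from tape to backup hardware"
    REMOTE_FAILOVER = "fail over to a mirrored environment at another site"
    REBUILD_FROM_OFFSITE_BACKUP = "rebuild the data center from off-site backup data"


def choose_response(data_center_usable: bool,
                    local_standby_ready: bool,
                    remote_mirror_ready: bool) -> Response:
    """Pick the fastest option that the current situation allows."""
    if data_center_usable:
        # Same-site failover is immediate; restoration is the fallback.
        if local_standby_ready:
            return Response.LOCAL_FAILOVER
        return Response.RESTORE_TO_BACKUP_HARDWARE
    # The production facility is unusable, so only off-site options remain.
    if remote_mirror_ready:
        return Response.REMOTE_FAILOVER
    return Response.REBUILD_FROM_OFFSITE_BACKUP
```

The ordering reflects the trade-off described above: failover, whether local or remote, is preferred over any form of restoration because it is nearly always faster.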
Once again, these are only suggestions for defining both disasters and the solutions you can use to protect against or avoid them. When discussing business continuity planning, customize these definitions according to your organization's needs.