Disaster Recovery

10 items for your disaster recovery wish list

We all have to address disaster recovery at various levels, but we typically must apply the technology to fit rigid parameters -- such as less cost or functionality -- instead of being able to do it right. But what if you didn't have any limitations to hold you back? Rick Vanover looks at some things that might go into building the perfect environment for meeting DR requirements.

We all have to address disaster recovery (DR) at various levels, but we typically must apply the technology to fit rigid parameters -- such as less cost or functionality -- instead of being able to do it right. But what if you didn't have any limitations to hold you back? How would you create the perfect DR model? Here are some things (however unrealistic) that might go into building the perfect environment for meeting DR requirements.


#1: The network is transparent

Providing transparent network connectivity is our number one challenge in making the ideal DR environment. If subnets for data center components were designed to be available across multiple locations, without reliance on one piece sitting in another data center, DR failover would be a breeze. Sure, a lot can be done to manage the use of a DR site through DNS and virtual switches -- but if those workarounds could be avoided in favor of a more natural configuration, failover would be simpler and more transparent.
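As a modest illustration of how this DNS juggling shows up day to day, the sketch below (with purely hypothetical service names and subnet prefixes) checks which site a handful of service names currently resolve to -- exactly the kind of sanity check an ideal, transparent network would make unnecessary.

```python
# Minimal sketch: report which site each service name currently resolves to.
# The service names and subnet prefixes are hypothetical placeholders.
import socket

SERVICES = ["app.example.com", "mail.example.com"]
PRIMARY_PREFIX = "10.1."   # primary data center subnet
DR_PREFIX = "10.2."        # DR data center subnet

def resolve_site(hostname: str) -> str:
    """Resolve a hostname and report which site's subnet its addresses fall in."""
    addresses = {info[4][0] for info in socket.getaddrinfo(hostname, None, socket.AF_INET)}
    if all(addr.startswith(DR_PREFIX) for addr in addresses):
        return "DR site"
    if all(addr.startswith(PRIMARY_PREFIX) for addr in addresses):
        return "primary site"
    return "mixed/unknown: " + ", ".join(sorted(addresses))

if __name__ == "__main__":
    for name in SERVICES:
        print(f"{name}: {resolve_site(name)}")
```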

#2: The storage is transparent

Storage could arguably take the #1 spot on our list, since it's such a big pain in DR configurations. Technologies are available to handle storage replication and set up storage grids, but how many of us have the money to implement that functionality? The ideal DR storage system would also dispel any performance limitations when you're running the entire enterprise from the DR configuration. Limitations in performance may force a selective DR, which leads to difficult decisions about which systems are truly required in the DR environment.
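One place the money question shows up is in sizing the replication link itself. The back-of-the-envelope sketch below uses made-up figures for daily change rate and burst headroom (placeholders, not recommendations); the point is simply that a link sized only for the average change rate leaves no room to catch up after bursts, and the DR copy quietly falls further and further behind.

```python
# Rough sizing of a storage replication link. All figures are illustrative placeholders.
daily_change_gb = 500      # data changed per day across replicated volumes
burst_headroom = 3.0       # change is rarely uniform; allow for peaks and catch-up

# Average change rate in megabits per second.
change_rate_mbps = (daily_change_gb * 8 * 1024) / (24 * 60 * 60)
required_mbps = change_rate_mbps * burst_headroom

print(f"Average change rate: {change_rate_mbps:.1f} Mbit/s")
print(f"Suggested link size with headroom: {required_mbps:.1f} Mbit/s")
```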

#3: Everything starts with DR in mind

How many times have you come across a system that began as a pilot or simple test, was promoted to a live role, and now stands alone and can't scale? These are DR plan inhibitors. If all systems are designed with DR concepts in mind, all systems can comply with the same DR requirements, and the transition becomes easy for administrators.

This extends to the peripheral components as well -- storage, data recovery, networking, and access to the system should be created with DR in mind. But too many times, a system may have some but not all of the DR components in place. "Mostly compliant" with the DR model is still noncompliant.
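One way to make "mostly compliant is still noncompliant" concrete is to score every system against the same short checklist and treat any missing item as a failure. A minimal sketch (with hypothetical systems and check names) might look like this:

```python
# Every DR checklist item must pass, or the system is flagged noncompliant.
# Systems and check names are hypothetical placeholders.
DR_CHECKS = ("replicated_storage", "dr_network_path", "tested_restores", "runbook_documented")

systems = {
    "payroll-db": {"replicated_storage": True, "dr_network_path": True,
                   "tested_restores": True, "runbook_documented": True},
    "pilot-app":  {"replicated_storage": True, "dr_network_path": False,
                   "tested_restores": True, "runbook_documented": False},
}

for name, checks in systems.items():
    missing = [c for c in DR_CHECKS if not checks.get(c, False)]
    status = "compliant" if not missing else "NONCOMPLIANT (missing: " + ", ".join(missing) + ")"
    print(f"{name}: {status}")
```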

#4: All areas of IT meet the same requirements

Have you ever been irritated by partial compliance with an enterprise DR policy? An example would be when one application meets a different standard of DR -- so maybe only a few clients can run the application in the DR configuration. Wouldn't it be great if the standing policy for the organization was to have full compatibility for the DR configuration? The ideal DR policy would provide funding and enforce the requirements for the DR configuration across all systems and groups within IT.

#5: Disaster recovery is performed in a few steps

How we get to a solid and robust DR configuration will vary widely by size and scope, but the perfect conversion to the DR configuration would be a quick and contained process -- a few steps per system, or a few steps for the entire environment. With the DR configuration so accessible, this would also be a good opportunity to enforce regular intervals at which the DR configuration is actually used.
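What might "a few steps per system" look like in practice? One hedged sketch: if every system exposes the same small set of failover operations, the site-wide conversion collapses into a loop. The class and system names below are hypothetical, standing in for whatever automation your platform actually provides.

```python
# Sketch of failover in a few uniform steps; names are hypothetical placeholders.
class DRSystem:
    def __init__(self, name: str):
        self.name = name

    def quiesce_primary(self) -> None:
        print(f"[{self.name}] stopping writes at the primary site")

    def promote_replica(self) -> None:
        print(f"[{self.name}] promoting the DR replica to active")

    def verify(self) -> None:
        print(f"[{self.name}] running smoke tests against the DR instance")


def fail_over(systems: list[DRSystem]) -> None:
    for system in systems:
        system.quiesce_primary()
        system.promote_replica()
        system.verify()


if __name__ == "__main__":
    fail_over([DRSystem("email"), DRSystem("erp"), DRSystem("file-services")])
```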

#6: Documentation for failover to the DR site is clear and simple

An overly complex procedure to use a DR site can ruin the usability of the mechanism. The ideal DR environment has consistent and clear documentation that is practiced regularly so that there is no guessing in switching over to the DR model. In fact, regular use of the DR model can ensure that the remote DR site works as expected, keeps staff familiar with the procedure, and extends the life of primary systems by increasing idle time at the primary site.

#7: All data recovery is native

The most challenging part of DR is the data recovery process. If a data recovery model is patched together using various scripts, watchdog programs, or other solutions that are not native to a product's feature set, the risk of data corruption and DR failure goes up. The ideal DR model would have solutions built into the product that consider all parts of a solution, as many products use more than just a database to provide the overall application.

#8: Performance in the DR model isn't compromised

A comprehensive DR plan that meets all requirements from a design perspective yet can't handle the load is worthless. You don't want to have to decide which applications and systems are available at the DR site when you're in a DR situation. Limitations such as Internet connectivity, network bandwidth, shared storage throughput, backup mechanism availability, and storage capacity are all factors in gauging the overall performance for the DR site.

The perfect DR situation would be an inventory in the remote data center that exactly mirrors the primary data center. However, maintaining an equipment inventory in lockstep with another data center is nearly impossible. So the next-best solution would be a remote data center that meets or exceeds a performance benchmark set by the primary data center in all relevant categories.
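Checking "meets or exceeds in all relevant categories" is easy to automate once both sites publish the same metrics. The sketch below compares hypothetical DR-site measurements against primary-site baselines; the metric names and numbers are placeholders.

```python
# Compare DR-site measurements against primary-site baselines, category by category.
# Metric names and values are illustrative placeholders.
primary_baseline = {"storage_iops": 40000, "wan_mbps": 1000, "cpu_cores": 256, "ram_gb": 2048}
dr_measured = {"storage_iops": 35000, "wan_mbps": 1000, "cpu_cores": 192, "ram_gb": 2048}

shortfalls = {metric: (dr_measured.get(metric, 0), baseline)
              for metric, baseline in primary_baseline.items()
              if dr_measured.get(metric, 0) < baseline}

if shortfalls:
    for metric, (dr_value, baseline) in shortfalls.items():
        print(f"{metric}: DR site {dr_value} is below the primary baseline of {baseline}")
else:
    print("DR site meets or exceeds the primary baseline in every category")
```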

#9: The user experience in the change-over is nothing more than a reboot (if that)

Managing the transition to the remote data center is difficult enough on the data center side. But the user side of the transition should be made as seamless as possible. Strong DR plans and mechanisms frequently base technology on DNS names (especially CNAME records) that can easily be switched to reflect a new authoritative source for the business service. This can include standby application servers and mirrored database servers, as well as migration to new versions, all with a simple DNS change.
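As a sketch of what that DNS switch might look like -- assuming the zone accepts dynamic updates (RFC 2136) and the third-party dnspython library is installed; the zone, record names, and server address below are hypothetical, and a production setup would sign the update with a TSIG key:

```python
# Repoint a service CNAME at the DR site via a dynamic DNS update (RFC 2136).
# Requires the third-party dnspython package; all names and addresses are hypothetical.
import dns.query
import dns.update

update = dns.update.Update("example.com")
# Replace the CNAME for app.example.com so it now targets the DR instance (60-second TTL).
update.replace("app", 60, "CNAME", "app-dr.example.com.")

response = dns.query.tcp(update, "10.0.0.53", timeout=10)
print(f"DNS update response code: {response.rcode()}")
```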

Managing the refresh or caching of those names can be a little tricky, but having clients reboot or run the ipconfig /flushdns command on Windows usually clears any cached entries. The same goes for server systems affected by a DR transition; they may need to refresh their own DNS cache, so the same steps may need to be followed on the server platform.
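A small sketch of that client-side step, using the ipconfig /flushdns command mentioned above and then confirming that the service name now resolves into the DR site's address range (the hostname and subnet prefix are hypothetical):

```python
# Flush the local Windows DNS cache, then confirm the service name resolves to the DR site.
# The hostname and subnet prefix are hypothetical placeholders.
import socket
import subprocess
import sys

SERVICE_NAME = "app.example.com"
DR_SUBNET_PREFIX = "10.2."

if sys.platform == "win32":
    subprocess.run(["ipconfig", "/flushdns"], check=True)

addresses = {info[4][0] for info in socket.getaddrinfo(SERVICE_NAME, None, socket.AF_INET)}
if all(addr.startswith(DR_SUBNET_PREFIX) for addr in addresses):
    print(f"{SERVICE_NAME} now points at the DR site: {sorted(addresses)}")
else:
    print(f"{SERVICE_NAME} still resolves outside the DR subnet: {sorted(addresses)}")
```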

#10: All things are possible for the small environment, too

The more robust DR configurations tend to present themselves naturally to the large enterprise, while small IT shops are at a resource disadvantage when it comes to architecting a comprehensive DR plan. The ideal DR model would be applicable to big and small environments alike, and all of the objectives would be within reach for the small organization. Technologies such as virtualization have been a real boon for small environments trying to achieve their DR objectives, and DR is frequently the cost justifier for the initial investments in storage and management software.


We've identified the top 10 things we'd like to see baked into an ideal disaster recovery model. Share your pain points and things you would like to change about your disaster recovery model... if you could.

About

Rick Vanover is a software strategy specialist for Veeam Software, based in Columbus, Ohio. Rick has years of IT experience and focuses on virtualization, Windows-based server administration, and system hardware.

10 comments
jefferyp2100

Here's a real no-brainer, courtesy of the Bellagio Easter Sunday 2004 disaster in Las Vegas: Don't put the fail-over server in the same room as the primary server! That shouldn't have to be said. To be ready for a disaster, the fail-over servers must be in a separate physical location, not at the same site and certainly not in the same server room.

robo_dev

Assuming that there is a business impact due to downtime, a DR plan that is underfunded or inferior is not an option. I don't agree with the argument of 'the dream DR plan'.....it's like an argument about 'your ideal parachute'. Your plan and the associated technologies either meet the requirements of the business, or they don't. If you have to cut corners, then it's your duty to make sure management knows that. It's great to have some nicely formatted document with stated recovery time objectives, but can you really hit those numbers? Many organizations I've seen live in a dream world where they think that with a bunch of tapes and an extra server or two they could recover in less than four hours from anything. They run some totally unrealistic annual test and continue in their delusion of competence. But when a real disaster strikes, they won't be ready, and the business impact will be severe. Either your plan flies, or it doesn't. If you've done the math, then you know exactly how much downtime your business can tolerate. From there your DR plan must be funded and built to meet that requirement.

cue.burn

While this may be necessary for some critical applications, not all applications need 100% performance in a true DR situation. Performance of 65% for a non-critical application is still much better than not working at all. However, providing 100% of the resources for a non-critical application in a DR situation can quickly become cost-prohibitive.

Craig_B

The list is fine; however, unless management fully backs it, you will not get the time/money to implement it, and/or the rules will be bent so they can keep the business going. Later, when a situation comes up that requires recovery, you will most likely get the heat for not being able to bring systems up or to full functionality, even though management made the decision not to have full DR in the first place.

jos

When all the "normal" things don't work: have intelligent people at hand. For many years I was the disaster recovery officer at a bank. About once a year it happened that the whole system stopped, and many times that happened at two o'clock in the morning :-(

B9Girl

"when a real disaster strikes, they won't be ready, and the business impact will be severe" This is very true. You just don't know how bad it will be until you've been through it. My home office was burglarized in broad daylight one month ago, and I'm still putting the pieces back into place (mostly as needed). One of my more "high-maintenance" clients declared I should be up and running the next day, which was completely unrealistic logistically, and emotionally out of the question. Stolen were my custom-built PC workstation and Mac laptop. Pre-burglary, my "disaster plan" was having RAID on my computer and an external backup drive. The flaw(s) in my plan: my backup drive was smaller than the drives on my system, so I didn't have a complete backup of my files (I had RAID, so who cares, right?). Also, my backup drive was sitting right on top of my computer. By chance, the thief left behind my external drive after cleaning me out of everything else.

Here's where the rule of threes comes into play, and I wish I had done more of this. Back up across three different media: RAID, a DVD archiving system (what thief wants to be burdened with carrying a bunch of DVDs?), and two external drives. I purchased an Apple "Time Capsule" as well. It's fantastic because we can hide the drive in a hard-to-pilfer area but still access it via our wireless network.

The last piece: don't skimp on insurance. If you have a home office, make sure your renter's or home owner's policy covers the actual equipment you have. I had a rider on my policy to cover the computers in my home office. Had I not done that, my policy would have only covered $1,000 worth of home office equipment. Without insurance, it would be a double financial hit -- both replacing expensive equipment and extreme work disruption. Insurance at least gives you the opportunity to get back to work faster.

Don't forget the human factor in disaster planning -- I was in complete shock for several days after the burglary. Not only was my regular work disrupted, I also had to deal with the police and inventory issues. In a natural disaster, people will certainly be in shock and possibly dealing with injuries, death, etc. Work is the last thing on their minds.

deborah.lupica

I agree with you that "either your plan flies or it doesn't," and the ideal way to make sure it does is to do live tests -- ones that simulate a disaster as closely as possible. Ideally, "pull the plug" on the data center. But I have found that the risk associated with this, and the impact of such a test on business users, have made it mostly unacceptable. We are working toward that, but if one could do this type of test at least every year, that would help make sure the plan flies. What is it they say? "An untested plan is worse than no plan at all!" Deborah L.

Granville

I have a TLA for you... CYA (Cover Your... er... behind). Make sure that if any management decision that may affect your DR is "verbalised," you follow up with an email or memo along the lines of "I understand from our conversation xyz. If this is not correct, would you please clarify in writing." Oh! And take at least three copies! ;-)

Photogenic Memory

It had to be something software-related that was scheduled to run that hosed the system? It's got to be. Were you able to fix it? Good thing it was at 2am and not 2pm, LOL!

b4real

Agreed, I am a big fan of regular conversions to ensure that the DR site works and keeps everyone sharp on the process.