Disaster Recovery

10 things that can go wrong with your data recovery plan

Your data recovery plan may seem ironclad, but don't count on it. Justin James has learned the hard way that there are many possible points of failure.

I've learned that when it comes to those "million-to-one chance it will fail" scenarios, things go wrong precisely when they can do the most damage. Data recovery plans are a great example. I've seen some really well-put-together plans fall apart due to mistakes both minor and major. Here are 10 things that can sink your data recovery plan -- and what to do to avert these problem scenarios.

1: Bad backups

When you are desperate to get your operation back online, nothing is worse than the sinking feeling you get when you discover that your backups are no good. In this age of 24/7 computing, it is often hard to get good backups. Lots of applications just do not cooperate well with backup software. Sometimes the backups themselves are stored improperly, which causes all sorts of issues. And of course, there are problems with overly complex backup applications, settings that do not work as expected, and hardware issues. All these factors conspire to produce backups that are not what we need when we need them. By monitoring your backup systems closely and testing them on a regular basis, you ensure that they will work when you need them most. And when they aren't working, you need to make fixing them a top priority.
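
One easy check to automate is backup freshness. Below is a minimal sketch, assuming backups land as files under a directory like /backups/nightly and that a run is "good" if something there is newer than roughly a day; the path and age threshold are placeholders for your own environment.

    #!/usr/bin/env python3
    """Alert if the newest backup file is missing or older than expected.

    Assumptions: backups land as files under BACKUP_DIR, and a run is "good"
    if at least one file is newer than MAX_AGE_HOURS. Adjust both to taste.
    """
    import os
    import sys
    import time

    BACKUP_DIR = "/backups/nightly"   # placeholder path
    MAX_AGE_HOURS = 26                # nightly job plus a little slack

    def newest_mtime(path):
        """Return the modification time of the newest file under path."""
        newest = 0.0
        for root, _dirs, files in os.walk(path):
            for name in files:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
        return newest

    latest = newest_mtime(BACKUP_DIR)
    age_hours = (time.time() - latest) / 3600 if latest else float("inf")
    if age_hours > MAX_AGE_HOURS:
        print(f"WARNING: newest backup is {age_hours:.1f} hours old")
        sys.exit(1)   # nonzero exit lets a scheduler or monitor raise an alert
    print(f"OK: newest backup is {age_hours:.1f} hours old")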

2: No way to restore

All the backups in the world aren't worth a hill of beans if the restore process requires a live CD or some other bootstrap medium that you do not have available. You should, of course, discover this in your dry runs. But you also need to make sure that the restore system is always handy. Putting a copy of it with the backups is a good idea.
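
A small script can also confirm that the restore media really is sitting next to the backups and hasn't been corrupted in transit. This is a sketch only; the ISO path and the companion checksum file are assumptions, not something your backup product necessarily provides.

    #!/usr/bin/env python3
    """Check that bootable restore media sits next to the backups and is intact.

    Assumptions: the restore ISO and a recorded SHA-256 (restore.iso.sha256)
    are copied into the backup share; both file names are placeholders.
    """
    import hashlib
    import pathlib
    import sys

    ISO = pathlib.Path("/backups/restore-media/restore.iso")             # placeholder
    EXPECTED = pathlib.Path("/backups/restore-media/restore.iso.sha256")

    def sha256(path, chunk=1 << 20):
        """Hash the file in chunks so large ISOs don't need to fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            while block := handle.read(chunk):
                digest.update(block)
        return digest.hexdigest()

    if not ISO.exists() or not EXPECTED.exists():
        sys.exit("Restore media or its checksum file is missing from the backup share")
    if sha256(ISO) != EXPECTED.read_text().split()[0]:
        sys.exit("Restore media checksum mismatch -- re-copy the ISO")
    print("Restore media present and checksum verified")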

3: Lack of a post-recovery testing plan

Ever restore a system, only to discover days or weeks later that there are continuing problems? I have, and it stinks. In the case of system or application issues, the root cause (like a virus) may be lurking in those backups. After you perform your restoration, you need to run two major types of tests: those that verify that the general systems and applications are back up to snuff and those that check that the specific issue that triggered the restoration is resolved. The former needs to be put into place, written up, published, and practiced long before it is needed. The latter is typically determined on the fly as the situation warrants.
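
The "general systems" half of that testing can start as a scripted checklist of services that must answer after a restore. Here's a minimal sketch; the host names and ports are placeholders for whatever "back up to snuff" means in your environment.

    #!/usr/bin/env python3
    """Post-restore smoke test: confirm core services answer on their ports."""
    import socket

    # Placeholder host/port pairs -- list whatever your environment considers essential.
    CHECKS = [
        ("mail01.example.internal", 25),     # SMTP
        ("sql01.example.internal", 1433),    # SQL Server
        ("intranet.example.internal", 443),  # SharePoint / web front end
    ]

    failures = []
    for host, port in CHECKS:
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"OK   {host}:{port}")
        except OSError as exc:
            print(f"FAIL {host}:{port} ({exc})")
            failures.append((host, port))

    raise SystemExit(1 if failures else 0)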

4: No hardware to recover to

Some people assume (or hope) that the disasters we recover from are software only (viruses, OS meltdowns, etc.). And their hardware purchases reflect it. The fact is, if you do not have a full system to restore to, one that matches the system you need to restore closely enough that a bare metal restore will work, you do not have a full recovery process. You have merely made a large gamble that your hardware never fails!

I understand completely how this happens; hardware is expensive, and it is difficult to justify buying two servers when you need one. That's one reason I like to buy servers in batches, so I can have one fully redundant spare that can substitute for many others. If I ever suspect that the original hardware is bad, I can move the workload to the spare server quickly to verify whether the issue really is the server hardware. Expensive? Not in comparison to the cost of downtime waiting for new servers or parts to be delivered if I don't have a spare.

5: Lack of essential components

There are certain essential components you should have on hand, "just in case." But I've seen a number of shops, especially some of the ones with tighter budgets, overlook these in their kit. Basic items you should always have on hand include the following (a simple inventory-check sketch appears after the list):

  • Spare network cables, at least a few of every length you currently use
  • Power cords
  • Hard drives of the size and types your servers need
  • Spare RAM chips of the size and types your servers need
  • Extra drive cables
  • Spare drive controller cards, if they are separate from the motherboards in your servers
  • Extra keyboard, mouse, and monitor
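
If you want to keep yourself honest about the spares cabinet, even a trivial script comparing what's on hand against the minimums you've decided to keep will do. The item names and counts below are examples only.

    #!/usr/bin/env python3
    """Compare on-hand spare counts against the minimums you choose to keep.

    The part names and numbers are examples; track what your servers actually
    use and update ON_HAND whenever a spare gets consumed.
    """
    MINIMUMS = {
        "network cable (3 ft)": 4,
        "power cord": 4,
        "hard drive (server spec)": 2,
        "RAM module (server spec)": 2,
        "drive cable": 2,
        "drive controller card": 1,
        "keyboard/mouse/monitor set": 1,
    }

    ON_HAND = {
        "network cable (3 ft)": 6,
        "power cord": 2,
        "hard drive (server spec)": 2,
        "RAM module (server spec)": 0,
        "drive cable": 3,
        "drive controller card": 1,
        "keyboard/mouse/monitor set": 1,
    }

    for item, minimum in MINIMUMS.items():
        have = ON_HAND.get(item, 0)
        status = "OK " if have >= minimum else "LOW"
        print(f"{status} {item}: {have} on hand (minimum {minimum})")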

6: Never did a dry-run

One of the most repeated but least followed pieces of advice is to practice your recovery plan in advance. There are lots of reasons why people skip this, but it usually boils down to a lack of time. The good news is it is not too hard or time-consuming to give your recovery plan a trial, especially if you have spare servers handy. Whatever the holdup is, work through it and test your recovery process.
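
A drill doesn't have to be elaborate. The skeleton below just times a restore into a scratch area and logs the result; the restore command itself is a placeholder, since it depends entirely on your backup product's command line.

    #!/usr/bin/env python3
    """Skeleton for a scheduled restore drill: time a restore into a scratch
    area and record the outcome. RESTORE_CMD is a placeholder -- substitute
    whatever your backup product's command-line restore looks like.
    """
    import subprocess
    import time

    RESTORE_CMD = ["/usr/local/bin/restore-latest", "--target", "/scratch/drill"]  # placeholder
    LOG = "/var/log/restore-drill.log"

    start = time.time()
    result = subprocess.run(RESTORE_CMD, capture_output=True, text=True)
    elapsed = time.time() - start

    with open(LOG, "a") as log:
        log.write(f"{time.ctime()} rc={result.returncode} elapsed={elapsed:.0f}s\n")

    print(f"Drill finished with return code {result.returncode} in {elapsed:.0f}s")
    print("Now run the post-recovery smoke tests against the scratch copy.")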

7: Unable to selectively restore from backups

It is really frustrating to need only one small file from a huge backup but to be forced to restore the entire backup just to pull out that file. As we shift to backing up virtual machines and not raw file systems, this is getting more common, too. Before you feel comfortable with your recovery plan, you should make sure that restoring individual files, even if they are within a virtual machine, will work. Otherwise, you can experience much more downtime than needed.
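
If the backup product can't do this natively, file-level recovery from a VM image is often still possible with separate tooling. As an illustration only, the sketch below shells out to virt-copy-out from the libguestfs tools, assuming they are installed; the image path and guest file path are made up.

    #!/usr/bin/env python3
    """Pull one file out of a backed-up VM disk image instead of restoring the
    whole machine. Assumes the libguestfs tools (virt-copy-out) are installed;
    the image path and guest file path below are placeholders.
    """
    import subprocess

    IMAGE = "/backups/vms/fileserver-latest.qcow2"   # placeholder image
    GUEST_PATH = "/home/finance/budget.xlsx"         # file inside the guest
    DEST = "/tmp/recovered"                          # local destination directory

    subprocess.run(
        ["virt-copy-out", "-a", IMAGE, GUEST_PATH, DEST],
        check=True,   # raise if the copy fails
    )
    print(f"Copied {GUEST_PATH} from {IMAGE} into {DEST}")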

8: Lack of depth in backups

Few of us have the unlimited budgets needed for every backup to be a unique snapshot that gets archived permanently. We need to rotate media on some sort of schedule. There is nothing wrong with that, as long as the schedule provides us with the depth and redundancy we need.

I keep three days of backups as "nearline" backups on a rotating basis. Once a week, I transfer a nearline backup to disk, and once a month one of those disks goes offsite permanently. In addition, I have the Exchange server do its own backup twice a day, which gets saved nearline on the same schedule. I also have SQL Server performing its own backups once a day, which get saved nearline and retained for 14 days. This enables my organization to get back online quickly; we restore the entire VM from a known-good spot and then use the Exchange or SQL Server backups to bring it up to date.
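
Whatever schedule you settle on, script the retention rules so rotation doesn't depend on someone remembering. This is a rough sketch matching the schedule above; the paths are placeholders, the deletions are commented out until you trust the selection, and the weekly and monthly disk copies are assumed to be handled elsewhere.

    #!/usr/bin/env python3
    """Retention sketch: keep the three newest nearline backup sets and prune
    SQL Server dumps older than 14 days. Paths are placeholders.
    """
    import os
    import pathlib
    import time

    NEARLINE = pathlib.Path("/backups/nearline")   # one directory per backup run
    SQL_DUMPS = pathlib.Path("/backups/sql")
    KEEP_NEARLINE = 3
    SQL_RETENTION_DAYS = 14

    # Keep only the three newest nearline sets.
    runs = sorted(NEARLINE.iterdir(), key=os.path.getmtime, reverse=True)
    for old_run in runs[KEEP_NEARLINE:]:
        print(f"would prune nearline set {old_run}")
        # shutil.rmtree(old_run)  # uncomment (and import shutil) once you trust the selection

    # Drop SQL Server dump files older than the retention window.
    cutoff = time.time() - SQL_RETENTION_DAYS * 86400
    for dump in SQL_DUMPS.glob("*.bak"):
        if dump.stat().st_mtime < cutoff:
            print(f"would prune SQL dump {dump}")
            # dump.unlink()       # uncomment once you trust the selection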

9: Offsite backups are too far offsite

There's this underlying assumption that if the offsite backups are ever needed, it will take time to be ready to use them anyway, so it does not matter if they can't be easily accessed. Well, that's usually true, but not always. Sometimes, you absolutely need those offsite backups, and when you do, you will need them right away. Online backups are a convenient alternative to sending physical media offsite, but just remember that your connection will feel mighty slow if you need to download a massive backup set just to pull a few files out of it. Make sure that whatever you use for offsite backups, you can access them easily.
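
If your offsite copies live in online storage, make sure you can pull individual objects rather than the entire set. The sketch below assumes S3-compatible storage reachable with boto3; the bucket and key names are invented for illustration.

    #!/usr/bin/env python3
    """Fetch just the objects you need from an online backup instead of
    downloading the whole set. Bucket and key names are placeholders.
    """
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-offsite-backups"

    NEEDED = [
        "2011-06/fileserver/home/finance/budget.xlsx",
        "2011-06/fileserver/home/finance/forecast.xlsx",
    ]

    for key in NEEDED:
        local_name = key.rsplit("/", 1)[-1]
        print(f"downloading {key} -> {local_name}")
        s3.download_file(BUCKET, key, local_name)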

10: No documentation in print

It's important to have your restoration process documented. But you know what folks often forget? If your systems are down, you may not be able to access your files! For example, we keep a SharePoint site for all our network documentation. But if the SQL Server is toast, how are we going to get to SharePoint? That's why you need to keep printed copies of the documentation you might need, preferably near the physical media (along with any live CDs or other restore materials). And you need to keep the printed copies up to date. One reason I like putting this material in SharePoint is that I can subscribe to an RSS feed of the documents list and get notified when any items change.
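
The change notification doesn't have to stay manual, either; a short script can watch the feed and tell you when the printed copies need refreshing. The sketch below assumes the third-party feedparser library, and the feed URL is a stand-in for your documentation list's RSS feed.

    #!/usr/bin/env python3
    """Watch the RSS feed of a documentation library and flag new or changed
    items so the printed copies can be refreshed. FEED_URL is a placeholder.
    """
    import json
    import pathlib

    import feedparser  # third-party: pip install feedparser

    FEED_URL = "https://intranet.example.internal/docs/listfeed.xml"  # placeholder
    STATE = pathlib.Path("doc_feed_state.json")

    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    feed = feedparser.parse(FEED_URL)

    changed = []
    for entry in feed.entries:
        stamp = entry.get("updated", "") or entry.get("published", "")
        if seen.get(entry.link) != stamp:
            changed.append(entry.title)
            seen[entry.link] = stamp

    STATE.write_text(json.dumps(seen))
    if changed:
        print("Reprint needed -- items changed since last check:")
        for title in changed:
            print(f"  - {title}")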

Been burned?

Have you had a data recovery operation blow up on you because of some small mistake or major oversight? Share your worst recovery nightmares with TechRepublic members.

About

Justin James is the Lead Architect for Conigent.

10 comments
pgit

I admit being a bit lax in the documentation. My assumption is I will always be available and I will always be doing any recovery work. I figure this will align the cosmos to keep me alive and well. In fact it's my only "insurance plan." It's worked so far. :) I'll take this article as a kick in the trousers to get writing that plan... BTW I actually do use the "asteroid scenario" as the ultimate fail point when explaining my work to clients. My off site is on average 50 miles from the original data and hardware, so it'd take a perfectly placed car sized asteroid or Tsar Bomba to foil my systems. In that case plan B is whip out the heirloom seeds, water filter, short wave radio and attempt to make contact with any survivors. I do NOT expect any of them will be asking about any data recovery.

a.portman

A St. Louis area company had a backup data center at #2 WTC. All they will say about their new backup site is that it is more than 150 miles from St. Louis. They measure transactions in the thousands of dollars a second. Down time is not an option.

a.portman

The DR plan should be tested by someone who did not write it. I will remember that you need to do "this" even though I did not write it down; someone else will not. Your plan may need to be executed by someone else if the network administrator was in the data center and is now unavailable. The other thing is: what would you do if you have your equipment but lose key personnel overnight? I have a friend who learned early one morning that his network administrator had been a bad boy. Net Admin was in custody, and the FBI wanted every computer he touched, i.e., all of the servers. The FBI took images of the servers and Net Admin's workstation and laptop. They were down for a day. Net Admin? He won't be touching a computer ever again.

ian

I was a BCP/DRP coordinator for the last 10 years of corporate life in a multi-national company. We had three main groups (Wins, Unix, MF) responsible for their own backup, but all under the umbrella of the DRP/BCP group. My area was MF, but we worked closely with the other two (they had server rooms within the datacenter) and ran DRP/BCP tests offsite every six months. From my experience, I can only reiterate your points. Business continuity (preferred over disaster recovery) means test, test, test. Is the data recoverable, is the data current, do I have the means to recover?
#4. Whether you have the hardware at hand or not, you should at least have configurations documented as well as where to obtain that hardware. This is certainly true if you need to move location, and it isn't limited to IT hardware.
#6. Not only do a dry run, but document what you did (whether it worked or not) and the time each step took. You will find with each dry run that the procedure gets easier. Make that document into a script, and have one for each scenario. When your client asks for ETR, you can be quite accurate in your answer.
#9. 30 miles is the minimum distance for offsite. Anything closer and you could be in the same disaster zone. Direction of offsite storage is also a factor. Are you and your storage in the same line for tornadoes or hurricanes?
#10. If not printed, at least available on external storage that you can attach to a laptop. Add a printer here too.

HAL 9000

When the floods came through, it went the way of everything else in the area: under 20 feet of water and totally useless. It was handy, but of no more use than if it had been on another planet. At the time we didn't get wet, but it brought home the problem with the offsite backup the Boss was happy paying for. :D Col

Charles Bundy

Distance offsite is proportional to the risk at hand as well as to ease of recovery. We were required to do "offsite" storage, and at the time my priority tree had "fire at the datacenter building" as most probable, with "asteroid strike" a close second. :) In the latter case I would not be around to worry about tapes, and in the former I determined that a building across campus (~1200 feet) was "offsite."

tomtomk

A good idea, obviously, is to have a recovery plan. All people necessary for the successful implementation of that plan need to have a copy. The plan should include contact numbers for all critical people both inside and outside the organisation, along with the responsibilities of each person. The DR/BC/Recovery plan could be saved on a USB key, with all necessary people having one.

ian

Good point about having someone else proofread scripts and documentation. We used to take junior ops to Sterling Forest, overseen by a senior op, and have them run recovery. It was good experience for the juniors, good for the seniors because you learn so much more from teaching and leading, and it was good for our plan too, which was always evolving. Something else most people don't realise: depending on the disaster, you may need some new personnel. If a disaster affects your family as well as work, where is your allegiance?

pgit

That's a great idea... I could even put scripts on there with instructions for running them. What a simple, elegant idea, thanks!

Charles Bundy

Unless I'm in public safety or national defense, Family will always be my number one priority.