Your data recovery plan may seem ironclad, but don't count on it. Justin James has learned the hard way that there are many possible points of failure.
I've learned that when it comes to those "million-to-one chance it will fail" scenarios, things will go wrong when they will do the most damage. Data recovery plans are a great example. I've seen some really well put together plans fall apart due to mistakes both minor and major. Here are 10 things that can sink your data recovery plan -- and what to do to avert these problem scenarios.
1: Bad backups
When you are desperate to get your operation back online, nothing is worse than the sinking feeling you get when you discover that your backups are no good. In this day and age of 24/7 computing, it is often hard to get good backups. Lots of applications just do not seem to cooperate well with backup software. Sometimes, the backups themselves are stored improperly, which causes all sorts of issues. And of course, there are problems with overly complex backup applications, settings that do not work as expected, and hardware issues. All these factors conspire to produce backups that are not what we need when we need them. By monitoring your backup systems closely and testing them on a regular basis, you can catch these failures before a disaster does. And when the backups aren't working, you need to make fixing them a top priority.
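One small piece of that monitoring can be automated. The sketch below assumes you record a SHA-256 checksum for each backup file when it is written and keep those checksums in a manifest; the function names and the manifest layout are invented here for illustration and do not come from any particular backup product. A matching checksum only shows that the file has not changed since it was written, so this complements a trial restore rather than replacing one.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest: dict[str, str], backup_dir: Path) -> list[str]:
    """Compare each file named in the manifest against its recorded checksum.

    `manifest` maps a file name to the hex digest recorded at backup time
    (a hypothetical layout). Returns a list of problems; an empty list means
    every file was present and unchanged.
    """
    problems = []
    for name, expected in manifest.items():
        candidate = backup_dir / name
        if not candidate.is_file():
            problems.append(f"{name}: missing")
        elif sha256_of(candidate) != expected:
            problems.append(f"{name}: checksum mismatch")
    return problems
```

Run on a schedule, a non-empty result is the signal to treat the backup system as broken and fix it first.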
2: No way to restore
All the backups in the world aren't worth a hill of beans if they require you to have a live CD or some other way of bootstrapping the restore process, and you do not have that available. You should, of course, discover this in your dry-runs. But you also need to make sure that the restore system is always handy. Putting a copy of it with the backups is a good idea.
3: Lack of a post-recovery testing plan
Ever restore a system, only to discover days or weeks later that there are continuing problems? I have, and it stinks. In the case of system or application issues, the root cause (like a virus) may be lurking in those backups. After you perform your restoration, you need to perform two major types of tests: those that verify that the general systems and applications are back up to snuff and those that check that the specific issue that triggered a restoration is resolved. The former needs to be put into place, written up, and published and practiced long before it is needed. The latter is typically determined on the fly as the situation warrants.
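The first kind of test, the standing checklist, lends itself to a script that can be rehearsed ahead of time. What follows is only a minimal sketch of a checklist runner: `run_checks` and the idea of expressing each check as a small function returning True or False are assumptions made for illustration, and the actual checks would depend entirely on your own systems.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every named check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that crashes counts as a failure, not a pass.
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

The standing checks would live in one dictionary that is written up and practiced in advance; the incident-specific checks (for example, confirming that the virus that forced the restore is gone) can be added to a second dictionary on the day and run the same way.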
4: No hardware to recover to
Some people assume (or hope) that the disasters we recover from are software only (viruses, OS meltdowns, etc.). And their hardware purchases reflect it. The fact is, if you do not have a full system to restore to, one that matches the system you need to restore closely enough that a bare metal restore will work, you do not have a full recovery process. You have merely made a large gamble that your hardware never fails!
I understand completely how this happens; hardware is expensive and it is difficult to justify buying two servers when you need one. That's one reason why I like to buy servers in batches, so I can have one fully redundant spare that can substitute for many others. If I ever suspect that the original hardware is bad, I can move the workload to the spare server quickly to verify that the issue is related to the server hardware. Expensive? Not in comparison to the cost of downtime waiting for new servers or parts to be delivered if I don't have a spare.
5: Lack of essential components
There are certain essential components you should have on hand, "just in case." But I've seen a number of shops, especially some of the ones with tighter budgets, overlook these in their kit. Basic items you should always have on hand include:
- Spare network cables, at least a few of every length you currently use
- Power cords
- Hard drives of the size and types your servers need
- Spare RAM chips of the size and types your servers need
- Extra drive cables
- Spare drive controller cards, if they are separate from the motherboards in your servers
- Extra keyboard, mouse, and monitor
6: Never did a dry-run
One of the most repeated but least followed pieces of advice is to practice your recovery plan in advance. There are lots of reasons why people skip this, but it usually boils down to a lack of time. The good news is it is not too hard or time-consuming to give your recovery plan a trial, especially if you have spare servers handy. Whatever the holdup is, work through it and test your recovery process.
7: Unable to selectively restore from backups
It is really frustrating to need only one small file from a huge backup but to be forced to restore the entire backup just to pull out that file. As we shift to backing up virtual machines and not raw file systems, this is getting more common, too. Before you feel comfortable with your recovery plan, you should make sure that restoring individual files, even if they are within a virtual machine, will work. Otherwise, you can experience much more downtime than needed.
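As an illustration of what a selective restore looks like at the file level, here is a sketch that pulls a single file out of a tar archive using Python's standard tarfile module. It assumes the backup is a plain or compressed tar file, which may not match your backup software at all, and `restore_one_file` is a name made up for this example. Virtual machine images need their own tooling and are not covered here.

```python
import shutil
import tarfile
from pathlib import Path


def restore_one_file(archive_path: Path, member_name: str, dest: Path) -> Path:
    """Copy one regular file out of a tar archive without unpacking the rest."""
    # The default mode detects gzip, bzip2, or xz compression automatically.
    with tarfile.open(archive_path) as archive:
        member = archive.getmember(member_name)  # raises KeyError if absent
        source = archive.extractfile(member)
        if source is None:
            raise ValueError(f"{member_name} is not a regular file")
        dest.parent.mkdir(parents=True, exist_ok=True)
        with source, dest.open("wb") as target:
            shutil.copyfileobj(source, target)
    return dest
```

Only the requested file lands on disk, but a tar archive has no index, so the archive still has to be read through to find it. That is exactly the kind of cost worth measuring in a test before you depend on it.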
8: Lack of depth in backups
Few of us have the unlimited budgets needed for every backup to be a unique snapshot that gets archived permanently. We need to rotate media on some sort of schedule. There is nothing wrong with that, as long as the schedule provides us with the depth and redundancy we need.
I keep three days of backups as "nearline" backups on a rotating basis. Once a week, I transfer a nearline backup to disk, and once a month one of those disks goes offsite permanently. In addition, I have the Exchange server do its own backup twice a day, which gets saved in nearline on the same schedule. I also have SQL Server performing its own backups once a day, which get saved nearline and retained for 14 days. This enables my organization to get back online quickly; we restore the entire VM from a known-good spot and then use the Exchange or SQL Server backups to bring it up to date.
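The retention windows in a schedule like this are easy to write down as code, which makes it harder for a rotation script to quietly delete more than intended. The sketch below uses the 3-day and 14-day windows mentioned above; the labels "vm" and "sql", the function name, and the choice to treat a backup as expired once it is at least that many days old are all assumptions for illustration. It deliberately ignores the weekly and monthly copies, which would need their own rules.

```python
from datetime import date, timedelta

# Retention windows in days. The numbers come from the schedule described
# above; the "vm" and "sql" keys are labels invented for this sketch.
RETENTION_DAYS = {"vm": 3, "sql": 14}


def expired_backups(
    backups: list[tuple[str, date]], today: date
) -> list[tuple[str, date]]:
    """Return the (kind, date) entries that have aged out of nearline storage."""
    expired = []
    for kind, taken_on in backups:
        cutoff = today - timedelta(days=RETENTION_DAYS[kind])
        # Assumption: a backup exactly RETENTION_DAYS old is already expired.
        if taken_on <= cutoff:
            expired.append((kind, taken_on))
    return expired
```

Listing what would be removed, rather than removing it directly, leaves room for a person or a second check to confirm that enough depth remains.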
9: Offsite backups are too far offsite
There's this underlying assumption that if the offsite backups are ever needed, it will take time to be ready to use them anyway, so it does not matter if they can't be easily accessed. Well, that's usually true, but not always. Sometimes, you absolutely need those offsite backups, and when you do, you will need them right away. Online backups are a convenient alternative to sending physical media offsite, but just remember that your connection will feel mighty slow if you need to download a massive backup set just to pull a few files out of it. Make sure that whatever you use for offsite backups, you can access them easily.
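It is worth doing the arithmetic on an online restore before you need one. The following back-of-the-envelope sketch assumes decimal gigabytes, a link speed quoted in megabits per second, and a made-up 80 percent efficiency factor to allow for protocol overhead and contention; real throughput from a backup provider may be quite different.

```python
def download_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to download `size_gb` gigabytes over a link.

    `efficiency` is an assumed fraction of the nominal link speed that is
    actually usable; 0.8 is an illustrative guess, not a measured value.
    """
    bits = size_gb * 8 * 1000**3                   # decimal gigabytes to bits
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600    # seconds to hours


# A 500 GB backup set over a 100 Mbps connection:
print(f"{download_hours(500, 100):.1f} hours")  # prints "13.9 hours"
```

Under those assumptions, fetching a 500 GB set takes most of a day, even if all you wanted was a handful of files from it.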
10: No documentation in print
It's important to have your restoration process documented. But you know what folks often forget? If your systems are down, you may not be able to access your files! For example, we keep a SharePoint site for all our network documentation. But if the SQL Server is toast, how are we going to get to SharePoint? That's why you need to keep printed copies of the documentation you might need, preferably near the physical media (along with any live CDs or other restore materials). And you need to keep the printed copies up to date. One reason I like putting this material in SharePoint is that I can subscribe to an RSS feed of the documents list and get notified when any items change.
Have you had a data recovery operation blow up on you because of some small mistake or major oversight? Share your worst recovery nightmares with TechRepublic members.