
Protect the Network: What can happen when you don't test that DR plan

Lessons learned from a disaster event

Of all the disruptions that your client may face—an attack by a skilled hacker, an employee breaching sensitive data, an ex-contractor copying a client list—the shutdown of a production system can be the most harrowing. For your client, it’s an obvious source of stress and a severe disruption of the business: Customers can’t enter orders, accounts receivable aren’t processed, and accounts payable can’t run.

While there are numerous disaster recovery plans on the Internet that can be used to structure a recovery effort, what is often overlooked is the plan’s functional use. In other words, does it work when it really has to? Does it have a detailed step-by-step plan to cover the recovery of all critical files? All good recovery plans should, but are the steps in your plan up to date, timely, and accurate?

Here’s a situation that happened recently with my company and the lessons we learned—the hard way.

Bad timing
A few months ago, a disk drive on a server was giving the guys in tech support problems. So they did what was written in the manual and replaced the bad drive with another one. The system had multiple servers. Everything ran well for a day or so, and then the disk that was replaced suddenly went bad. It is hard to believe, but the crash happened on the last day of the week, on the last day of the month, on the last day of the fiscal year. We couldn’t close the books for month-end or year-end until the systems were fixed.

We found out later that the disk drive had not been seated properly. Unfortunately, 50 percent of the day’s production workload was using this drive. Everything that was running was corrupted. We went back to our disaster recovery plan and followed procedures. It took us several days to get everything back up and running. What took so long? Our disaster recovery plan was thorough, but it had a few big flaws that could have been very costly.

"It’s on the spreadsheet"
Our company’s disaster recovery plan included guidelines for all types of scenarios. For example, the plan had 75 statements about what to do about outside hackers. However, it included only one line about the financial system recovery: “If financial systems crash, use the Excel spreadsheet to recover.”

Unfortunately, the spreadsheet hadn't been updated in nine months. We also found that the spreadsheet had references to people who had left the company six months earlier. To make matters worse, the recovery programs used to get the files reorganized had been stored in one of the programmer’s libraries. As it turns out, he had left the company and that library had been deleted.

Finally, we discovered that while the recovery procedures existed in writing, the programs they executed had been deleted. The recovery programs had to be rewritten, which took more than two days before the system was back up.

Problems with the recovery document
Although our company has a very good Internet and intranet site, the disaster recovery plan focused on the Internet side—the hackers and crackers—and didn’t address the legacy systems. As mentioned before, the recovery for the legacy system was an Excel spreadsheet. This is what the spreadsheet contained:
  • Listings of all financial production system files
  • A determination of whether a file was critical
  • The update schedule
  • The name of the person responsible for the file
  • What was necessary to fix the file
  • The estimated time of repair
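The spreadsheet’s columns map naturally onto a simple record. As a minimal sketch of that inventory structure (the field names and the sample values are my own, not taken from the actual spreadsheet):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecoveryEntry:
    """One row of the recovery inventory (hypothetical field names)."""
    file_name: str                 # production file, e.g. "AR_DAILY"
    is_critical: Optional[bool]    # None = never classified either way
    update_schedule: str           # "daily", "weekly", or "monthly"
    owner: str                     # person responsible for the file
    fix_procedure: str             # what is necessary to fix the file
    est_repair_minutes: int        # estimated time of repair

# Example row (illustrative only):
entry = RecoveryEntry("AR_DAILY", True, "daily", "jsmith",
                      "run rebuild script, then rebuild indexes", 20)
```

Keeping the inventory in a structured form like this, rather than free-form spreadsheet cells, makes the staleness checks described below possible.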

On the surface, this list seems relatively complete. It covers all of the steps in our recovery document process. It even passed an outside audit of our recovery procedures. But a closer look revealed plenty of holes. Here’s a breakdown:

Listings of all financial production system files
  • All were listed except for those added since the last update, six months prior.
  • All were listed except the files removed since the last update.
  • Files were listed that were there but that were no longer used by any system.
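Stale-list problems like these can be caught mechanically by diffing the documented inventory against what is actually in production. A sketch, assuming both are available as lists of file names (the sample names are hypothetical):

```python
def reconcile(documented, actual):
    """Compare the recovery inventory against the files actually in production."""
    documented, actual = set(documented), set(actual)
    return {
        "missing_from_plan": actual - documented,   # added since the last update
        "no_longer_exists": documented - actual,    # removed since the last update
    }

report = reconcile(
    documented=["AR_DAILY", "GL_MONTHLY", "OLD_PAYROLL"],
    actual=["AR_DAILY", "GL_MONTHLY", "AP_WEEKLY"],
)
# report["missing_from_plan"] -> {"AP_WEEKLY"}
# report["no_longer_exists"]  -> {"OLD_PAYROLL"}
```

Catching files that still exist but are no longer used by any system would need usage data as well, but even this simple diff, run on a schedule, would have flagged most of the gaps above.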

A determination of whether a file was critical
  • Some files were incorrectly identified as not critical. This meant that they were restored from backups and not rebuilt. This caused the loss of some daily sales data.
  • Some noncritical files were labeled as critical. This caused them to be rebuilt, which took hours to do and was unnecessary.
  • Some of the files weren’t identified as critical or noncritical at all. For these, we couldn’t tell whether treating them as critical would create more problems than ignoring them.

The update schedule
The files were properly identified, but there was no information on how to rebuild the monthly files, just the daily and weekly files. We rebuilt the monthly files by writing new programs to roll up the weekly files into the monthly files. The monthly files are used for high-level reporting and, luckily, detail-to-the-dollar wasn’t needed.
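Rolling weekly files up into a monthly file is, at its core, a keyed aggregation. A minimal sketch of the idea (the data structures are assumed for illustration, not the actual programs we wrote):

```python
from collections import defaultdict

def roll_up(weekly_files):
    """Sum weekly totals per account into a single monthly view."""
    monthly = defaultdict(float)
    for week in weekly_files:              # each week: {account: amount}
        for account, amount in week.items():
            monthly[account] += amount
    return dict(monthly)

# Three weeks of (hypothetical) account totals:
weeks = [{"4010": 100.0, "4020": 50.0},
         {"4010": 200.0},
         {"4020": 25.0}]
# roll_up(weeks) -> {"4010": 300.0, "4020": 75.0}
```

A roll-up like this is adequate for high-level reporting, which is why we got away with it; it would not reproduce transaction-level detail.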

The name of the person responsible for the file
  • Some of the staff named on the list had been terminated.
  • New staff on the list had no idea how to restore the files for which they were responsible.
  • Some of the more complicated fixes were stored in libraries of terminated staff.

What was necessary to fix the file
  • When we did have programs to rebuild some files, we didn’t have documentation on how to do this. An assumption was made that the responsible person would know what to do.
  • The restore procedures we found didn’t cover all of the files that were corrupted.
  • Some of the files had indexes that had to be rebuilt. This wasn’t mentioned in the restore procedures.

The estimated time of repair
  • The times that were written didn’t take into account the extra volume of data we had collected.
  • Some files that were supposed to take 20 minutes to restore sometimes took an hour or more.
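A stale time estimate can at least be rescaled for data growth. A rough sketch, assuming restore time grows roughly linearly with file size (a simplification, but better than a years-old number):

```python
def updated_estimate(old_minutes, size_at_estimate_gb, current_size_gb):
    """Linearly rescale a documented repair time by data growth (rough model)."""
    return old_minutes * (current_size_gb / size_at_estimate_gb)

# A file documented at 20 minutes when it held 5 GB, now holding 15 GB:
# updated_estimate(20, 5, 15) -> 60.0 minutes
```

That is exactly the kind of drift we saw: a documented 20-minute restore stretching to an hour or more as volume tripled.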

Timely, up to date, and accurate
Following this planning failure, I was faced with the task of getting all of the recovery processes redocumented and making sure that the programs that restored files actually existed and worked.

Our planning problem is a common one with disaster recovery procedures. I found a solution in the "Business Continuity Strategies" guidebook on the ennovate inc. Web site. The article points out that one of the steps of disaster recovery is to ensure that the written recovery procedures are up to date and accurate. The document we kept to restore the files was very simple and easy to use, but it had not been looked at in months. From a management standpoint, we could say that we were covered because we did have a checklist that went over the files that needed to be restored. The mistake was assuming that what was contained in the recovery document was accurate. This is where we fell short.

What I am doing now
I am currently in the process of getting this recovery process fixed. Even though I am working on only a small piece of the disaster recovery process, the time is worth it. Here’s how we’re correcting the problem:
  • I have taken the Excel spreadsheet and placed it on an intranet Web site that can be accessed by the systems personnel responsible for each piece of the system. Access is granted through their normal account logon. Users can make changes to a recovery procedure only if they are on the team that would have to rebuild that system if it crashed.
  • An easily understandable set of general recovery procedures has been written. These are step-by-step instructions on how to do the actual recovery process. This allows a less-experienced member of the team to fix a specific file by executing a list of instructions one step at a time. At the end of the instructions, if all of the steps are completed correctly, then the file is considered correct.
  • The specific file-recovery steps have instructions that are linked directly to recovery scripts. For example, if a specific file is corrupted, the actual recovery script to fix that file can be executed from the intranet we have developed. This gives a person access to a script to fix the problem that has been written by experienced programmers who have already verified that the script works correctly. These scripts have also been tested at a noncritical time to ensure they work.
  • Any special recovery programs or scripts are linked directly to the file.
  • The manager of the system, not a programmer, is responsible for the individual files.
  • With every new program or file change in a production system, the programmer has to log on to our recovery system and make a note as to any impact this will have on recovery. This is very easy to do since the system is now online.
  • This piece of the recovery plan is now printed by the administrative assistant monthly and given to the VP of IT for his records.
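The step-by-step procedures above lend themselves to a simple runner: execute each documented step in order, and only if every step succeeds is the file considered correct. A sketch with hypothetical step names (the real steps would invoke the verified recovery scripts):

```python
def run_recovery(steps):
    """Execute recovery steps in order; the file is 'correct' only if all pass."""
    for name, action in steps:
        if not action():                  # each action returns True on success
            return False, name            # stop at the first failed step
    return True, None

# Illustrative steps; in practice each lambda would call a tested script.
steps = [
    ("restore from backup", lambda: True),
    ("rebuild indexes",     lambda: True),
    ("verify record count", lambda: True),
]
ok, failed_at = run_recovery(steps)       # (True, None) when every step passes
```

Returning the name of the failed step gives a less-experienced team member an exact place to escalate from, which is the point of the instruction-list approach.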

Conclusion
We computed that the three days we were down prevented us from entering sales at $30 million a day. We didn’t lose the revenue; we just didn’t have use of the money for three days, we couldn’t fill orders for three days, and employees sat idle for three days. Take 30 minutes, check your recovery plan, and determine the value of what is written.