Last week, I was having dinner in a popular chain restaurant—you'd recognize the name—and noticed that our table's server was particularly agitated. When we ordered, he frantically wrote down everything we said twice, making us repeat ourselves. I watched him walk past the terminal near the kitchen where the servers normally entered orders into the house computer system, hand a slip to someone in the kitchen, and then frantically scribble a third copy of our order.
Later, there were harsh words in the kitchen, loud enough to carry into the dining room. Intrigued, I asked our server what was going on. The computer system had crashed, it seemed, and management had given the servers contradictory instructions about the order slips they needed to fill out. Making matters worse, handfuls of specially numbered slips had been given to each server, and only after an hour and a half of activity were the servers told that these numbered originals had to be accounted for—the number itself was not sufficient, for some reason—and that their pay would be docked if they couldn't produce them. Servers had to root through the garbage to save their paychecks. Service was slow, and the servers undeservedly suffered for it (we diners tend to take it out on the server, even when the slowness is the fault of management or kitchen staff). By the time the smoke had cleared, four of them had been fired or had quit that night.
This example will certainly seem simplistic when set beside your own disaster recovery tales, but its simplicity speaks to the most often-neglected factor in disaster recovery planning: accountability for human failure in the chain.
The price of passing the buck
In the era of Enron, we no longer accept that a certain amount of politics and managerial malfeasance is simply bundled into the cost of doing business. We should tolerate it even less in disaster recovery, which is not simply a matter of moving assets from here to there, but can be survival-critical.
Put simply, your disaster recovery system must be scapegoat-proof. When an employee at any level, from the lowliest assistant all the way up to the CIO, conceals his own failure to act as part of a disaster recovery plan or (worse yet) redirects blame to an individual who acted properly, it does more than hide incompetence and damage morale. Your company's ability to function, and the integrity of its most immediate assets, are placed at risk for the most disingenuous of reasons.
Your disaster recovery plan, then, must be as impervious as possible to politics, scapegoating, and buck-passing. If a human failure occurs in the process and goes uncaught because an individual was able to hide it, then your recovery plan is insufficient. You must have objective, reliable means of enforcing accountability among the human participants in the system.
Account for accountability
Most disaster recovery plans assume that all participants share a common determination to see the crisis through. Such plans focus on technical preparations and rapidly executed procedures intended to restore operations. It never occurs to their authors that there may be people in the loop who are less than dedicated to the cause, or who will behave unscrupulously if they botch their part. Looking back at the restaurant incident, it's a fair bet that whoever set up the recovery procedure gave little thought to the vulnerability of the food servers or the discretionary powers of the floor managers.
It's important, then, to be certain that all human decisions and tasks in a disaster recovery process be documented and auditable for later evaluation. Moreover, this documentation process needs to effectively hold all parties accountable in the long run, such that blame for a mistake or performance failure cannot be shifted from one person to another.
How do you establish objective, auditable accountability? Here are some ways you might consider.
- Specify all tasks objectively and set up a documentation mechanism for each. Often, in a disaster recovery plan, a process is simply placed under a department or manager's jurisdiction. There is no specification in the plan itself of individual recovery chores, those accountable for their completion, or provision made for documenting the completion of individual tasks. Take these extra steps! When mistakes are made, you'll be able to unambiguously trace the problem and reinforce your plan.
- Publish and distribute the entire plan, including the roles of all parties regardless of rank. When you know what's expected of you, and you know that everyone else knows it too, you're far more likely to deliver. This egalitarian approach keeps everyone on the same page. If you're the CIO, set a good example. If you're not, get the CIO's buy-in.
- Establish explicit human-failure contingencies throughout the plan. A good plan is self-aware; it knows that it, too, can fail. Fallback procedures should be in place for partial system recovery if full system recovery turns out to be impossible or protracted. The same principle applies to the human decisions and human tasks in the plan. If a manager tasked with a portion of the recovery fails to accomplish it, the plan should specify a fallback procedure. And because the plan is published, every participant knows the consequences of any other participant's failure to come through: another built-in incentive to keep the ball moving.
Leaving a trail
The idea, then, is to capture human events as objectively and reliably as digital ones, and it's easy enough to do. Have no fear that you'll have to shell out yet another small fortune for this extra layer of documentation; you probably have suitable tools lying around. Here are a few improvisational suggestions.
- Make your disaster log a database, and embed triggers to report entries. Having people report in as the steps of the recovery plan are executed is all well and good (some tasks are so time-critical that phone or in-person contact is essential), but requiring that every completed task be documented removes all ambiguity and narrows accountability. E-mail is not enough. A paper form is not enough. If your disaster log is a database, even a simple, temporary SQL-based utility, you can set triggers that are task-specific, user-specific, or both, and kick off notifications for follow-on recovery steps directly from the reporting of completed tasks. If employees know that their task sets other tasks in motion, and that only their timely report of completion to the disaster log keeps the recovery moving, they're going to be diligent, and their actions will be well documented.
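A minimal sketch of the idea, using SQLite as the "simple, temporary SQL-based utility" (table names, column names, and the task itself are illustrative assumptions, not a prescription):

```python
# Sketch of a disaster log as a small database with a trigger.
# All names here (recovery_tasks, notifications, the sample task)
# are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recovery_tasks (
    task_id  INTEGER PRIMARY KEY,
    owner    TEXT NOT NULL,   -- the accountable person
    step     TEXT NOT NULL,   -- what must be done
    done_at  TEXT             -- NULL until reported complete
);
CREATE TABLE notifications (
    note_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id  INTEGER NOT NULL,
    message  TEXT NOT NULL
);
-- Trigger: when a task is reported complete, automatically queue
-- a notification so the follow-on recovery step is set in motion.
CREATE TRIGGER on_task_done
AFTER UPDATE OF done_at ON recovery_tasks
WHEN NEW.done_at IS NOT NULL
BEGIN
    INSERT INTO notifications (task_id, message)
    VALUES (NEW.task_id,
            'Task ' || NEW.step || ' completed by ' || NEW.owner);
END;
""")

# A participant reports completion; the trigger does the rest.
conn.execute("INSERT INTO recovery_tasks VALUES (1, 'alice', 'restore-db', NULL)")
conn.execute("UPDATE recovery_tasks SET done_at = datetime('now') WHERE task_id = 1")
rows = conn.execute("SELECT message FROM notifications").fetchall()
print(rows[0][0])  # the trigger recorded who completed what
```

The point of the trigger is that the notification is a side effect of the log entry itself: nobody can claim the task was done without leaving a task-specific, user-specific record behind.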
- Got enterprise architecture, or application integration? Piggyback on its messaging system. When a crash is system-specific rather than universal, you can embed the recovery procedure as a workflow on a redundant server, or on a server known to be robust. If specific systems crash but your network prevails, you can do recovery tasking and reporting over your enterprise messaging system. This gives you all the benefits mentioned above and can make your recovery process much faster.
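The workflow pattern can be sketched in a few lines. Here an in-process queue stands in for the enterprise message bus, and the task list and log format are assumptions for illustration:

```python
# Sketch of recovery tasking and reporting over a messaging channel.
# queue.Queue stands in for the enterprise message bus; task names,
# owners, and the log format are illustrative assumptions.
import queue

bus = queue.Queue()   # stand-in for the enterprise messaging system
recovery_log = []     # the durable log lives on the robust server

def dispatch(task_id, owner, step):
    """Coordinator publishes a recovery task to the bus."""
    bus.put({"task_id": task_id, "owner": owner, "step": step})

def work_and_report():
    """A participant consumes tasks and reports completion to the log."""
    while not bus.empty():
        task = bus.get()
        # ... perform the actual recovery step here ...
        recovery_log.append((task["task_id"], task["owner"], "done"))

dispatch(1, "alice", "restore-db")
dispatch(2, "bob", "verify-backups")
work_and_report()
print(recovery_log)  # every completion is attributable to a named owner
```

Because tasking and reporting flow through the same channel, each completion record carries the owner's identity automatically, which is exactly the audit trail the plan needs.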
- Don't just e-mail; require hot-link responses to validate completion. If the nature of your network and the placement of your servers mean that recovery activity spans physical distances, e-mail may be an appropriate messaging medium. But it's not enough to send out e-mail alerts. Your recovery coordinator must receive acknowledgments, and free-form e-mail is a poor means of verifying the specifics of task completion. Instead, embed links in those e-mails that produce user-specific log entries on your redundant server. Such entries are task-specific, user-specific, and well documented.
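One way to make those links user-specific is to sign each one with a per-task, per-user token, so a click can only log a completion for the person it was sent to. The token scheme, URL shape, and server name below are illustrative assumptions:

```python
# Sketch of hot-link acknowledgment: each alert e-mail carries a
# link signed for one task and one user; following it logs completion.
# The HMAC scheme, secret, and URL are illustrative assumptions.
import hashlib
import hmac

SECRET = b"rotate-me"   # shared secret held by the logging server
completion_log = []     # user-specific log entries on the redundant server

def make_link(task_id, user):
    """Build the acknowledgment link to embed in the alert e-mail."""
    token = hmac.new(SECRET, f"{task_id}:{user}".encode(),
                     hashlib.sha256).hexdigest()
    return (f"https://recovery.example.com/ack"
            f"?task={task_id}&user={user}&token={token}")

def handle_ack(task_id, user, token):
    """Server-side handler: verify the token, then log the completion."""
    expected = hmac.new(SECRET, f"{task_id}:{user}".encode(),
                        hashlib.sha256).hexdigest()
    if hmac.compare_digest(token, expected):
        completion_log.append((task_id, user))
        return True
    return False  # forged, reused, or mismatched link: nothing is logged

link = make_link(7, "alice")
token = link.split("token=")[1]
assert handle_ack(7, "alice", token)    # genuine click is logged
assert not handle_ack(7, "bob", token)  # token is user-specific
print(completion_log)  # [(7, 'alice')]
```

The signed token is what turns a mere click into an auditable, attributable act: nobody can acknowledge a task on someone else's behalf, which is precisely the scapegoat-proofing the plan calls for.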
Scott Robinson is a 20-year IT veteran with extensive experience in business intelligence and systems integration. An enterprise architect with a background in social psychology, he frequently consults and lectures on analytics, business intelligence and social informatics, primarily in the health care and HR industries.