Last week, I was having dinner in a popular chain restaurant (you’d
recognize the name) and noticed that our table’s server was particularly
agitated. While we ordered, he frantically wrote down everything we said
twice, making us repeat ourselves. I then watched him pass the terminal near
the kitchen where servers normally enter orders into the house computer
system, hand a slip to someone in the kitchen, and frantically scribble a
third copy of our order.

Later, harsh words from the kitchen carried into the dining
room. Intrigued, I asked the server afterward what was going on. The computer
system had crashed, it seemed, and management had given the servers
contradictory instructions about the order slips they needed to fill out.
Making matters worse, handfuls of specially numbered slips had been handed to
each server, and only after an hour and a half of activity were the servers
told that the numbered originals themselves had to be accounted for (the
number alone was not sufficient, for some reason) and that their pay would be
docked if they couldn’t produce them. Servers had to root through the garbage
to save their paychecks. Service was slow, and the servers undeservedly
suffered for it (we diners tend to take it out on the server, even when the
slowness is the fault of management or kitchen staff). By the time the smoke
had cleared, four of them had been fired or had quit that night.

This example will certainly seem simplistic when set beside
your own disaster recovery tales, but its simplicity speaks to the most
often-neglected factor in disaster recovery planning: accountability for human
failure in the chain.

The price of passing the buck

In the era of Enron, we no longer accept that a certain
amount of politics and managerial malfeasance is simply bundled into the cost
of doing business. Tolerance should be lower still in disaster recovery, which
is not merely a matter of moving assets from here to there; it can be
survival-critical.

Put simply, your disaster recovery system must be scapegoat-proof. When an employee at any
level, from the lowliest assistant all the way up to the CIO, conceals his own
failure to act as part of a disaster recovery plan or (worse yet) redirects blame
to an individual who acted properly, it does more than hide incompetence and
damage morale. Your company’s functionality and the integrity of its most
immediate assets are placed at risk, and for the most disingenuous of reasons.

Your disaster recovery plan, then, must be as impervious as
possible to politics, scapegoating, and buck-passing. If a human failure occurs
in the process and goes uncaught because an individual was able to hide it,
then the plan itself is deficient. You must have objective, reliable means of
enforcing accountability among the human participants in the system.

Account for accountability

Most disaster recovery plans assume that all participants in
the process share a common desire to see the crisis through with fierce
determination. Such plans focus on technical preparations and rapidly executed
procedures intended to restore operations. It never occurs to the authors of
such plans that there may be people in the loop who are less than dedicated to
the cause, or who will behave unscrupulously when they botch their part.
Looking back at the restaurant incident, it’s a fair bet that whoever set up
that recovery procedure gave little thought to the vulnerability of the food
servers or the discretionary powers of the floor managers.

It’s important, then, to ensure that every human decision
and task in a disaster recovery process is documented and auditable for later
evaluation. Moreover, this documentation must hold all parties accountable in
the long run, so that blame for a mistake or performance failure cannot be
shifted from one person to another.

How do you establish objective, auditable accountability?
Here are some approaches to consider.

  • Specify all tasks objectively and set
    up a documentation mechanism for each.
    Often, in a disaster recovery
    plan, a process is simply placed under a department’s or manager’s
    jurisdiction. The plan itself specifies neither the individual recovery
    tasks, nor who is accountable for completing them, nor any mechanism for
    documenting their completion. Take these extra steps (a sketch of such a
    task specification follows this list): when mistakes are made, you’ll be
    able to trace the problem unambiguously and reinforce your plan.
  • Publish and distribute the entire plan,
    including the roles of all parties regardless of rank.
    When you know
    what’s expected of you, and you know that everyone else also knows what’s
    expected of you, you’re far more likely to do what’s expected of you. This
    egalitarian approach keeps everyone on the same page. If you’re the CIO,
    set a good example. If you’re not, get the CIO’s buy-in.
  • Establish explicit human-failure
    contingencies throughout the plan.
    A good plan is self-aware; it knows
    that it, too, can fail. So it includes fallback procedures for partial
    system recovery in case full system recovery proves impossible or
    protracted. The same principle applies to the human decisions and human
    tasking in the plan! If a manager is tasked, as part of the recovery, with
    accomplishing a portion of the plan and fails to do so, the plan should
    specify a fallback procedure (note the fallback owner in the sketch
    below). Because the plan is published, and all participants know the
    consequences of any one participant’s failure to come through, an extra
    incentive to keep the ball moving is built in.
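
To make the first and third points concrete, here’s a minimal sketch, in
Python, of what an objective task specification with an explicit
human-failure fallback might look like. All field, task, and staff names here
are illustrative assumptions, not prescriptions; your plan’s actual schema
will reflect your own systems and people.

    from dataclasses import dataclass
    from typing import Optional

    # A minimal sketch of an objectively specified recovery task with an
    # explicit fallback owner. All names are illustrative.

    @dataclass
    class RecoveryTask:
        task_id: str           # unique, citable identifier for auditing
        description: str       # what must be done, stated objectively
        owner: str             # the one person accountable for completion
        fallback_owner: str    # who takes over if the owner fails to act
        deadline_minutes: int  # how long before the fallback is invoked
        completed_by: Optional[str] = None  # recorded when the task is logged

    # A plan is then a list of such tasks, publishable to every participant.
    plan = [
        RecoveryTask(
            task_id="restore-db-01",
            description="Restore the orders database from the nightly backup",
            owner="jsmith",
            fallback_owner="mlee",
            deadline_minutes=30,
        ),
    ]

Because every task names both an owner and a fallback, a failure to act is
visible in the record itself and cannot quietly be shifted to someone else.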

Leaving a trail

The idea, then, is to capture human events as objectively
and reliably as the digital ones, and it’s easy enough to do. Have no fear
that you’ll have to shell out yet another small fortune for this extra layer
of documentation; you probably have suitable tools lying around already. Here
are a few improvisational suggestions.

  • Make your disaster log a database, and
    embed triggers to report entries.
    Having people report in as the steps
    of the recovery plan are executed is all well and good (some tasks are so
    time-critical that contact by phone or in person is essential), but
    requiring that every completed task be documented removes all ambiguity
    and narrows accountability. E-mail is not enough. A paper form is not
    enough. If your disaster log is a database, even a simple and temporary
    SQL-based utility, you can define triggers that are task-specific,
    user-specific, or both, and set notifications for follow-on recovery
    steps in motion directly from the reporting of completed tasks (see the
    first sketch after this list). If employees know that their task sets
    other tasks in motion, and that only their timely reporting of completion
    to the disaster log will ensure the continuation of the recovery, they’re
    going to be diligent, and their actions will be well documented.
  • Got enterprise architecture, or application
    integration? Piggyback on its messaging system.
    When crashes are
    system-specific rather than universal, you can embed a recovery procedure
    as a workflow on a redundant server, or on a server known to be robust.
    If specific systems crash yet your network prevails, you can do recovery
    tasking and reporting over your enterprise messaging system (see the
    second sketch below). This gives you all the benefits mentioned above and
    can make your recovery process much faster.
  • Don’t just e-mail; require hot-link
    responses to validate completion.
    If the nature of your network and the
    placement of your servers mean that recovery activity spans physical
    distances, e-mail may be an appropriate messaging medium. But it’s not
    enough to send out e-mail alerts. Your recovery coordinator must receive
    acknowledgments, and e-mail alone is a poor means of verifying the
    specifics of task completion. Instead, embed links in those e-mails that
    create user-specific log entries on your redundant server (see the third
    sketch below). Such entries will be task-specific, user-specific, and
    well documented.
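
To illustrate the first suggestion, here is a minimal sketch of a disaster
log as a database with a completion trigger, using Python’s built-in sqlite3
module. The table, trigger, and function names are illustrative assumptions,
and the notification is just a print; in a real recovery it would enqueue an
alert to the owner of the follow-on task.

    import sqlite3

    conn = sqlite3.connect("disaster_log.db")

    # Application-defined function the trigger calls. Here it just prints;
    # in practice it would notify the owners of the follow-on tasks.
    def notify_next_step(task_id, completed_by):
        print(f"Task {task_id} completed by {completed_by}; alerting successors.")

    conn.create_function("notify_next_step", 2, notify_next_step)

    conn.executescript("""
    CREATE TABLE IF NOT EXISTS task_log (
        task_id      TEXT NOT NULL,                  -- which recovery task
        completed_by TEXT NOT NULL,                  -- who reported completion
        completed_at TEXT DEFAULT (datetime('now'))  -- when it was logged
    );

    CREATE TRIGGER IF NOT EXISTS on_task_complete
    AFTER INSERT ON task_log
    BEGIN
        SELECT notify_next_step(NEW.task_id, NEW.completed_by);
    END;
    """)

    # Reporting a completed task fires the trigger automatically.
    conn.execute("INSERT INTO task_log (task_id, completed_by) VALUES (?, ?)",
                 ("restore-db-01", "jsmith"))
    conn.commit()

The row itself is the audit trail: task, person, and timestamp, captured at
the moment of reporting rather than reconstructed afterward.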
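
For the second suggestion, here is a sketch of dispatching a recovery task
over an existing message bus. It assumes a RabbitMQ broker surviving on a
robust host and the pika client library; the broker host, queue name, and
task fields are all illustrative.

    import json
    import pika  # RabbitMQ client; any enterprise message bus would do

    # Connect to a broker on a server known to be robust (hypothetical host).
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="backup-broker"))
    channel = connection.channel()

    # A durable queue, so tasking survives a broker restart.
    channel.queue_declare(queue="recovery.tasks", durable=True)

    # Dispatch a recovery task to a named, accountable owner.
    task = {
        "task_id": "restore-db-01",
        "owner": "jsmith",
        "action": "Restore the orders database from the nightly backup",
    }
    channel.basic_publish(
        exchange="",
        routing_key="recovery.tasks",
        body=json.dumps(task),
        properties=pika.BasicProperties(delivery_mode=2),  # persist message
    )
    connection.close()

Completion reports can flow back over a second queue into the disaster log,
giving you the same documented trail with the speed of your existing
integration layer.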
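
And for the third, a sketch of hot-link acknowledgment using Flask and the
task_log table from the first sketch. The token scheme, URL, and names are
illustrative; a real deployment would persist tokens and run behind HTTPS
with authenticated users.

    import secrets
    import sqlite3
    from flask import Flask, abort

    app = Flask(__name__)
    TOKENS = {}  # token -> (task_id, user); persist these in practice

    def make_ack_link(task_id, user, base_url="https://recovery.example.com"):
        """Build a single-use, user-specific link for the alert e-mail."""
        token = secrets.token_urlsafe(16)
        TOKENS[token] = (task_id, user)
        return f"{base_url}/ack/{token}"

    @app.route("/ack/<token>")
    def acknowledge(token):
        if token not in TOKENS:
            abort(404)                     # unknown or already-used link
        task_id, user = TOKENS.pop(token)  # single use: consume on first click
        with sqlite3.connect("disaster_log.db") as conn:
            conn.execute(
                "INSERT INTO task_log (task_id, completed_by) VALUES (?, ?)",
                (task_id, user))
        return f"Task {task_id} logged as completed by {user}."

Because each link is bound to one task and one user, clicking it produces
exactly the task-specific, user-specific entry the coordinator needs.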