At 9:13 a.m. on Feb. 1, 2003, NASA Flight Director LeRoy Cain realized that the space shuttle Columbia had suffered a catastrophic failure. "Lock the doors," he ordered. The command directed all personnel in Mission Control to immediately institute the appropriate communications protocols and secure their local data for later analysis. The idea was to capture as much information as possible to explain the disaster later, and to minimize communications errors that would hamper the investigation and any recovery efforts.
When your system goes to hell, your goals are essentially the same. You want to know what happened, how it happened, where you stand, and what it will take to get things back on track. Your contingency manual gives you a detailed to-do list, but you also have the power to "lock the doors" and improve the odds of containing the damage and keeping users afloat. The suggestions below, some of them small things you may not have thought of, can help you get a handle on the situation faster and streamline communications before they become mired in panic.
Open the phones
When the system crashes (and especially when it's a really bad crash), people will want to know what's happening, and they'll want to know immediately. Certain individuals will also need instructions that depend on the nature of the crash, particularly when several courses of action are possible and they're not sure which to take.
The worst thing you can have at this point is a bottleneck. Information must travel both up and down the company hierarchy so that users and managers alike know what's going on. Establish several different lines of communication to make sure that word of the crash, its severity, and other relevant information disseminates quickly. When you're a user, there's nothing worse than having everything go dead without warning and then finding no way to learn what's going on or how long the fix will take.
When putting this in place, commit to issuing updates to the user community on a fixed schedule, and assign someone to see that it happens.
Do you know where your apps are?
It's a business reality that all applications are not created equal. One thing you may not have thought of in writing your disaster recovery procedures is to assign priority to your various applications. Some are more critical to your company's business continuity than others. Some are real-time, some are transactional, and some are archival. See to it that the key applications are prioritized ahead of time.
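As a sketch of what "prioritized ahead of time" can mean in practice, application priorities could live in a simple, version-controlled register that recovery procedures consult. The application names, tiers, and categories below are all hypothetical:

```python
# Hypothetical recovery-priority register; names and tiers are illustrative.
# Lower tier number = recover first.
APP_PRIORITIES = [
    # (application, tier, category)
    ("order-entry",    1, "real-time"),
    ("payments",       1, "transactional"),
    ("inventory-sync", 2, "transactional"),
    ("reporting",      3, "archival"),
]

def recovery_order(apps):
    """Return application names sorted by tier, so recovery
    proceeds highest-priority first."""
    return [name for name, tier, _ in sorted(apps, key=lambda a: a[1])]

print(recovery_order(APP_PRIORITIES))
# tier-1 apps come first: ['order-entry', 'payments', 'inventory-sync', 'reporting']
```

The point is less the code than the discipline: the ordering is decided and written down before the crash, not argued about during it.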
When disaster occurs, be certain that local application recovery has been initiated by the users or IT personnel responsible for it. That means getting data back online, of course, and getting user systems up and running. But there are also the critical matters of recovering data that was soft (in flight but not yet committed) at the time of the crash, synchronizing aggregate data (if more than one database is involved in real-time or transactional apps), and posting data collected via interim procedures during the downtime. If these matters are not carefully attended to, the aftermath can make a bigger mess than the crash itself.
Call your partners
If you're running in an ERP environment, you have transactional or even real-time networking between your company and your business partners. Notify these partners of the interruption of your systems immediately! As with local applications, the extended apps you share with your business partners will have to resynchronize later, and you want to minimize possible loss of transactional data during downtime. Since this will probably be the responsibility of the partner companies that send data to your systems (or expect to receive data from you), the sooner you let them know, the better.
Mirror, mirror
Does your disaster recovery process include mirror sites/servers? If so, you're well prepared for catastrophe and probably have a high commitment to business continuity. You can reroute critical applications to a mirror system (probably off-site) within seconds of a major in-house crash.
Here's what else you can do: bring your users up on the mirror system in a controlled fashion. Do so according to application priority, as mentioned above, and have a verification procedure in place as you bring up each group of users.
Why do you want to do this? For two reasons. First and foremost, the mirror system probably gets little real-time use; suddenly throwing 1,500 users at it with the flip of a switch may work in simulation, but in the real world it courts a second disaster. Taking a few short steps rather than one big leap lets you monitor the transition and catch potential problems before they hit. Second, if there is a problem transitioning users to the mirror system, it will occur within a priority hierarchy: the user groups serving the highest-priority apps go first and are back up and running within minutes. If the mirror system wigs out on you above, say, 150 users, the most important apps are already running while you figure it out, and only lower-priority work waits.
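The staged transition described above can be sketched as a loop that admits users to the mirror in priority order, in small batches, with a verification gate between batches. The group names, batch size, capacity figure, and health check below are all hypothetical stand-ins:

```python
# Sketch of a staged failover: user groups move to the mirror in
# priority order, in small batches, with a verification gate between
# batches. Group names, sizes, and the health check are illustrative.
user_groups = [
    ("order-entry users", 120),   # highest-priority apps go first
    ("payments users",     80),
    ("reporting users",   300),
]

MIRROR_CAPACITY = 1500  # assumed tested capacity of the mirror system

def mirror_healthy(active_users):
    """Placeholder verification step; in practice this would run
    transaction probes against the mirror after each batch."""
    return active_users <= MIRROR_CAPACITY

def staged_failover(groups, batch_size=50):
    active = 0
    for name, count in groups:            # priority order
        moved = 0
        while moved < count:
            step = min(batch_size, count - moved)
            active += step
            moved += step
            if not mirror_healthy(active):
                # Stop here: the highest-priority groups are already
                # up; diagnose before admitting more users.
                return active - step, f"halted while moving {name}"
        print(f"{name}: all {count} users on mirror ({active} total)")
    return active, "complete"

users_up, status = staged_failover(user_groups)
```

If the check fails mid-transition, the function stops with the highest-priority groups already served, which is exactly the behavior the priority hierarchy is meant to guarantee.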
Is there a hacker in the house?
Was it an accidental crash, a stupid mistake, or is an intruder bringing you to your knees? This is probably the most critical question to answer when the lights go out.
Identifying a hacker attack and choosing the appropriate countermeasures are topics best covered elsewhere, but understand that if it is an attack, there are several lock-the-door steps you must take immediately:
- Shut down the attack. If you must, isolate the system by disabling the network. If the attack is server-specific and you know which server it is, take that server down. Are you logging server activity, and are you doing so on a remote server? (This is a very good practice, by the way.) If so, try to identify the attacker immediately. You may be able to isolate the path of attack, shut that path down, and get the system back up within minutes.
- Log the IDs of all remote users and secure those logs immediately, in case the hacker tries to wipe them before you get around to it. If you can't identify the attacker right away, be certain you have a detailed activity log to turn to later. Even a sly hacker often leaves a trail.
- Watch your local server-level users. Do you have local users with high-level security access to your servers, either directly or via apps? The "hack" could be accidental: an application could be going haywire and causing damage, or a user could be making unwitting mistakes. Know which users and workstations these are, and check them immediately if there is no other readily apparent source of attack.
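One way to "secure those logs" quickly is to copy them to a separate location and record a cryptographic hash of each file, so later tampering with the originals is detectable. A minimal sketch, assuming file-based logs (the paths and the `.log` naming are hypothetical):

```python
import hashlib
import shutil
from pathlib import Path

def secure_logs(log_dir, vault_dir):
    """Copy each log file to a vault directory and record its SHA-256
    hash, so any later modification of the originals can be detected."""
    vault = Path(vault_dir)
    vault.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for log in Path(log_dir).glob("*.log"):
        digest = hashlib.sha256(log.read_bytes()).hexdigest()
        shutil.copy2(log, vault / log.name)  # preserves timestamps
        manifest[log.name] = digest
    # Write the manifest alongside the copies for later verification.
    lines = [f"{h}  {name}" for name, h in sorted(manifest.items())]
    (vault / "MANIFEST.sha256").write_text("\n".join(lines) + "\n")
    return manifest
```

Ideally the vault lives on a different machine entirely; a copy on the compromised host only raises the bar, it doesn't guarantee integrity.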
Finally, whether the crash is a hack attack or some other disastrous event, consider tooling for system imaging. This is one of the tools available to NASA's mission controllers when reconstructing a tragedy like the Columbia disaster. The benefit? Once your apps are back in business, you can reconstruct the exact state of your system at the time of the crash. That will help you diagnose not only what went wrong but what weaknesses exist in your system. Remember that the ideal recovery is not just the reestablishment of operations but the establishment of a system more stable and secure than the one that blew up.
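System imaging itself is a disk-level operation, but even a file-level sketch shows the payoff: given a hash manifest taken from the crash-time image and one taken from the rebuilt system, a diff tells you exactly what changed. The paths, hash values, and function name below are illustrative stand-ins:

```python
def diff_snapshots(before, after):
    """Report files added, removed, or modified between two
    {path: content-hash} snapshots, e.g., the crash-time image
    versus the rebuilt system."""
    added    = sorted(set(after) - set(before))
    removed  = sorted(set(before) - set(after))
    modified = sorted(p for p in before
                      if p in after and before[p] != after[p])
    return {"added": added, "removed": removed, "modified": modified}

# Illustrative snapshots; "a1", "b2", etc. stand in for real digests.
crash_image = {"/etc/app.conf": "a1", "/var/db/orders": "b2"}
current     = {"/etc/app.conf": "a1", "/var/db/orders": "c3",
               "/var/db/audit": "d4"}
print(diff_snapshots(crash_image, current))
```

Anything in the "modified" or "added" buckets is a lead: either part of the damage or part of the fix, and either way worth understanding before you declare the recovery done.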
Scott Robinson is a 20-year IT veteran with extensive experience in business intelligence and systems integration. An enterprise architect with a background in social psychology, he frequently consults and lectures on analytics, business intelligence and social informatics, primarily in the health care and HR industries.