Seemingly simple things, like recovery and availability, become unaccountably complex when we leave the happy confines of our cubes. Sometimes this complexity stems from a variance between the scope of our visions—between IT and users, for example; after all, we have a much more granular understanding of the system than most users. Other times, though, it comes from a misunderstanding of what is important to whom. I learned this truth the hard way after one particularly brutal disaster recovery fiasco.
My team, a group of junior architects and senior engineers working on a network deployment, received an “all call” at 4 A.M. MST. Fortunately we were still awake, working hard on some problems with our test-bed message backbone. I phoned the call center manager. The main data center was on fire. All of the backbone services had shut down. He needed hands to help get something working before people on the East Coast got in to work.
I sent half my team back to the hotel. The other half hit the keyboards. We lit phone lines up for hours, fielding questions and checking services. After the first few hours of chaos, we had enough of the system running and stable for people to limp along. Simultaneously, the call and data center teams worked like maniacs to get applications online in the secondary data center. When the other half of my team woke up, I put two on the phones. The rest went down to help the other teams.
In short order, we reconfigured the mail system so that it did not rely on our central hub server, complicating the mail routing but restoring service. File/print services were mostly local, with the exception of a handful of large data repositories and some ancient shares on an old NetWare server that was currently slag. It took several days to get ERP (Enterprise Resource Planning) back to full functionality.
Despite the initial confusion, the IT teams as a whole felt pretty good about the whole affair. We responded to a difficult situation quickly, focused on restoring service in successive waves, and even kept the corporate executives and workers informed of our progress through a newly designed phone tree system. Some things could certainly have been done more effectively. But given that we are human beings, that did not come as a huge surprise to anyone.
Our relative euphoria began to crash when we heard the muttering. Clients were getting angry because they could not get information. Most of the users, from order entry to product shipping, struggled to find their data. Some openly questioned whether IT had actually done anything in the last three days.
I called up one of the IT user group leaders (let us call him Dave). Dave told me that there was a movement to burn us in effigy around the company. It seemed that, in all of our work, we missed something vital to the workers—a set of spreadsheets on the NetWare server that everyone used to process orders. While we congratulated ourselves, our users could just barely do their jobs.
Success and failure at the same time
Without getting into a discussion of the “shadow ERP,” this experience taught me several important lessons about recovery operations. Some of them came from what we did right, others from the obvious mistakes we made.
First on the list of positives, our plan of restoring services in “layers” helped to manage our users’ perception. We ensured that first and foremost they had basic services (login, local printing, local and regional e-mail). We then worked on restoring what we thought of as critical site-to-site communications. Only after the users could do their jobs and clients could place orders in a rudimentary fashion did we shift to restoring full services.
Second, our staff management plan worked very well. We broke the team into overlapping twelve-hour shifts, ensuring good coverage and that no one person was on the floor too long. We also shifted our staffing around as needed. A handful of people worked on the secondary data center at first, and the rest of the team focused on answering calls, spreading information, and dealing with the service layers. As things calmed down, we shifted staff until almost everyone worked on the secondary data center floor.
Third, our unrehearsed and ill-planned communications strategy spread data through the environment at a ferocious rate. We took advantage of formal and informal distribution channels. Even the users who wanted us lynched honestly appreciated that the recovery process was so open to them. They received real information on a regular basis, making it a great deal easier for them to trust us.
Unfortunately, not everything worked as well as we wished. Take our layered approach. We successfully restored services that we thought were important. But we did not have the good sense to talk to our users before the disaster to discover what they wanted restored first. We guessed right sometimes. Sometimes we did not. Why guess, when the users could just have told us what services they considered the most critical?
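One way to stop guessing is to write the users' answers down and let them drive the restoration order. What follows is only a minimal sketch of that idea, not the process we actually used. The `Service` record, the `restoration_order` function, and every number in the example are hypothetical placeholders; the service names echo the ones in this story, but no such survey was ever taken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    it_layer: int    # IT's own guess: 1 = basic, 2 = site-to-site, 3 = full service
    user_votes: int  # users who named this service as critical (made-up figures)


def restoration_order(services):
    """Return services sorted for restoration.

    User votes come first (most requested first); IT's layer only
    breaks ties, and the name keeps the ordering deterministic.
    """
    return sorted(services, key=lambda s: (-s.user_votes, s.it_layer, s.name))


# Illustrative input only: these vote counts are invented for the sketch.
services = [
    Service("login", 1, 40),
    Service("local printing", 1, 12),
    Service("e-mail", 1, 25),
    Service("formal ERP", 3, 18),
    Service("order spreadsheets (NetWare share)", 3, 37),
]

for rank, svc in enumerate(restoration_order(services), start=1):
    print(rank, svc.name)
```

With these invented numbers, the order spreadsheets land second, right behind login, even though IT had filed them in the last layer. Sorting on votes before layers is a deliberate choice in this sketch: it lets the users overrule IT's assumptions, which is the lesson of the paragraph above.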
The waves of information we released into the environment sometimes conflicted with each other. The three managers (myself, the call center manager, and the data center manager) did not always synchronize our information before releasing it. We should have either appointed one of us to be the primary communicator or tossed it up one level to the CIO. However, he was so busy with the executive team and the COO (the fire damaged more than the data center) that we did not want to bother him.
Beware of the shadow (ERP)
Finally, the shadow ERP bit us. We focused on restoring the formal ERP. But most organizations have a secondary, shadow ERP composed of linked spreadsheets and desktop databases that manages a great deal of the day-to-day work. Ours lived on an old, unprotected NetWare file server. Although corporate policy usually prevents us from formally supporting this shadow, recovery sometimes requires us to break policy long enough to restore business function. We honestly should have moved the shared components over to a server we could back up long before the problem arose.
Although dozens of other factors arise in business continuance (legal, operational, and political), the core concepts of layered restoration, rotating staff, fast coordinated communication, and identifying our users’ (rather than our own) key services have helped me through worse disasters than the one described above.