How the Help Desk Failed the Enterprise during Disaster Recovery

I've recently been consulting for a company that had a minor disaster. It was minor, at least, in this sense: the corporate building was rendered unusable by an Act of God, but the server room survived without a single drive being lost to heat or uncontrolled shutdown.

I had read the company's disaster recovery plan, such as it was (a copy and paste of a well-known Internet template), and I knew there'd be problems if anything ever happened. Then lightning struck, literally, late on a Friday night.

Although this $1+ billion company had no workable, tested recovery plan, it managed to restore network connectivity to loaner PCs at an alternate work site, and for all intents and purposes, services were restored by Monday morning.

So where did the help desk fail the enterprise? Let me enumerate the ways:

(1) No printers mapped. There were printers aplenty in the alternate work space, but when they imaged the loaner machines, the support techs didn't bother to install a single printer. Furthermore, none of the available printers were labeled with IP addresses or names, so the only way you could look one up was by make and model. While veteran IT people weren't bothered by such trivial matters, the "typical" end users were lost. The "add printer" wizard might as well have had a Russian-language interface. Eventually one of the tech guys came around with a 3.5" floppy disk and ran a script that installed printers on PCs in the various work areas. It was a little late and a little lame, but eventually users were able to print. Lesson learned: If you can get PCs installed and networked, take five more minutes per machine and install the closest printer.

(2) No phones or voicemail. There were phones in the alternate work site, but the ratio was one phone to five or six employees, none of the original extensions worked, and no one knew what the new extensions were. It took a full two business days before someone got around to compiling a list of where people were sitting and sending it out. Meanwhile, third parties calling the old direct-dial numbers were getting fast busy signals or an outgoing message stating that the number was no longer in service. Lesson learned: Make restoration of the phone system a recovery priority. If you can't restore full functionality of the phone system, at least work with the phone company to reroute incoming calls as quickly as possible.

(3) Minimal communication. The managers of various teams tried their best to communicate facts about system status and when core applications and shared drives would become available. The problem was that the line-of-business managers weren't getting timely reports from the IT people managing the recovery. As a result, people were making up wild stories based on rumors about the state of the system. That lack of communication didn't instill confidence in the end users that IT knew what it was doing. Lesson learned: No matter how busy you are rebuilding servers or restoring connectivity, someone in the IT support organization has to be the "point person" for communication with the lines of business.

If you're lucky enough to work for organizations that know the value of a good business continuity (disaster recovery) plan, you may be chuckling at the lessons learned I've presented as "lessons that should be obvious."  For those of you who aren't so lucky, take note. You never know when lightning is going to strike, but you should assume that it will, eventually.

