For obvious reasons, IT organizations around the world have spent the last few months creating or reviewing their disaster recovery plans. They have reviewed their physical and virtual security, contacted backup hosting providers and data centers, and worked on contingencies for immediate office space in case the entire organization had to move to a new location.
That’s the kind of world we live in today, and an IT manager would be foolhardy not to plan for such events.
However, what if you were confronted by a small disaster, something that’s more than an inconvenience but less than a total failure? Is there a gap in your planning? Do your procedures allow for only two possibilities: “business as usual” and “start from scratch”? In this column, I’m going to discuss a small disaster that happened recently at TechRepublic and the questions it raises about the planning and flexibility needed by technical managers.
It all started with some idiot on a tractor…
Last Friday morning, we lost our electricity at TechRepublic’s main offices for almost an hour. Apparently, somebody with a backhoe cut a power cable and put a couple of city blocks in the dark. In the engineering building, the UPS in the server room kicked on, and the operations guys started shutting down the various servers. Unfortunately, before they could completely shut down the Exchange server, the UPS died. When power was restored and the techs brought the Exchange box back up, they discovered database corruption and attempted to repair the errors.
When the lights go out, know who has the flashlight
Over the next couple of weeks, we’re going to be telling you what happened in more detail, for a couple of reasons. First, we learned some things during the recovery efforts that we want to pass along. Second, you’ve told us that you like reading about these kinds of problems.
In this column, however, instead of focusing on the specifics of what went wrong, and how our operations folks fixed it, I want to concentrate on what this kind of minor crisis teaches us about the need for planning and flexibility.
After all, the power was off for just an hour. None of our offices was actually damaged. In the minds of most of our employees, it just meant a wasted morning—until e-mail stayed offline for the rest of the day. As most of you know, e-mail is the mission-critical application for many organizations, and TechRepublic is no exception.
It ended up being a real problem, and yet it wasn’t serious enough to trigger the company’s formal disaster recovery plans. What do we call it: a small disaster? A minor train wreck? Whatever the term, IT managers have to be able to respond.
How to prepare for a “minor disaster”
Looking at what our operations team had to go through, here are some factors for you to consider when fighting through your own “small emergencies”:
While e-mail is vital to any IT organization, the telephone also comes in mighty handy. Take a minute and think of everyone our operations people had to contact:
- The landlord
- The electric utility
- The hardware manufacturer
- The tape backup support department
- Microsoft technical support
- Plus anyone else they could think of who’d encountered a similar problem in the past
Faced with the same kind of crisis, could you easily come up with all the contact information necessary? Consider a lightning strike that overpowers your UPS and fries some of your boxes. If your organization is like many others, you not only have to support a variety of hardware manufacturers, but you also have to consider a mixture of equipment that you own outright and other types of equipment that you lease. Obviously you have to treat the latter differently.
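One way to make that contact information easy to find is to keep it next to the equipment inventory. What follows is only a minimal sketch in Python, not a description of how our operations team works: the `Asset` class, the `call_sheet` function, the asset names, and every phone number are hypothetical placeholders. It also assumes that for leased equipment the leasing company gets the first call, which is just one possible way to "treat the latter differently."

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    ownership: str  # "owned" or "leased"
    contacts: dict = field(default_factory=dict)  # role -> phone number


# Hypothetical inventory: every name and number below is a placeholder.
ASSETS = [
    Asset("exchange-01", "owned", {
        "hardware manufacturer": "555-0101",
        "Microsoft technical support": "555-0102",
    }),
    Asset("tape-library", "leased", {
        "tape backup support": "555-0104",
        "leasing company": "555-0103",
    }),
]

# Contacts that apply to the whole site rather than to one box.
SITE_CONTACTS = {"landlord": "555-0105", "electric utility": "555-0106"}


def call_sheet(asset_name, assets=ASSETS, site_contacts=SITE_CONTACTS):
    """Return (role, phone) pairs to call when the named asset fails."""
    for asset in assets:
        if asset.name == asset_name:
            calls = list(asset.contacts.items())
            if asset.ownership == "leased":
                # Assumption: for leased gear, the leasing company is called first.
                # The sort is stable, so the other contacts keep their order.
                calls.sort(key=lambda pair: pair[0] != "leasing company")
            return calls + list(site_contacts.items())
    raise KeyError(asset_name)
```

Calling `call_sheet("tape-library")` returns the leasing company first, then tape backup support, then the site-wide contacts. A printed copy of the same list is worth keeping too, since the file server holding it may be one of the boxes that is down.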
Internal client needs
Here is a tough test for you. Go into your server room and point to a box at random. Now assume that the box will be offline for eight hours. Do you know who in your organization would be affected by such an outage? Probably not. Do any of your people? You’d better hope so. Even in midsize companies, most internal networks are too complicated for any one person to understand the dependencies of each piece of hardware or every software application.
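Since no one person can hold all of those dependencies in their head, it can help to write them down in a form a script can walk. The sketch below, in Python, is an illustration under stated assumptions rather than a picture of any real network: the `DEPENDS_ON` map, the server and service names, and the `affected_by` function are all hypothetical. It answers the "point to a box" question by following the dependencies in reverse.

```python
from collections import deque

# Hypothetical dependency map: each item lists what it needs in order to run.
DEPENDS_ON = {
    "outlook-clients": ["exchange-01"],
    "exchange-01": ["domain-controller-01", "san-01"],
    "intranet": ["web-01"],
    "web-01": ["db-01"],
    "db-01": ["san-01"],
}


def affected_by(box, depends_on=DEPENDS_ON):
    """Return everything that directly or indirectly depends on `box`."""
    # Invert the map: for each box, which items depend on it directly?
    dependents = {}
    for item, needs in depends_on.items():
        for needed in needs:
            dependents.setdefault(needed, set()).add(item)

    # Breadth-first walk outward from the failed box.
    seen = set()
    queue = deque([box])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, ()):
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return seen
```

With this sample data, `affected_by("san-01")` reports that the Exchange server, the database, the web server, the intranet, and every Outlook client would feel the outage. The map is only as good as the people who maintain it, so the real work is still asking each team what their applications rely on.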
Keeping everything else running
While it would make life much simpler to be able to focus all your efforts on the current problem and let everything else fall by the wayside, that’s not practical. You’ve got to be able to juggle a number of balls.
When the Exchange server goes down, you’d like to have more than one Exchange expert on the premises. All IT departments are stretched thin these days, but it’s worth the effort to cross-train staff, particularly on mission-critical equipment and applications.
If a flood put your data center under water, chances are that everyone would understand that it could take some time to get an alternate data center up and running. People would cut you some slack. When the Exchange server is down, on the other hand, and everything else is working normally, people are more inclined to pick up the phone and holler, “I need Outlook, now!” That’s the paradox: Small crises breed impatience. While you might not be able to soothe your internal clients, try to stay calm yourself.
Join the discussion
In a future article, we’ll be talking more about the specifics of what happened to our Exchange server and how our operations folks at TechRepublic responded. In the meantime, we’d like to hear how you handled a “small disaster” at your shop. Just drop an e-mail or post a comment to this column.