Towering accomplishment: How one company rebuilt after catastrophe

A properly thought-out business continuity plan can be the difference between recovering from disaster and going down with the ship. Learn how one company survived September 11 and discover what it's doing to decrease future continuity risk.


By John McGrath, Tech Update

As I stood on the corner of Sixth Avenue and Houston Street and watched my office on the 85th floor of the north tower collapse that Tuesday morning, wondering if my coworkers were trapped inside, business continuity planning was the last thing on my mind.

Eight long hours after the collapse of the building, I learned that all three people who were in the office at 8:46 a.m. had escaped by racing down the 85 flights of stairs, only to narrowly miss being caught in the debris cloud when the south tower fell.

Our focus immediately after the attack was on the most important things—establishing communication and making sure that everyone in the office had survived. Once that was done, we quickly realized that despite the unbelievable circumstances, our business could not stand still. Messages were flooding in through our Web site from people concerned about our safety, and hundreds of people in our organization were dependent on us. So we all set aside our emotions and shock and started working again.

I was a developer for the eBusiness group of Thermo Electron, a global technology and manufacturing company. In the weeks after September 11, I worked alongside my managers Markus Leibundgut and Dom Wissmann, and the other members of our group, to rebuild our Web operations. Here's a look at how our original continuity plans worked under worst-case-scenario conditions and how the experience permanently changed how one company will plan for the future.

Thanks to moderate business continuity measures, like off-site servers, and some lucky coincidences with laptops, Thermo resurrected itself within a matter of weeks. But the company isn't leaving disaster recovery plans to half-measures or chance from now on.

Learning from losses
When the towers collapsed, Thermo lost its development servers, which were in the WTC office, as well as file and e-mail servers, desktop workstations, and all the rest of the equipment necessary to run an office. We also lost hundreds of hours of work that hadn't yet been backed up, including new staging templates I'd been working on, improvements to the database by our DBA, and a host of bug fixes and interface improvements we had been planning to migrate to our co-location facility that week.

However, the company had made some fortuitous choices in the past. The most important of those prior to the 11th was the decision to co-locate most of the company's servers off-site. While we lost a great deal, the lion's share of the Web site and its content, representing thousands of hours of labor, remained intact in New Jersey.

The original impetus behind co-location was to protect against comparatively trivial problems, like power outages. But it was also economical, according to Wissmann.

"For most corporations, it makes sense to leave hosting to people who do it best. The systems you need, like truly redundant power and Internet access, are extremely expensive but relatively cheap as a marginal cost in a cohosted facility."

Co-location reduces some risks, but it just transfers others to a different place. Realizing this, Thermo now moves full data backups to an off-site location at least once a week.

"We do a much better job than we used to," Wissmann told me. "Until 9/11, we made nightly backups but left the tape there. Now we take it off-site—we're much more aware of what it really means to be redundant and backed up. It's cute to back up on a local tape, and it's useful if there's a grain of sand on the hard drive or whatever. But it won't help at all if there's a real event, potentially something as simple as a fire."

Off-site backup can be as basic and cheap as having someone bring a tape home once a week, or as complex as having full real-time mirroring of all data between two facilities. While the latter approach, done correctly, all but eliminates the possibility of data loss, it requires significant throughput and can be expensive. Leibundgut and Wissmann determined that this level of service wasn't required and that regular incremental backups would suffice.
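
For readers unfamiliar with the term, an incremental backup copies only what has changed since the previous run. The sketch below is a minimal illustration of that idea and not a description of Thermo's actual tooling: the function name, the directory arguments, and the use of file modification times are all assumptions made for the example, and it assumes the backup directory is not located inside the source directory.

```python
import os
import shutil

def incremental_backup(source_dir, backup_dir, last_backup_time):
    """Copy files modified after last_backup_time (epoch seconds).

    Hypothetical helper for illustration only. Returns the sorted list
    of relative paths that were copied into backup_dir.
    """
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            # Only files changed since the last run are copied.
            if os.path.getmtime(src) > last_backup_time:
                rel = os.path.relpath(src, source_dir)
                dest = os.path.join(backup_dir, rel)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.copy2(src, dest)  # copy2 preserves timestamps
                copied.append(rel)
    return sorted(copied)
```

A scheme like this keeps each run small, but restoring requires the last full backup plus every increment since, which is one reason full backups are still taken periodically.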

Every organization needs to balance the cost of continuity strategies against the cost of possible downtime. For a big company, continuity costs can be massive. According to a recent Network World report, Coca-Cola Enterprises, the bottling division of The Coca-Cola Company, will spend over $400 million on a disaster recovery/business continuity plan this year, the largest IT project in the company's history.
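
One rough way to frame that balance is to compare the expected yearly cost of downtime with and without a plan against what the plan costs to run. The sketch below is only an illustration: the function names are hypothetical, and every figure in the usage example is invented and does not come from Thermo or any other company mentioned here.

```python
def expected_annual_downtime_cost(years_between_incidents,
                                  hours_down_per_incident,
                                  cost_per_hour):
    """Average yearly loss from outages, given how often they occur."""
    return hours_down_per_incident * cost_per_hour / years_between_incidents

def net_benefit(baseline_loss, loss_with_plan, annual_plan_cost):
    """Yearly savings from a continuity plan after paying for it."""
    return (baseline_loss - loss_with_plan) - annual_plan_cost

# Invented figures: one major outage every 20 years, $10,000 per hour lost.
without_plan = expected_annual_downtime_cost(20, 240, 10_000)  # 10 days down
with_plan = expected_annual_downtime_cost(20, 24, 10_000)      # 1 day down
print(net_benefit(without_plan, with_plan, 50_000))
```

A positive result suggests the plan pays for itself on average. A simple expected-value model like this understates rare catastrophic losses, so it is a starting point for discussion and not a decision rule.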

Thermo was lucky in many ways, because while the Web site was an important business tool, we weren't as reliant on real-time communication as some other companies. The site's primary purpose was to store the huge volumes of information the company has on the thousands of products sold by various Thermo business units. As Thermo increasingly relies on e-commerce and uses the Web for internal business processes, the disaster recovery and continuity planning will have to change to reflect the changing role of the Web in the organization.

For now, Thermo has chosen to rely on increased planning using mostly equipment and software already on hand. For instance, the development environment, housed in a separate facility from the production servers, can be converted to a stand-in production environment should the need arise. If a standby system is necessary, it doesn't have to be identical to the production system it's replacing: fewer or less expensive servers may be suitable, since a lower level of service can be acceptable as long as availability is maintained. This decision will vary greatly depending on the type of business being protected. A news organization, for instance, might see Web traffic spike in the event of a major catastrophe and actually require greater capacity from its standby servers than from its normal production servers.

Mobilizing the troops
Keeping a site running is one thing. Another crucial planning step is making sure the IT staff is able to keep working.

Wissmann estimates that on 9/11, his developers lost six to eight weeks of work when workstations and the local development environment were destroyed.

The loss would have been greater, but both Leibundgut and Wissmann used laptops as their primary computers, and both had taken them home the night of September 10th.

"It was sheer luck that Markus and I had our laptops," says Wissmann. "If we hadn't, we would have been pushed back half a year at least." The developers, though, lost "an incredible amount of personal business data on each machine."

After the 11th, Thermo purchased laptops for the entire staff, which allowed us to work remotely while new office space was found. Once in our new office, we were given backup space on a file server, the contents of which were automatically backed up to a remote location every night.

This system has already proven its worth: when a number of machines were stolen from the office last February, very little data was lost.

Larger offices may require formal arrangements for standby office space as part of their continuity planning. Wissmann doesn't see this as necessary for smaller offices like Thermo's eBusiness group. For one thing, in an emergency as widespread as the WTC disaster, there's a good chance that contingency space will also be unusable. Indeed, in the weeks following September 11th, several companies reported that they couldn't get access to facilities.

Telecommuting worked well for Thermo once we were properly equipped. We supplemented regular work at home with weekly meetings in space provided by sympathetic vendors.

After protecting data and ensuring the ability to work remotely, the third focus of Thermo's continuity planning review was to thoroughly document all systems and put that documentation on the departmental intranet (which, of course, was backed up off-site). This way, if members of the development team were to be hit by a bus (or an airplane), their knowledge would not be lost with them.

There are less morbid advantages too, according to Wissmann: "If you want to be able to take a vacation without getting a cell phone call every day, make sure you're not so desperately needed—document your work."

If September 11th proved anything, it's that you can't plan for, or even imagine, every possibility. But now that Thermo has become much more vigilant about documenting work and backing up data, the company is much better prepared to work through a disaster. Some of the things that helped us get back to work after September 11th were the result of luck, and others of planning, but for the future, disaster planning is now a serious part of Thermo's business.

On a personal level, what helped us most as individuals was having any kind of plan to work with. In the haze of tragedy, when no one is thinking straight, it may be that having even the driest business exercise to cling to is better than nothing. As Rudy Giuliani and others pointed out in the days following September 11th, maintaining the regular patterns of life to whatever degree possible is a positive and life-affirming step.
