Good Disaster Recovery War Stories: Got Any?

By robo_dev
I've worked through four pretty significant IT disasters over the years.

The most interesting and avoidable one was the following:

There was a simple move planned for a computer mainframe system from one data center to another on the same campus. Easy, right? What could go wrong?

The computer movers showed up at 6AM on Saturday, and quickly loaded the 18-foot truck with the processor units, controllers, tape and storage systems, no problem.

One tiny detail was missed.

One of the movers stopped to have a smoke.

He was the one who normally closed and latched the back door of the truck.

As the truck pulled up the ramp from the loading dock, the rear door swung open, and several million dollars' worth of IBM disk arrays slid off the back of the truck, falling about four feet to the pavement.

A big 'oh shoot' moment.

24 hours and a restore-from-tape later, things were working.

Fire Safe

by oldbaritone In reply to Good Disaster Recovery Wa ...

This one didn't have as happy a result.

Customer had a fire. Everything was destroyed. When we started installing new equipment in a temporary location, we asked about his backup.

He said, "Oh, sure. That was kept in the fireproof safe. I'll get it."

He hadn't listened when we told him that "Fireproof safes are designed to protect PAPER from fire. They aren't any good for computer media."

He pulled out the box of backup DVDs, still cube-shaped. We ended up peeling the cardboard box off to show him that his "backup" was a solid block of glop. No more disks.

PAPER can get as hot as 700 degrees without burning, if there is no oxygen. That's how "fireproof safes" work - they moderate the heat somewhat, and keep the oxygen out. But plastic melts at MUCH lower temperatures, and the computer media is destroyed.

OFF-SITE backups are the only way to protect against this kind of disaster. Fortunately, there are now many online services that can provide this to SOHOs at a reasonable cost.

This is exactly what I was looking for

by seanferd In reply to Good Disaster Recovery Wa ...

when Sonja asked me to come up with a question of the week thingy way back.

Beautiful. I bet that was one expensive 'oh shoot' moment. :0

it was 'Priceless'

by robo_dev In reply to This is exactly what I wa ...

Once upon a time lived a Windows NT 4 server

by JamesRL In reply to Good Disaster Recovery Wa ...

It didn't live in the data centre with the Unix servers. It lived in a spare office.

I was the database admin and server admin for a Lotus Notes app (not mail) server. I selected the hardware, including a tape backup. It came preloaded and partitioned to order by Dell; all I had to do was load the app and the data.

The server was pretty important. We had created a project tracking system which held all the project docs and plans, for a very large IT organization. It was about 200 users plus 100 more read only users who came in via a web service.

I made backups every day using a standard three-tape rotation. The project sponsor was a little nervous, though, and asked me to make a one-off backup on his Iomega external SCSI cartridge drive. Now, from my years of working with Macs, I knew well that SCSI drives needed termination on one of the drives in the chain. But the Iomega was supposed to have a fancy auto-detect mode that would turn termination on if required.

So with that knowledge, I powered down the server, plugged in the Iomega and powered back up... only to discover that not only was I facing a BSOD, but the boot sector on the main HD was toast. Before I could restore, I needed to resolve that one way or another and get the boot sector fixed. Booting from the OS CD wasn't working.

I did manage to make it work, but not until after a longish night (bed at midnight) and an early start (6 AM), when I could think clearly. I reinstalled the OS, reloaded the app, and restored the database from tape, and was up and running just a few minutes after 9 AM.

I later researched and found there was a known bug with the Iomega device and the SCSI chipset embedded on my Dell server.

IBM X236

by NickNielsen In reply to Good Disaster Recovery Wa ...

Primary store server, set up with two pairs of SCSI hard drives, each pair running in a RAID 1 configuration. The ServeRAID controller reported a DDD (Defunct Disk Drive) condition for Drive 2 of the first array (C partition). Not a problem. Pop the old drive out and slap the new one in, rebuild starts automatically, takes about 15-20 minutes for a 37GB drive. Except that during the rebuild, Drive 1 in that array died and the rebuild failed.
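As a rough sanity check on the rebuild figures above, here is a minimal Python sketch of the average throughput a mirror rebuild would need in order to copy a 37GB drive in 15 to 20 minutes. The function name and the decimal-gigabyte convention are illustrative assumptions, not something from the post, and a real controller's rebuild rate also depends on load and rebuild priority.

```python
def rebuild_throughput_mb_s(drive_gb: float, minutes: float) -> float:
    """Average sustained rate (MB/s) needed to copy a whole drive in the given time.

    Assumes decimal units (1 GB = 1000 MB) and a full-surface copy,
    which is how a simple RAID 1 rebuild is usually described.
    """
    drive_mb = drive_gb * 1000
    return drive_mb / (minutes * 60)


# Figures quoted in the post: a 37GB drive, rebuilt in about 15-20 minutes.
fast = rebuild_throughput_mb_s(37, 15)
slow = rebuild_throughput_mb_s(37, 20)
print(f"{slow:.0f}-{fast:.0f} MB/s")  # prints "31-41 MB/s"
```

That is a window of well under half an hour, which is all it took here for the surviving mirror to fail.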

The failure killed all store apps except point of sale (running on its own redundant system). Level 3 support had to upload the 17GB "new install" image to the FTP site, then the on-site tech had to wake up from his nap long enough to start the download. The tech then put the image on a hard drive and Ghosted it onto the server. Finally, level 3 talked him through the initial site configuration to get it up on the network so they could dial into the server and finish the recovery.

Server recovery took about 18 hours, most of that time watching the bits go by. Thankfully, I wasn't the tech on the ground for that one, but it sure made a long day for the guy involved.

17GB over a WAN link....yowza.

by robo_dev In reply to IBM X236

There are some things you just 'don't want to be there for'

In this case

by NickNielsen In reply to 17GB over a WAN link....y ...

The tech involved got permission to connect to the client network and log through the firewall to our corporate FTP site. Still not fast down a T1 pipe, but beat the h3ll out of the Verizon air card that was his other option.

For all of my stores, except probably those in Myrtle Beach, it would have been faster for me to drive home, do the download there, and drive back.
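For a rough sense of why a 17GB image down a T1 hurts, the back-of-the-envelope arithmetic can be sketched in Python. The function name, the `efficiency` parameter, and the decimal-gigabyte convention are illustrative assumptions; the only figures taken from the thread are the 17GB image size and the T1 link, and 1.544 Mbit/s is the nominal T1 line rate.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb over a link of link_mbps (megabits per second).

    efficiency is a hypothetical fudge factor (0-1] for protocol overhead
    and contention; 1.0 means the full nominal line rate is available.
    """
    bits = size_gb * 1e9 * 8                    # decimal GB -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


T1_MBPS = 1.544  # nominal T1 line rate

print(round(transfer_hours(17, T1_MBPS), 1))  # prints 24.5
```

At the nominal rate that is about a day for the full 17GB. The thread does not say how big the image was on the wire or what the link actually sustained, so treat this as an order-of-magnitude figure rather than a reconstruction of the actual download.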

Been there....

by JamesRL In reply to IBM X236

My group consists of the third level types. We support the field types.

What we do in that situation, which we find ourselves in about once a month, is build a set of drives in the lab. We then overnight them. I've driven to the airport to ship via airline before.

I think that would have been a consideration

by NickNielsen In reply to Been there....

had the point-of-sale been down as well.

One of the most commonly observed best practices in retail is to install the point-of-sale controller on its own system, separate from the store application server. Even when you can't inventory or order, you can still sell. Redundant mirrored controllers are another best practice, but I've only seen them consistently implemented in 24-hour stores.

I wish our customers would buy that

by JamesRL In reply to I think that would have b ...

The trend we have now is ASP: we have huge redundant datacenters, and when customers are buying for the first time or renewing, well over 60% are choosing ASP over in-store servers.
