Disaster Recovery

Disaster recovery: What's the worst that can happen?

Disaster planning is like buying car insurance -- you have to have it, but you hope you never need it. When the worst happens, you'll appreciate the extra effort.

We had a major outage at work this week. It will end up costing the company a small fortune, so I hope that the company will learn its lessons from the episode; otherwise it will be a lost opportunity of the worst kind.

-------------------------------------------------------------------------------------------------------------------

Planning for disaster is essential. OK, so if my home laptop gets dropped, it will hurt me a little, but nobody will be out of work, nor will I lose any more than the cost of a replacement. It’s a different story when you are talking about a server running an application on which a lot of people rely. If you have workers standing idle, the costs start to mount up. It isn’t only their hourly pay that is running to waste; it is also the loss of production, the annoyance to customers, and the sheer frustration suffered all round.

All this needs to be budgeted for. As always, the people with control of the purse strings will have to be convinced that backup and disaster recovery plans are necessary. After all, nobody wants to spend money on equipment they hope never to use, yet putting a duplicate system in place is a minor cost compared with the amount of trouble a loss of service will cause.
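
To make the budget argument concrete, here is a rough back-of-the-envelope sketch. Every figure in it is invented for illustration and is not taken from the incident described in this post:

```python
# Rough downtime-cost sketch; all figures are hypothetical placeholders.
idle_workers = 40                # staff who cannot work while the system is down
hourly_pay = 25.0                # average loaded hourly cost per worker
lost_revenue_per_hour = 1500.0   # production/orders not fulfilled per hour of outage
outage_hours = 3.5 * 8           # e.g. Monday morning to Thursday afternoon, working hours only

wage_waste = idle_workers * hourly_pay * outage_hours
revenue_loss = lost_revenue_per_hour * outage_hours
total = wage_waste + revenue_loss

print(f"Wages paid for no output: ${wage_waste:,.0f}")
print(f"Lost production:          ${revenue_loss:,.0f}")
print(f"Estimated outage cost:    ${total:,.0f}")
# Compare that total with the one-off price of a standby server.
```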

Working out the worst-case scenario is important. In the words of the Dr. Pepper advert, you have to ask yourself, "What’s the worst that can happen?"

Basically, you have to plan for a total loss of your system -- whether by theft, fire, earthquake, or terrorist action. You have to decide how long you can be without that system before you start to lose an unacceptable amount of service.

So what is the worst that can happen? Well, for us it was the call logging system going down on Monday morning and not coming back up until Thursday afternoon. It was a very stressful week, with work arriving by phone, usually several calls at a time. The office staff were working flat out taking calls, writing down the details, and calling the jobs through to the field engineers. In turn, we had to keep notes of everything and pray that we didn’t need to use the system to order replacement parts or look up any customer information.

Somehow we managed to get through the week without anyone getting killed, but it stretched our customer service skills to the limit. I spent a lot of time apologizing to customers for arriving late, and it didn’t help that my nearest colleague was away on a training course and I had to cover his area as well as my own. The good news is that this week I have an extra day off for the May Day bank holiday, so I will take the extra time to unwind on the beach.

I sincerely hope that this catastrophe doesn’t happen again, but if it does, I trust that the application support people will have a better plan in place.

19 comments
Da Saint

DR Pet Peeve - Tape Backup: This cartridge needs this drive, which needs this driver, which uses this version of backup software (and can somebody tell me why it's not backward compatible!), which only works with this OS... (I can see heads nodding.) I work with small/medium businesses and home office users primarily. With the price of USB drives today, I recommend getting TWO drives: one attached for unattended backup, and the other kept offsite (anywhere!), swapped with the attached drive on a regular basis. This system is flawless and has already saved one of my customers.
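
A minimal sketch of the unattended half of that two-drive scheme, assuming the attached drive mounts at a fixed path; the paths and folder names here are invented for illustration, not the actual setup described above:

```python
# Minimal unattended-backup sketch for the attached USB drive.
# Paths are hypothetical; the second (offsite) drive is handled by physically swapping it in.
import shutil
from datetime import date
from pathlib import Path

SOURCE = Path(r"C:\CompanyData")    # data to protect (assumed location)
USB_ROOT = Path(r"E:\Backups")      # attached USB drive (assumed mount point)

def nightly_backup():
    target = USB_ROOT / f"backup-{date.today().isoformat()}"
    shutil.copytree(SOURCE, target)  # full dated copy; rotate the drive offsite on schedule
    print(f"Backed up {SOURCE} to {target}")

if __name__ == "__main__":
    nightly_backup()
```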

user support

Fire drills, disaster recovery exercises, and the like are intangibles, like insurance policies, that help avoid large losses of time and money. Our group only tests a mid-range computer twice a year at an off-site location; restoring servers is tasked to another group. Our goal is to make sure the users can confirm whether the proper backup tape has been restored. In the beginning of our testing this was always a problem. Currently, there is only a hiccup if we change from personnel who test on a regular basis to a new employee. Replacing testers occasionally is a good thing, because the new tester may have to use documentation to perform their job; this is one way to find out if user documentation has to be updated. (Note: it is up to users to update documentation if the job process changes between DR exercises.)

After it has been verified that the system has been restored to the correct date, it is up to the users to input and modify data to see if that process works. During testing, personnel fill out checklists recording whether each process, A to Z, was successful. Any time IT needs to intervene, we ask the user to mark the test as not successful and note why IT was needed to make the process work. After testing is complete, IT staff takes all the checklists, including their own, and compiles an After Action report of what went right, what went wrong, and what can be done to improve testing.
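
The compilation step at the end could look something like this sketch; the file layout and column names are assumptions for illustration, not the actual checklist forms described above:

```python
# Sketch of compiling tester checklists into an after-action summary.
# Assumes one CSV per tester with columns: process, result, comment.
import csv
from pathlib import Path

CHECKLIST_DIR = Path("dr_test_checklists")   # hypothetical folder of collected checklists

def compile_after_action_report():
    passed, failed = 0, []
    for sheet in CHECKLIST_DIR.glob("*.csv"):
        with sheet.open(newline="") as f:
            for row in csv.DictReader(f):
                if row["result"].strip().lower() == "successful":
                    passed += 1
                else:
                    # Anything needing IT intervention is recorded as not successful, with the reason.
                    failed.append((sheet.stem, row["process"], row["comment"]))
    print(f"Processes verified: {passed}, problems: {len(failed)}")
    for tester, process, comment in failed:
        print(f"  {tester}: {process} -- {comment}")

if __name__ == "__main__":
    compile_after_action_report()
```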

grant

I had a client who had propped the back door open to get some fresh air into the office -- good idea, as it cut the cost of an air-con system. They then phoned me to say that no one could connect to the server, which was in the back office. "Was" being the operative word: someone had nicked the server! Ouch!

reisen55

World Trade Center, 101st floor, South Tower survivor writing. I was system administrator for Aon Consulting, and our servers crashed all the way from the 103rd floor LAN room to the street. THAT's a crash. When we all returned to our new corporate office, walking in to see about 700 Dell desktops and 300 laptops was terrifying. As a small business consultant, I take the lessons from this event and apply them to my accounts. My main goal is that Disaster Recovery per se does not exist for my accounts. I protect them eight ways over. They have no data EVER to lose, because I have it protected through dual redundancy channels and also periodically TEST my recovery scenarios fully and completely. Failed server restored to operational in 10 minutes or less. Why? I test, understand the procedures that have to be performed, and keep spare hard drives all set to go on my office shelf offsite. Been there, done that.

Jeff Dray

It made us look like a bunch of amateurs

reisen55

SATA hard drives are a bargain, FAR more reliable and faster than tape, AND far cheaper than INTERNET STORAGE too. Oh, do I have issues with that one: 200 GB of net storage costs $1,999.99 per year at one site, while a one-terabyte SATA drive costs $90 or so at MicroCenter. I work in a triad format: 1 = primary system with the data on it; 2 = onsite backup media (SATA, as above); 3 = offsite backup media (SATA, as above). I also ghost image critical systems, and that includes servers. I have drives on my shelf for Windows servers, so in the event of a crash I can reinstall and copy up the relevant data for a 5-minute restoration. KUDOS SIR.
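
A minimal sketch of that triad layout, with invented drive paths; a real setup would more likely use rsync or robocopy plus checksum verification rather than a plain copy:

```python
# Sketch of the "triad" layout: live data plus onsite and offsite SATA copies.
# Drive letters and paths are hypothetical placeholders.
import shutil
from pathlib import Path

PRIMARY = Path(r"D:\LiveData")   # 1: primary system data
ONSITE  = Path(r"F:\Mirror")     # 2: onsite SATA backup drive
OFFSITE = Path(r"G:\Mirror")     # 3: offsite SATA drive, attached only when rotated in

def mirror(src: Path, dst: Path):
    # dirs_exist_ok lets repeated runs refresh an existing mirror (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)

if __name__ == "__main__":
    mirror(PRIMARY, ONSITE)
    if OFFSITE.exists():         # only when the offsite drive has been swapped in
        mirror(PRIMARY, OFFSITE)
```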

Jeff Dray

If you hadn't seen it with your own eyes. I once heard someone define an intellectual in Swansea as someone who knew how to wear his baseball cap the right way round. My bosses think that I can 'pop' to Swansea because it must be near me as the names are nearly the same!!! (Swanage)

robo_dev

If it hasn't been tested, the best plan in the world will be brought to its knees by some silly database patch or router config change. I've worked with some companies which treat DR like they do SOX testing... some forms to fill out to make the auditors happy. And I've worked with other companies who take it very seriously. Granted, the commitment to DR in terms of cost and time should be in line with the value of what you're protecting and what you would lose if it fails. But even the simplest plan for the smallest company should be tested, or it's not worth the paper it's printed on.

agould

We too take DR seriously, but we run into lots of difficulties trying to write the technical doc at all levels. We cannot figure out who the audience should be, and then what to include or leave out. Do you know of any good CURRENT models for writing doc? I'm talking technical -- how things work -- and operations -- how to work them -- if you see the difference. Any ideas would be appreciated... Thanks.

jmarkovic32

I had a NAS crash on me after bringing it up after a physical move. The NAS contained the database for our Accounts Payable software as well as server backups and some user documents (I was halfway through a file migration). Every quarter I did a complete audit of the backups, making sure that I could restore files from common scenarios. I felt good. I had the NAS backing up to itself on another partition and duplicated that backup to a tape in the event that the entire NAS died.

Well, the OS on the NAS crashed after moving it to a new data center. I was not able to restore the software and had to call the vendor because I thought I was still under warranty. Wrong! The device was now owned by a new vendor, only my hardware support had transferred, and I had to pay for software support (since it was the OS that crashed). It usually takes two weeks to purchase something at my company, and I didn't have the time. As a brief aside, the reason I went with a NAS appliance is that I didn't have to worry about OSes crashing, and if one did, I could just reload the firmware. But, lo and behold, this was actually a Linux box and Linux had a serious OS failure. I couldn't even get to the command line because the admin password didn't work! But I digress.

My saving grace (I thought) was the duplicate tape job. I had tested it and I knew it worked, but I always use it as a last resort because of the high recovery time. Well, my day got worse! Due to a glitch in the backup software, because I was encrypting the data on tape, the restore does not run when the source device (the NAS) is unavailable!!! I wanted to cry as I read the vendor's knowledgebase article from TWO YEARS AGO which said, "We know this problem exists and are working on a solution." My heart sank as I sat there in my office at 10pm. I had followed the book: I had multiple backups, audited the hell out of them, and tested every scenario I knew how, and I still got burned by a software glitch.

The result? Miraculously, while the OS of the NAS was toast, one of my servers was still mapped to one of the volumes and I was able to browse it. I copied what I could to an intermediate server (the A/P files included -- thankfully). Now I have everything running on a Windows file server. I abandoned the backup software for everything but archiving and now have a contract with a third-party company who handles our remote backups.

Lesson learned? You can't test for every scenario, including backup software glitches. Test what you can test and always give yourself multiple options. Never give yourself the "keys to the kingdom" when it comes to recovery; spread the responsibility to another qualified individual or a third-party company. You can try to be the hero all you want, but you'll set yourself up to be the goat as well.
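
A sketch of the kind of quarterly restore audit described above, assuming a test restore has already been written to scratch space; the paths and file names are hypothetical:

```python
# Sketch of a restore audit: compare a sample of restored files against the live copies
# using checksums. Paths, share names, and the sample list are invented for illustration.
import hashlib
from pathlib import Path

LIVE = Path(r"\\fileserver\finance")   # live data (assumed share)
RESTORED = Path(r"D:\restore_test")    # where the test restore was written

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit_restore(sample: list[str]) -> bool:
    ok = True
    for name in sample:
        live, restored = LIVE / name, RESTORED / name
        if not restored.exists() or sha256(live) != sha256(restored):
            print(f"MISMATCH: {name}")
            ok = False
    return ok

if __name__ == "__main__":
    # Hypothetical sample of files pulled from last night's backup job.
    print("Audit passed" if audit_restore(["ap_batch.mdb", "vendors.csv"]) else "Audit FAILED")
```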

reisen55

I have spent time recently with a datacenter that went nuts trying to get a good 54-server backup with BackupExec 12.5 -- a truly horrible product. BUT they never EVER considered how to put it all back together. A co-consultant I work with, Harvey Betan, recently conducted a restore simulation with a client at the IBM facility in Sterling Forest, NY; it was a long evening but well worth the experiment. Mistakes were made and documented. EXACTLY the idea. For my clients, I regularly test and update my protocols so I (and my clients) DO NOT HAVE A DISASTER TO RECOVER FROM!!! Only a temporary loss.

reisen55

I use WINAUDIT to conduct a full (near 500 page long) audit of every system I support, so the details (too many of them really) are always available. I use GHOST to create and save hard drive images of every system I support. And I have work-around redundancy options for any situation I can think of for data loss or service interrupt. Gee, the World Trade Center did teach me a thing or two.

NickNielsen

You already saw step 1. Step 2 was "Configure the network connections and verify the configurations by pinging the lab server." I finally got connected to the senior server admin who wrote those instructions. He had a hard time getting used to the idea that I was not a complete idiot, just a competent technician who had absolutely [u]zero[/u] Unix experience. About 6 hours into the process and halfway through the second cell phone battery, I finally got him to the point where he was issuing me the actual commands (e.g. "su netconfig") rather than telling me "OK, now configure the NIC." As I said, make no assumptions. In this case, the server admin had assumed that another Unix server admin would be doing the rebuild. We both learned a lot that day... and a week later, complete, accurate, step-by-step restore instructions were delivered to every tech in the field. B-) edit: clarify

Da Saint

Thanks for asking -- I meant that the basic OS and hardware setups shouldn't have to be spelled out. Of course, the particulars of how the core applications are set up are needed, but you shouldn't have to tell the guy how to put the box together.

NickNielsen

[i]Document the basic information that any tech worth his weight would need to recreate/recover the server. List the specs (connectivity, drive mappings, user IDs & rights, etc.), without having to list step by step instructions on how to build a server.[/i] Define step-by-step. Do you mean telling the tech to create a partition this size, then install the server OS? Or do you mean telling the tech how to use FDISK or GParted? The former is more than acceptable and should be included; the latter will insult the intelligence of anybody performing the procedure. When creating the docs:

* Assume the restore tech knows [u]nothing[/u] about your system.
* Include [u]all[/u] required steps; this includes installing and configuring any server apps, restoring data, system tweaks, etc. You don't know that you will be doing the restore.
* Make sure all configuration data is correct. Provide a matrix or guide for calculating any conditional entries.
* Verify your documentation to make sure it is accurate. Give the docs to somebody who is technically competent but has never done the procedure and let him work through the steps. Take notes and make corrections. Repeat until no corrections are necessary.

I've been the poor SOB on the front line doing the rebuild/restore. There's nothing quite like being told you have six hours to restore the server and reading a first step of "Restore and configure the operating system and all server applications."

Da Saint

KISS! Your audience is the guy who's recovering your system. Document the basic information that any tech worth his weight would need to recreate/recover the server. List the specs (connectivity, drive mappings, user IDs & rights, etc.) without having to list step-by-step instructions on how to build a server. Include the normal location of the data and the current backup as well. On the desktop/laptop side, document how you access the server. You can expand this to include drawings of the network, etc., but the basic info on how you use the system is what's really needed at the time of the disaster to get back up and running.
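
One possible shape for that kind of spec sheet, sketched here with made-up placeholder values (server names, addresses, and paths are all invented for illustration):

```python
# Minimal recovery spec sheet along the lines described above.
# Every value here is a hypothetical placeholder; the point is the shape, not the data.
import json

recovery_spec = {
    "server": "APP01",
    "role": "call-logging application server",
    "connectivity": {"ip": "192.0.2.10", "gateway": "192.0.2.1", "dns": ["192.0.2.53"]},
    "drive_mappings": {"S:": r"\\APP01\shared", "L:": r"\\APP01\logs"},
    "service_accounts": ["svc_calllog (local admin on APP01)"],
    "data_location": r"D:\CallLog\data",
    "current_backup": "offsite USB drive, swapped weekly",
    "client_access": "users map S: via login script",
}

# Write the sheet somewhere it will survive the disaster, e.g. the offsite backup.
with open("APP01_recovery_spec.json", "w") as f:
    json.dump(recovery_spec, f, indent=2)
```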

Neon Samurai

"But, lo and behold, this was actually a linux box and linux had a serious OS failure. I couldn't even get to the command line because the admin password didn't work!" I'd specify that it wasn't "Linux" that ate your data but rather the manufacturer's hardware product. That is totally up to them to work with you on fixing as far as I can see. Well, unless you had an indication that the software failure was specifically in the kernel. It is possible but the bigger problems seem to be political ones with the company. Wow though, that's a rough day.. heck, rough week at minimum for you. I don't envy you at all with that kind of fisting from your hardware vendors. Offhand, what was the NAS maker and model and what was the backup software or was it included with the NAS? I have a few appliances with clients so it would be handy to know what vendors to watch closely. For something like that, I probably would ahve done a Debian, BSD or OpenNAS install on a self-built box; full network speeds, fully configurable hardware RAID through motherboard and the possibility of booting from a liveCD in cases such as yours. This idea doesn't help much after the fact though unless your looking at more dedicated network storage.

reisen55

It all depends on the financial commitment to risk tolerance levels. Having another office or center with 100% replication is enormously expensive. Like inventory controls, getting from 95% compliance to 100% costs a ton of money for that extra bit of coverage. It depends on your business and how much downtime you can tolerate. My theory is that minimal is always good, and periodic pre-testing gives you your downtime numbers from failure to restoration. Look up Harvey Betan -- he is my old manager from Aon Consulting (World Trade Center) and a brilliant BCP/DR planner.

robo_dev

We're talking $50K-70K per month. Data is mirrored in real time to a SAN at a hot site, with one floor of an office building set up with desks, PCs... even food, water, and cots. Annual testing involves performing a complete business day of actual customer transactions/business at the remote site. Guess what: it works. I've seen other companies where DR testing means seeing if they can restore a couple of databases to the mainframe from tape... woo hoo.
