
An outage: Lessons learned

Westminster College does not currently have a backup generator protecting its data center and telephone system. That is about to change. Backup generators are no longer optional. That, along with several other lessons, is what we learned from recent outages at the college.

Over the past couple of days, the Westminster College data center has experienced two total failures.  Sunday evening, storms ripped through Fulton, Missouri, and took electrical service with them.  The batteries in our data center are good for only about 45 minutes of backup power before systems go down.  Sunday night's electrical outage lasted longer than 45 minutes, so the data center and all of its services, including networking, wireless networking, DNS, DHCP, ERP, email, the Internet gateway, and so on, went down for the count.  Although power glitches are not unusual here, it is unusual for an outage to last that long.  I was three states away at the time, so my staff brought the data center back up and we went on our merry way.

Yesterday, I made the trek back from Louisville, Kentucky after having a fantastic visit with the TechRepublic staff (thanks, guys!).  I got home late, so I checked my calendar and, lacking any meetings for Tuesday morning, decided to sleep in and head into the office mid-morning.

The best-laid plans...

Around 9 AM, my boss, the president of Westminster College, called me on my cell phone.  He didn't mind that I was still groggy, but he did tell me, "You might want to come in.  The hill behind Westminster Hall [our main admin building and home of our data center] gave way last night and took out electrical service to the building."  Uh-oh.

I got ready as quickly as possible and went into the office to survey the damage.  Indeed, the hill had begun to collapse.  We've gotten a ton of rain this year, and it finally caught up with us.  During the collapse, the main electrical feed to Westminster Hall was literally torn from the transformer that powers the building, two other buildings, and our data center.  The transformer itself was damaged beyond repair.  The city's electrical workers and the college's plant operations staff worked tirelessly today to restore electrical service to, at a minimum, Westminster Hall, and in the end they managed to get power back to all three buildings.  The transformer was replaced, and new, but temporary, service lines were run to it so that the buildings could be energized.  In all, we were down from about 11 PM Monday evening until around 4 PM Tuesday afternoon.

Without Westminster Hall, the college has no data network, no telephone system (the phone system batteries, after fighting valiantly for about twelve hours, finally succumbed to the inevitable), no Internet and no servers.  Worse, today is the day that payroll has to be run.  After fighting with a couple of inadequate backup generators, we finally simply moved the necessary hardware to another building and performed the tasks necessary to get payroll done.  We also took advantage of the unplanned downtime to finish some work we've been wanting to do in our server room.

We learned a number of lessons today:

  • A backup generator is no longer optional.  We've actually already begun the planning to install a backup generator for our data center and phone system.  An electrical engineer visited campus a couple of weeks ago to help us plan our efforts.  Although I met no resistance from the executive team when I initially proposed this installation, today's events sealed the deal in a way that I would never be able to articulate.  Without our data center, no one could do their jobs.  We sent people home and struggled to handle payroll.  IT isn't a "side-by-side" operation anymore like it was in the old days.  We can't just revert to paper and pencil to handle business operations.
  • You can't plan for everything.  In our incident planning discussions, we never talked about the possibility of a landslide.  This is Missouri.  Flat country.  Sure, we're on a hill, but this is a Missouri hill we're talking about, not some place from the Pacific shores!  Our incident responses must be flexible enough to be applied to any incident, not just the ones we define as likely possibilities.
  • Focus on the critical things and consider the rest to be gravy.  Today, payroll was job #1.  Our last summer group left campus last week, and we have no students or faculty on campus, but people still have to be paid on time and in the way they expect.  Early on, we decided to hold out for a generator being brought in by the city that would have been able to power our whole data center while the workers replaced the transformer.  The generator was to be wired into one of the building panels that feeds the data center.  After three hours of work, we found that the generator was not putting out the right voltage, and the unit was determined to be bad.  So, in hindsight, we blew three hours of payroll processing time hoping that the "big win" (getting the whole data center energized) would come to fruition.  We should have focused on the critical element, payroll, and looked at anything beyond that as gravy.  Instead of waiting to start payroll processing at 2 PM, after moving servers to another building at 1:30 PM, we should have moved the servers first thing this morning so as not to risk the 4 PM payroll deadline imposed on us by our bank.
  • Have good relationships with outside agencies.  Our city crews really did amazing work today.  They went out of their way to make sure that power was restored as quickly as possible.  We enjoy good relations with the city, though, and I'm sure that goodwill played into our restoration.

The good news: I'm writing this blog posting from my work computer Tuesday night, after power has been up for a few hours.  Although the situation we encountered was serious, there are a lot of takeaways that we can now apply to the next incident and use to improve our systems.

About

Since 1994, Scott Lowe has been providing technology solutions to a variety of organizations. After spending 10 years in multiple CIO roles, Scott is now an independent consultant, blogger, author, owner of The 1610 Group, and a Senior IT Executive w...

25 comments
tuan_cassim2000

I need lessons on programming with Excel. If you can help me learn this program, please email me at tuan_cassim2000@yahoo.com.

YourAverageManager

No confusion, I appreciate and look forward to your posts, Scott! Scott relayed, "In our incident planning discussions, we never talked about the possibility of a landslide." While it would have been amusing to say that a three-year-old on a tricycle knocked the pole down, do you see any difference in how you would ultimately need to respond? There are many ways for failure to occur at that same failure point, but your Business Continuity Planning (BCP) and DR response will be the same. BCP efforts are focused by planning for the worst-case scenario. It does not matter how the electric power service to the facility failed; to plan a response, all you need to plan for is electric utility infrastructure failure. Not having sufficient UPS capacity is a present reality, part of the Business Impact Analysis, and part of the plan, meaning some risks are accepted and held by the organization. Decision makers may change their minds after living with the reality of that decision, or they may experience a moment of lucidity after reading the impact as described in a BIA report. Generators can cost millions. After hearing $1.2 million, we decided to make reciprocal agreements with a non-competitor within the same state.

mc4668

Dear Mr. Lowe, a very well written article! May I have your permission to translate it into Italian for the colleagues, clients, and friends whose data I have been protecting for a long time (I have been an 'applied cryptographer' since the seventies) and whose power back-up systems are kept in a miserable state? (One of them in May lost 3.5 GBytes of ENCRYPTED data: I have sympathized with them, but I cannot help smiling at the irony of the event.) Thank you for your attention. Alan P. Borsalino

Powell Heuer

Just one additional thing to keep in mind when planning for back-up generators: consider your longest expected running time, and therefore what size fuel tanks you need, including refueling arrangements. I've heard some sad stories of people who had good back-up generators but found that they didn't have big enough fuel tanks!

bkrateku

Two years ago was the worst case of outages we've ever had here. The power went down almost once a week, for varying lengths of time, and this lasted much of the year. IT had been asking for a generator for years, and this finally got us one. We hadn't had the generator long (it may not even have been 2007 yet) when they told us they were upgrading the city's transformers and related equipment. Each of the three planned outages was allotted four hours; the longest lasted 3 1/2 hours. Fortunately, after those changes, we rarely have an outage of any length now, but we still use the generator when they do come, to keep our services going. That's especially good given that we're a bank. The only downside to the generator? Back in the day, examiners would ask us if we regularly tested our UPSs. We always said, "No, the power company does that for us." Now it's more of a manual operation. :)

databall

In dealing with power failures: the "get it done" voice on my left shoulder says go for the backup generator, even if it means moving the machines offsite. But the "do it right" voice on my other shoulder tells me to groom the system so that the servers come right back online after a power outage, without requiring manual intervention. In a system where the user terminals are down in a power failure anyway (i.e., constant server uptime isn't an absolute must), which voice should I listen to?

Eugene

Generators can also be made redundant, with N+1 modules. For instance, if you need 300 kW, you can do it with three 100 kW units plus one more, or two 150 kW units plus one more. Then taking a generator out of operation for service or repair is not a problem, and additional modules can be added later as your needs expand. Natural gas is usually the fuel of choice where earthquakes are not a concern, because its continuous supply through the pipeline means there are no re-supply problems. Where natural gas is not available, propane should often be considered instead of diesel, because diesel fuel can deteriorate in storage in as little as six months and poses various pollution challenges. Generators for all levels of government and non-profits can be obtained on a NASPO multi-state contract available to all states without having to go through the bidding process, with engineering assistance included on the contract. Other buyers can use the special tax deduction in the Economic Stimulus Act, but it expires 12/31/08. Also, to avoid budget delays, see the list of grant money that may be available to you, or a friendly legislator can get you a "member item" grant. See www.BetterPower.us for more info on these topics.

Photogenic Memory

Without adding too many familiar details, in case fingers get pointed and I lose my job (the owner is genuinely psychotic): the company I work for has had this happen several times. Some of these steps can be carried out on a much smaller scale. Anyway, here's what they did:

1.) Invested money in upgraded backup generators (diesel powered and maintained regularly).
2.) Invested money in serious UPSs (they'll carry the load for a little while if the backup generator doesn't kick in, but not forever).
3.) Distributed critical functions across different buildings for redundancy: servers, support, and preparation for role transition in case a problem arises.
4.) Worked out troubleshooting strategies covering primary concerns, secondary concerns, and the repercussions afterwards. Examples: Primary = customer/company data, WAN connectivity (the routers and switches), and the PBX. Secondary = LAN connectivity (the aforementioned plus the servers and workstations) and restarting the applications they're needed for. Repercussions = preparing for possible reimbursement of lost services to customers, problems with the equipment afterwards, and staff burnout from the increased workload (most companies don't give a shit about humans, so it may not apply).

Notes to pay attention to if you have to support the hardware: unconditioned street power can be dangerous to servers because of surges, but low voltage is worse. Low voltage can be more harmful to electronics without enough capacitance; you might see servers behaving strangely or corrupted data. Keep this in mind if your UPS is in bypass mode to the street. Also be prepared to be laid off due to the company's need to not lose any more money, even if you think you're a useful staff member. Remember, a business and its owners don't think in human terms; in their eyes, you're nothing more than a loss of profit that is not always justified!

kpcamp

DR planning is coming to a head for everyone these days. I have done quite a bit of research, as we need to have a plan, but doing it properly is financially out of reach. I ran into a self-contained mobile rack. Google "SPEAR mobile rack". This is something that we are looking into.

billcooey

Power is only one consideration. Natural disasters usually take out the comm lines when the power goes. Also, a generator that will fully support data services is very expensive to install and maintain. Anything less than full replacement power means that you can only barely run mission-critical services. In this day of wireless capabilities, you would be much better off taking your PBX offsite to a hardened data center so that you can re-route communication to wireless devices for the duration of the outage. I would think long and hard about which other applications are truly mission critical to the college and couldn't be left down. The key is to keep systems alive so you don't have to reprogram anything. bc

matt.graybiel

I find it amazing that it takes a disaster of some sort to MAKE management realize that having a real disaster recovery and business continuity plan is a good idea. If only there had been a simple plan in place, like having a hosted site in KC or somewhere 15 miles down the road. You guys got lucky. Hopefully your management realizes that and pops for more than just a generator.

bandman

Sorry to hear that you're going through all of those issues, but you touched on the silver lining: massive, total failure of the infrastructure frees you up to perform the sort of maintenance that you never get windows for. I don't know your particular situation, but when it comes time to plan the generator for the building, don't make the mistake of forgetting to factor in AC requirements. Dell's datacenter capacity planning tool is an immeasurable asset when trying to determine power/cooling requirements (since they're nearly the same thing). It's going to be a rough upgrade, but you'll be a better admin because of it.

reisen55

A colleague of mine is a certified consultant in this field and a good friend too. We are both amazed at the lack of BCP/DR plans at most firms today. I believe an article in STORAGE MAGAZINE indicated that 47% of firms do not have such a plan in place, and of those that do, many treat the plan as a historical memento rather than a living document. Plans should be updated every few months, reviewed for changes, and also TESTED whenever possible. The time to test a DR process is not at 2 AM when you are actually doing the recovery. Ever. I speak from experience: I was part of the team at Aon Group that rebuilt the New York segment of our network when our primary office (2 World Trade Center) suffered the ultimate server crash.

Scott Lowe

YourAverageManager, you raise excellent points. I would put us in the early stages of realizing just how important BCP is to the organization. For example, there have been requests to install a backup generator, but funding hasn't been made available, with some people indicating that "if computers are down, we'll just do it by hand." Well... no one remembers how! We do have an executive team on board now that is much more savvy about these kinds of issues, but we're obviously still learning. Thank you for raising these points. Our planning needs to be general enough to cover all bases and flexible enough to address specific situations. Scott

YourAverageManager

Disaster Recovery Journal (DRJ) had an article on the hidden risks. One was fuel line seal failure due to diesel fuel reformulation requirements.

Scott Lowe

Honestly, both. We are configuring our systems to come back online after a power outage, but also doing a generator. A generator won't cover every issue -- not even close -- but the unit we're looking at will likely power the data center plus our admin building. We're waiting for cost to determine feasibility.
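
For anyone curious about the "come back online without manual intervention" piece, here is a minimal, illustrative sketch of one way to handle it, assuming servers with IPMI-capable management controllers and the standard ipmitool utility; the hostnames, account name, and password file below are placeholders, not details from our environment. The other half is simply making sure the OS services you care about are enabled to start at boot.

```python
#!/usr/bin/env python3
"""Illustrative sketch: set each server's BMC power-restore policy so the
machine powers back on automatically when utility power returns, using the
standard ipmitool CLI. All hostnames and credentials below are placeholders."""

import subprocess

# Placeholder inventory of BMC/IPMI addresses -- replace with real hosts.
BMC_HOSTS = ["bmc-erp.example.edu", "bmc-mail.example.edu", "bmc-dns.example.edu"]
IPMI_USER = "admin"                  # assumed shared admin account on each BMC
IPMI_PASS_FILE = "/root/.ipmipass"   # password kept in a file, out of the script

def ipmi(host: str, *args: str) -> str:
    """Run an ipmitool command against one BMC over the LAN and return its output."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", IPMI_USER, "-f", IPMI_PASS_FILE, *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    for host in BMC_HOSTS:
        # Tell the BMC to power the chassis on whenever AC power is restored.
        ipmi(host, "chassis", "policy", "always-on")
        # Print chassis status, which includes the current power-restore policy.
        print(f"--- {host} ---")
        print(ipmi(host, "chassis", "status"))
```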

gordonmcke

Continuous disk replication is becoming important for any organization that depends on 24/7 uptime for its IT infrastructure. (Today, who doesn't that apply to?) With new virtualization technology, affordable broadband MAN circuits, and increased options for regional ISP/hosting sites, disaster recovery is affordable for SMB clients today. IT management should continue to educate senior executives about the new contingency options available. With the increased rate of natural and man-made disaster events (witness San Francisco last week), an emphasis on DR is prudent.

reisen55

We had a squirrel eat through the local power lines at our Morris Plains, NJ office. Cooked the poor little creature, of course, and brought the whole office down.

The 'G-Man.'

Get the company to show the brass and they can have all the DR they like. The problem is they don't like spending on items that may never be used. It is all down to money.

bandman

Did you guys rebuild in NYC? The whole reason I'm in central Ohio is that my company physically trucked its servers off the island the first day they were allowed back on. We're only now relocating the machines back to the area, this time in a co-location facility in NJ.

zclayton2

It's not just reformulation; a good Spill Prevention, Control, and Countermeasure (SPCC) plan will handle much of that, and it covers delivery error: don't have any vent piping near the tank drop unless you want fuel in it. And something not often thought of: how long is your storage time? Diesel will go bad. Not as in the old gasoline-to-varnish hoo-haw, but bugs will and do grow in it after several months. Your system may be tolerant of the degradation, but check it out. I work with some people who have to test and maintain against that reality.

bandman

We had a squirrel eat through a SONET ring once at an ISP I used to work at. Rough way to find out that failover to the backup ring didn't work.

reisen55

Yes, we rebuilt in the city, but in different office locations. I was with Aon Group, and after spending about two weeks at the Greenwich, CT office (where emergency staff were located; we had folks all over the place), we found a new office at 685 Third Avenue and rebuilt that office. I remember the first day there, being confronted with 400 OptiPlex systems sitting in the reception area alone. In 2003 we moved to 55 East 52nd Street and had to rebuild that. Aon already had space at 199 Water Street, which was enhanced over time too. 55 East was eventually abandoned due to high rent (d'oh, it is on Park Avenue, so what do you think...), and everybody moved into 199 Water. I have a little award from Aon for the rebuilding efforts (we were back to work within two days of that black day) and was also given $500 in spendable American Express checks. As we were all outsourced and fired in late 2005, it is a most hurtful award to have.

bkrateku

Just curious...would a fuel stabilizer work for this diesel engine? I know they have it for gas. We use propane on ours.

bandman

Wow, that's a kick in the gut. It should have been all the experience you needed to get another position doing something similar, though. You'd hope, anyway.
