Over the past couple of days, the Westminster College data center has experienced two total failures.  Sunday evening, storms ripped through Fulton, Missouri and took electrical service with them.  The batteries in our data center are good for only about 45 minutes of backup power before systems go down.  Sunday night’s electrical outage lasted longer than that, so the data center and all of its services, including wired and wireless networking, DNS, DHCP, ERP, email, the Internet gateway, etc., went down for the count.  Although power glitches are not unusual here, it is unusual for an outage to last that long.  I was three states away at the time, so my staff brought our data center back up and we went on our merry way.

Yesterday, I made the trek back from Louisville, Kentucky after having a fantastic visit with the TechRepublic staff (thanks, guys!).  I got home late, so I checked my calendar and, lacking any meetings for Tuesday morning, decided to sleep in and head into the office mid-morning.

The best laid plans…

Around 9AM, my boss, the president of Westminster College, calls me on my cell phone.  He doesn’t mind that I’m still groggy, but he does tell me, “You might want to come in.  The hill behind Westminster Hall [our main admin building and home of our data center] gave way last night and took out electrical service to the building.”  Uh-oh.

I got ready as quickly as possible and went into the office to survey the damage.  Indeed, the hill had begun to collapse.  We’ve gotten a ton of rain this year and it finally caught up to us.  During the collapse, the main electrical feed to Westminster Hall was torn, literally, from the transformer that powers the building as well as two other buildings and our data center.  The transformer itself was damaged beyond repair.  Our city electrical workers and the college’s plant operations staff worked tirelessly today to restore electrical service.  The goal was to get power back to Westminster Hall at a minimum, but they managed to restore it to all three buildings.  The transformer was replaced and new, but temporary, service lines were run to it so that the buildings could be energized.  In all, we were down from about 11PM Monday evening until around 4PM Tuesday afternoon.

Without Westminster Hall, the college has no data network, no telephone system (the phone system batteries, after fighting valiantly for about twelve hours, finally succumbed to the inevitable), no Internet and no servers.  Worse, today is the day that payroll has to be run.  After fighting with a couple of inadequate backup generators, we finally simply moved the necessary hardware to another building and performed the tasks necessary to get payroll done.  We also took advantage of the unplanned downtime to finish some work we’ve been wanting to do in our server room.

We learned a number of lessons today:

  • A backup generator is no longer optional.  We’ve actually already begun planning to install a backup generator for our data center and phone system.  An electrical engineer visited campus a couple of weeks ago to help us plan our efforts.  Although I met no resistance from the executive team when I initially proposed this installation, today’s events sealed the deal in a way that I would never have been able to articulate.  Without our data center, no one could do their jobs.  We sent people home and struggled to handle payroll.  IT isn’t a “side-by-side” operation anymore like it was in the old days.  We can’t just revert to paper and pencil to handle business operations.
  • You can’t plan for everything.  In our incident planning discussions, we never talked about the possibility of a landslide.  This is Missouri.  Flat country.  Sure, we’re on a hill, but this is a Missouri hill we’re talking about, not some place from the Pacific shores!  Our incident responses must be flexible enough to be applied to any incident, not just the ones we define as likely possibilities.
  • Focus on the critical things and consider the rest to be gravy.  Today, payroll was job #1.  Our last summer group left campus last week and we have no students or faculty on campus, so most other services could wait.  People, however, have to be paid on time and in an expected way.  Early on, we decided to hold out for a generator that was being brought in by the city that would have been able to power our whole data center while the workers replaced the transformer.  The generator was to be wired into one of the building panels that serves the data center.  After three hours of work, we found that the generator was not putting out the right voltage and it was determined that the unit was bad.  So, in hindsight, we blew three hours of payroll processing time hoping that the “big win” (getting the whole data center energized) would come to fruition.  We should have focused on the critical element, payroll, and looked at anything beyond that as gravy.  Rather than moving servers to another building at 1:30PM and starting payroll processing at 2PM, we should have moved the servers first thing this morning so as not to risk the 4PM payroll deadline imposed on us by our bank.
  • Have good relationships with outside agencies.  Our city crews really did amazing work today.  They went out of their way to make sure that power was restored as quickly as possible.  We enjoy good relations with the city, and I’m sure that goodwill played into how quickly our service was restored.

The good news: I’m writing this blog posting from my work computer Tuesday night after power has been up for a few hours.  Although the situation we encountered was serious, there are a lot of takeaways that we can now apply to our next incident and use to improve our systems.