Working largely as a system/network administrator in IT for the past 25 years has provided me with a multitude of opportunities to make mistakes–and learn from them.

You might say I’ve seen it all. Examples range from forgetting which server I was logged into and shutting down the wrong one, to making Windows registry changes without a valid backup, to installing single servers with no redundancy in case they malfunctioned, to changing the IP address of a domain controller without fully considering the ramifications.

True, mistakes are humbling and humiliating, but they can also help you become an even better IT professional if, in the aftermath, you focus on why processes failed and how you can improve them. It’s quite rewarding to implement new and better processes and then see them protect your organization (and yourself).

Allow me to share one of my biggest mistakes, which took place several years back. It’s a good thing this happened before more stringent company controls and change approval mechanisms came into being, as it might have been what we call a CEM (career-ending maneuver). Sadly, this mistake cost my organization productivity, and it cost me a friendship.

SEE: IT help desk support SLA (Tech Pro Research)

The environment

I worked for a small company (about 200 people) that no longer exists. My IT group pretty much handled everything–server installation and maintenance, desktop support, patching, application rollouts, networking, firewall changes, backups, and of course, email. The organization had several typical departments, including customer service, finance, HR, and the like.

The plan

We had an Exchange 2010 environment spread across two sites: our primary site (where I worked) and our disaster recovery (DR) site, located in another state. We had three sets of Client Access and Mailbox servers; two were in our primary site, and one was in our DR site.

It was my responsibility as the Exchange administrator to test the failover of email operations from our primary site to our DR site, to ensure that if our main site went dead (such as during a power outage), we could successfully conduct email operations from the DR site.

This involved running a series of Exchange PowerShell scripts to activate the databases in the DR site, using commands such as Move-ActiveMailboxDatabase.

The process had been tested and had worked successfully in the past, so this should have been a routine 30-minute operation: run the scripts, check the databases, test the results, and resolve any issues. Email would be inaccessible for five to 15 minutes, depending on how quickly the processes completed. Of course, I ensured beforehand that the databases were all synchronized and up to date to streamline the transition.
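To give a rough idea of what such a switchover looks like, here is a minimal sketch; the database and server names ("DB01", "DR-MBX01") are hypothetical placeholders, and a real switchover involves additional validation beyond these steps.

```powershell
# Confirm the DR copy is healthy and caught up before switching over.
# "DB01" and "DR-MBX01" are hypothetical names used for illustration.
Get-MailboxDatabaseCopyStatus -Identity "DB01\DR-MBX01"

# Activate the database copy on the DR mailbox server.
Move-ActiveMailboxDatabase -Identity "DB01" -ActivateOnServer "DR-MBX01" -Confirm:$false

# Verify the database mounted on the intended server.
Get-MailboxDatabase "DB01" -Status | Format-List Name, Server, Mounted
```

In practice you would repeat this for each database, pausing to confirm each one mounts before moving on.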

SEE: IT leader’s guide to cyberattack recovery (Tech Pro Research)

It’s always tricky trying to find the right time to test these types of operations. I scheduled the work from noon to 12:30 pm, reasoning that most people would be at lunch, users could still work in Outlook via the cached mode feature (although they wouldn’t be able to send or receive new information), and if something went wrong I’d have several hours to fix it. I sent out a maintenance notice via email two days in advance and right before the test operation.

The ordeal

As it turned out, it took several hours more than I originally planned. In a nutshell, I couldn’t activate the email databases on the DR servers because I got a series of quorum errors involving the file share witness. Panicked, I tried to activate the original email databases in the primary site but ran into similar roadblocks. Each site seemed to think the other was the primary site and wouldn’t cooperate.

To summarize without getting too technical: neither set of databases would mount, and with Exchange 2010, no mounted databases meant no email.
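For readers facing something similar, the first diagnostic steps generally involve inspecting the DAG, its witness, and the underlying cluster; the names below ("DAG1", "DR-MBX01") are hypothetical, and the cluster cmdlets assume the Windows FailoverClusters module is available on a DAG member.

```powershell
# Inspect the DAG and its file share witness configuration ("DAG1" is hypothetical).
Get-DatabaseAvailabilityGroup -Identity "DAG1" -Status |
    Format-List Name, WitnessServer, WitnessDirectory, PrimaryActiveManager

# Check replication health on a mailbox server.
Test-ReplicationHealth -Server "DR-MBX01"

# View the underlying cluster quorum state (run on a DAG member).
Get-ClusterQuorum
Get-ClusterNode | Format-Table Name, State
```

If the nodes disagree about quorum, as they did in my case, the output here is usually where the disagreement first becomes visible.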

This particular Exchange error is like the Balrog from “The Lord of the Rings”: one of the most fearsome things you might ever encounter, and maddeningly difficult to solve. I gamely tried various tips and techniques from around the internet, none of which worked.

The 12:30 pm maintenance window came and went, and that’s when things kicked into high gear because users, understandably, wanted to get back on their email.

I had no way of notifying them that their email was still down because, of course, I couldn’t use email, either. So stragglers began peppering the help desk staff around the corner from me with complaints and questions about the email problem.

Interestingly, IT staffers and leaders were positively blasé about the issue. “I now have one less thing to worry about: dealing with my email,” the CIO actually joked. “Take your time!”

That was definitely helpful.

SEE: Incident response policy (Tech Pro Research)

The reaction

I set a time window of one hour to fix the issue, after which I would contact Microsoft Support. My reluctance to call right away stemmed from the fact that getting a case opened and actively worked on usually took so long that I felt I could fix the issue faster myself.

As it turned out, I couldn’t solve the issue in one hour. I opted to get Microsoft on the line.

Then it happened. As I was on the phone with Microsoft support, a customer service manager approached the help desk staff to demand an update about the email situation. I had always had a great relationship with this person, but the currency that sprang from that friendship proved meager indeed.

“Does he understand how important email is?” the manager bellowed. I could hear the help desk staff trying to calm the individual, who then snapped: “Email had BETTER be back up by 4:30 pm – or ELSE!” and departed.

I just shook my head thinking: “Or what?”

SEE: Power checklist: Managing backups (Tech Pro Research)

We fixed the problem–finally–after much detective work and by working with various tools and repair operations. Simply put, the Database Availability Group (DAG), which Exchange relies on, was in an unhealthy state before my operational work even began. Email was back by about 4 pm, which is why I am able to tell the tale.

Oh, and it took about six months for that customer service manager to get over it. I understand staff was impacted across the company, but the attitude was that I brought down the entire email system for fun and amusement. Not much of either was had during that outage.

Lessons learned

A lot went wrong, but I also learned a lot from the experience.

  • Always schedule after-hours maintenance.
  • Never assume something will work just because it did so before.
  • Test out the overall structure to confirm systems, applications, and components are healthy before you make any changes to them.
  • Have a valid backout plan rather than just undoing your work and hoping for the best.
  • Plan for an alternate way of notifying users (instant messages, texts, an internal company website for announcing issues, etc.) if you are doing something that might impact normal business communications.
  • If the unthinkable happens, put up a physical sign in your area stating: “We know about the [type of] problem. ETA for resolution: [X date and time]”–recognizing that it’s not always easy to come up with a reliable ETA, of course. This will reduce distractions and interruptions.
  • Get the best resources available ASAP. Obtain support contracts and streamline the process so that you can get vendor assistance as rapidly as possible.
  • Don’t take user hostility too seriously. In the end, nobody gets hurt or dies when IT failures occur (hopefully).
  • Document what went wrong and figure out how to prevent it in the future. In our case, we did all of the above, and the issue didn’t return.
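The health-check lesson above can be sketched as a simple pre-change script; the server names are hypothetical placeholders, and this is only a minimal illustration, not a complete validation procedure.

```powershell
# Hypothetical pre-change health check for an Exchange 2010 DAG.
# Server names below are placeholders for illustration.
$servers = "MBX01", "MBX02", "DR-MBX01"

foreach ($server in $servers) {
    # Reports the health of replication components on each mailbox server.
    Test-ReplicationHealth -Server $server
}

# Copies should report Healthy (passive) or Mounted (active),
# with low copy and replay queue lengths, before any failover work begins.
Get-MailboxDatabaseCopyStatus -Server "DR-MBX01" |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength
```

Had I run (and trusted) checks like these before the maintenance window, the unhealthy DAG would likely have surfaced before any databases were touched.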