When the e-mail system fails

System failures happen. Things break. The job of the tech is to fix the problem. That's what we get paid to do. However, unless you manage the human element of the repair process, you risk alienating your co-workers and giving management a bad impression of your professional abilities. This example shows that communication during the emergency is a critical part of keeping the respect of your peers and the trust of the business managers.

The calls started coming in as I was on my way to the office. Incoming e-mail from the outside was no longer getting through. Internal e-mail and site-to-site e-mail were OK. Mobile users were receiving ActiveSync and BlackBerry messages, and my Treo with Goodlink was fine. The problem was somewhere on our SMTP gateway. An e-mail outage is serious business. It is immediately escalated to a full-blown emergency.
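
The reasoning I did in the car can be written down as a rough sketch. The path names and the order of the checks below are my own labels for illustration, not anything built into Exchange or the gateway. The idea is simply that if everything except inbound mail from outside works, the gateway is the likely suspect.

```python
from typing import Dict

def locate_fault(path_ok: Dict[str, bool]) -> str:
    """Guess the failing component from which mail paths still work.

    The keys are hypothetical labels for the mail paths described above.
    A missing key is treated as a failing path.
    """
    if not path_ok.get("internal", False):
        return "exchange server"
    if not path_ok.get("site_to_site", False):
        return "site links"
    if not path_ok.get("mobile", False):
        return "mobile sync"
    if not path_ok.get("inbound_external", False):
        return "smtp gateway"
    return "no fault found"

# What the callers were reporting that morning.
observed = {
    "internal": True,
    "site_to_site": True,
    "mobile": True,
    "inbound_external": False,
}
print(locate_fault(observed))
```

It is a crude elimination, but it is the same one most of us do in our heads before we even reach the server room.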

Our e-mail processing configuration

We run our own Exchange Server. We are still on 2003 Enterprise edition, which has proven to be a reliable platform. We process our incoming e-mail through two filters before sending it to the Exchange server. It first goes through Symantec Mail Security for SMTP. There we check for viruses and filter out all the bogus bounce messages, or Non-Delivery Receipts, from the spammers that have hijacked our e-mail addresses.

We next process the e-mail through our Commtouch anti-spam filter. Part of the engine and queues are on our gateway server. We send the e-mail out for filtering to the Commtouch regional processing centers. The spam is quarantined in case we have false positives. Only the good stuff gets through to the Exchange server. There it goes through yet another virus scan before it is ever delivered to the user mailboxes for retrieval.
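
To make the order of the stages concrete, here is a minimal sketch of the chain as a list of checks. It is not how Symantec, Commtouch or Exchange are actually implemented. The Message fields, the stage functions and the "rejected" verdicts are assumptions made up for illustration. Only the order of the stages and the fact that spam is quarantined come from our setup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Message:
    sender: str
    subject: str
    has_virus: bool = False      # hypothetical flag set by a scanner
    is_bogus_ndr: bool = False   # fake bounce caused by a forged sender
    is_spam: bool = False

# A stage returns None to let the message pass, or a verdict that stops it.
Stage = Callable[[Message], Optional[str]]

def gateway_virus_and_ndr_filter(msg: Message) -> Optional[str]:
    if msg.has_virus:
        return "rejected at gateway: virus"
    if msg.is_bogus_ndr:
        return "rejected at gateway: bogus NDR"
    return None

def antispam_filter(msg: Message) -> Optional[str]:
    # Spam is held rather than deleted, in case of a false positive.
    return "quarantined: spam" if msg.is_spam else None

def exchange_virus_scan(msg: Message) -> Optional[str]:
    return "rejected at Exchange: virus" if msg.has_virus else None

PIPELINE: List[Stage] = [
    gateway_virus_and_ndr_filter,
    antispam_filter,
    exchange_virus_scan,
]

def process(msg: Message) -> str:
    """Run a message through every stage in order; stop at the first verdict."""
    for stage in PIPELINE:
        verdict = stage(msg)
        if verdict is not None:
            return verdict
    return "delivered"

print(process(Message("client@example.com", "Invoice")))
print(process(Message("spammer@example.com", "Cheap pills", is_spam=True)))
```

The thing to notice is that every stage sits in the path of every inbound message. If the first stage dies, nothing from the outside reaches the stages behind it.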

The failure and resolution

E-mail processing is obviously a complicated process. There are a lot of components that can possibly fail. When I arrived at the office I immediately began looking for clues as to what part of the SMTP service was broken. It couldn't have been more obvious. When I logged on to the gateway server, several messages popped up indicating that the SMS filter-hub service had terminated unexpectedly at least fifteen times.

A manual restart produced the same results. The service would run for a minute and then fail. I suspected that a piece of spam had defeated the engine. A look at the queues showed some malformed e-mail. Clearing the queue and restarting did no good. Something was very wrong with the engine. A check of the Symantec web site revealed that a new patch had been released. We quickly downloaded and installed it, then restarted the service.
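
A service that dies every minute or so is in a crash loop, and restarting it by hand again and again only wastes time. As a hedged sketch, this is one way to spot that pattern from a list of failure times. The threshold, the window and the sample timestamps are arbitrary values chosen for illustration, and the function does not read any real Symantec or Windows log.

```python
from datetime import datetime, timedelta
from typing import List

def is_crash_looping(
    failures: List[datetime],
    threshold: int = 3,
    window: timedelta = timedelta(minutes=10),
) -> bool:
    """Return True if `threshold` or more failures fall within one `window`."""
    ordered = sorted(failures)
    # Slide over each run of `threshold` consecutive failures and
    # check whether the first and last are close enough together.
    for i in range(len(ordered) - threshold + 1):
        if ordered[i + threshold - 1] - ordered[i] <= window:
            return True
    return False

# Hypothetical timestamps: fifteen failures, one per minute.
start = datetime(2000, 1, 1, 8, 0)
every_minute = [start + timedelta(minutes=n) for n in range(15)]
print(is_crash_looping(every_minute))

# Two failures an hour apart are not a loop.
print(is_crash_looping([start, start + timedelta(hours=1)]))
```

Once you know you are in a loop, stop restarting and start asking what changed, which in our case led to the vendor's patch page.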

The human side of the emergency

Success! The whole problem analysis and resolution process took about 45 minutes. The majority of the time was spent downloading and installing the patch. It took forever to stop the filter-hub service. All the while I was trying to do my job, I kept my junior associate at the door fending off the anxious employees. He would also occasionally go out to the various departments and provide an update to keep them informed.

I don't know how critical e-mail delivery is in your organization, but in our business it is the lifeline of just about everything we do. So much depends on our e-mail system functioning properly. We could function without our accounting system for a day, but it is possible that somebody could lose their job if the e-mail system were to be out for more than a few hours. People tend to get real nasty when they can't get e-mail.

Communicate during the emergency

I'm a professional. Years of experience in technology problem solving have allowed me to handle even the most stressful circumstances like this with focus and action that get results. That's why they pay me the big bucks. OK, I'm bragging. The point of this post is to illustrate something that I hope you didn't miss. I made sure that another member of my team was actively communicating with management every step of the way.

Most business owners and employees don't understand technology. In fact, many of them fear it. I know, that's hard to believe, but it's true. When things don't work, they tend to panic. Perhaps you remember the feeling from the first time you had a system failure and didn't know what to do. If you keep a running dialog going with those who are affected, you will find that the emergency is much less stressful for everyone.