
When the e-mail system fails

System failures happen. Things break. The job of the tech is to fix the problem. That's what we get paid to do. However, unless you manage the human element of the repair process, you risk alienating your co-workers and giving management a poor impression of your professional abilities. This example shows that communication during the emergency is a critical part of keeping the respect of your peers and the trust of the business managers.

The calls started coming in as I was on my way to the office. Incoming e-mail from the outside was no longer getting through. Internal e-mail and site-to-site e-mail were OK. Mobile users were receiving ActiveSync and BlackBerry messages and my Treo with Goodlink was fine. The problem was somewhere on our SMTP gateway. An e-mail outage is serious business. It is immediately escalated to a full-blown emergency.

Our e-mail processing configuration

We run our own Exchange Server. We are still on the 2003 Enterprise edition, which has proven to be a reliable platform. We process our incoming e-mail through two filters before sending it to the Exchange server. It first goes through Symantec Mail Security for SMTP. There we check for viruses and filter out all the bogus bounce messages, or Non-Delivery Receipts, from the spammers that have hijacked our e-mail addresses.

We next process the e-mail through our Commtouch anti-spam filter. Part of the engine and queues are on our gateway server. We send the e-mail out for filtering to the Commtouch regional processing centers. The spam is quarantined in case we have false positives. Only the good stuff gets through to the Exchange server. There it goes through yet another virus scan before it is ever delivered to the user mailboxes for retrieval.
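
The flow described above can be sketched in a few lines of Python. This is only an illustrative model, not how the Symantec or Commtouch products work internally. The Message type, the flag labels, the route function and the gateway_up switch are all hypothetical names. Two assumptions are inferred from this post rather than stated in it: internal mail bypasses the gateway, and outside mail simply waits while the gateway filter is down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for an e-mail; flags mark what a scanner would find."""
    recipient: str
    external: bool                   # True if it arrived from outside the company
    flags: frozenset = frozenset()   # e.g. {"virus"}, {"spam"}, {"bogus_ndr"}


def route(msg: Message, gateway_up: bool = True) -> str:
    """Return 'delivered', 'dropped', 'quarantined', or 'waiting'."""
    if msg.external:
        # Outside mail must cross the SMTP gateway first. Assumption: if the
        # gateway filter is down, the message waits until the filter is repaired.
        if not gateway_up:
            return "waiting"
        # Stage 1: gateway virus check and bogus-NDR filter.
        if msg.flags & {"virus", "bogus_ndr"}:
            return "dropped"
        # Stage 2: anti-spam filter; spam is quarantined in case of false positives.
        if "spam" in msg.flags:
            return "quarantined"
    # Final stage: virus scan on the mail server before mailbox delivery.
    # Assumption: internal mail is scanned here too.
    if "virus" in msg.flags:
        return "dropped"
    return "delivered"
```

With gateway_up=False, route reports outside mail as waiting while internal mail is still delivered, which matches the symptoms on the morning of the outage.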

The failure and resolution

E-mail processing is obviously a complicated process. There are a lot of components that can possibly fail. When I arrived at the office I immediately began looking for clues as to what part of the SMTP service was broken. It couldn't have been more obvious. When I logged on to the gateway server, several messages popped up indicating that the SMS filter-hub service had terminated unexpectedly at least fifteen times.

A manual restart produced the same results. It would run for a minute and then fail. I suspected that a piece of spam had defeated the engine. A look at the queues showed some malformed e-mail in the queue. It did no good to clear the queue and restart. Something was very wrong with the engine. A check of the Symantec web site revealed that a new patch had been released. We quickly downloaded and installed it, then restarted the service.

The human side of the emergency

Success! The whole problem analysis and resolution process took about 45 minutes. The majority of the time was spent in downloading and installing the patch. It took forever to stop the filter-hub service. All the while I was trying to do my job, I kept my junior associate at the door fending off the anxious employees. He would also occasionally go out to the various departments and provide an update to keep them informed.

I don't know how critical e-mail delivery is in your organization but in our business, it is the life-line of just about everything we do. So much depends on our e-mail system functioning properly. We could function without our accounting system for a day but it is possible that somebody could lose their job if the e-mail system were to be out for more than a few hours. People tend to get real nasty when they can't get e-mail.

Communicate during the emergency

I'm a professional. Years of experience in technology problem solving have allowed me to handle the most stressful of circumstances like this with focus and action that gets results. That's why they pay me the big bucks. OK, I'm bragging. The point of this post is to illustrate something that I hope you didn't miss. I made sure that another member of my team was actively communicating to management every step of the way.

Most business owners and employees don't understand technology. In fact, many of them fear it. I know, that's hard to believe, but it's true. When things don't work they tend to panic. Perhaps you remember the feeling from the first time you had a system failure and didn't know what to do. If you keep a running dialog going with those who are affected, you will find that the emergency is much less stressful for everyone.

27 comments
vidyadhish_d

Yes, this is a post which truly states what happens when an email system, today's major form of communication, goes down, with non-IT managers eating the brains of the IT manager, who in turn expects the IT technician to know everything a weird MS software throws at it daily. I salute Tim for highlighting that communicating with management is one of the biggest aspects, even though you are just a tech guy! By the way, awesome config for the email setup.

Meesha

In the twelve plus years that I have been in this organization, we've only had two "outages" and both were due to electrical grid failures. We use a BES server (BlackBerry) as well as Domino on Linux servers, and other than some upgrades or servicing our uptime is beyond satisfaction. Our users don't even know what email "downtime" is. I would suggest to you that the choice in technology has a great deal to do with "uptime" expectations. I have not experienced the same level of reliability in one of our recently acquired companies that has Exchange server for its email. They're down so often, due to so many different variables, that corporately we have decided to migrate them off Exchange. Domino can, by the way, use the Outlook GUI instead of NOTES, so this would solve any "user resistance". They'll get the best of Domino transparently while still using the tool they know. Think about it: the Domino solution is stable, scalable, reliable, secure and very cost effective. It seems to me that your configuration misses the mark on all these critical issues.

Photogenic Memory

You can log in and check the account under /var/spool/mail if that's how your config is, right? Then you can vi the person's file and "dd" the bad header? That can be done in less than 10 minutes if you know what to look for in a Linux/Unix based email platform. Can this be done under Exchange? Also don't forget that many times DNS can be a problem as well. Try doing a telnet session to a person's domain like this:

    telnet mail.oldguys.com 25
    Trying mail.oldguys.com...
    Connected to localhost.localdomain (127.0.0.1).
    Escape character is '^]'.
    220 localhost.localdomain ESMTP Sendmail 8.13.1/8.13.1; Mon, 7 Apr 2008 16:58:02 -0700
    ehlo hotties.net
    250-localhost.localdomain Hello Plutonium [127.0.0.1], pleased to meet you
    250-ENHANCEDSTATUSCODES
    250-PIPELINING
    250-8BITMIME
    250-SIZE
    250-DSN
    250-ETRN
    250-AUTH DIGEST-MD5 CRAM-MD5
    250-DELIVERBY
    250 HELP
    mail from: someoldguy@oldguys.com
    250 2.1.0 someoldguy@oldguys.com... Sender ok
    rcpt to: hotchicks@hotties.net
    250 2.1.5 hotchicks@hotties.net... Recipient ok
    data
    354 Enter mail, end with "." on a line by itself
    Hello hotchicks! I'm really, really old, and rich too!
    .
    250 2.0.0 m37Nw2VN005186 Message accepted for delivery
    noop
    250 2.0.0 OK
    quit
    221 2.0.0 localhost.localdomain closing connection
    Connection closed by foreign host.

A telnet test is a good way to prove that the accounts can receive email and that the mechanism is working. I guess from this point, if people from outside can't send you email then it may point to DNS. From here check to see if your DNS is resolving with "dig" or "nslookup". If it is, then check to see if it's not being blocked by a firewall or an overly aggressive listing (white/black/grey, whatever). Sometimes there may be nothing you can do if the network problem isn't on your end and resides in someone else's server.
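
The manual telnet check above can also be scripted. The following is a minimal sketch using Python's standard smtplib module. The function name probe_smtp, the default HELO name and the smtp_factory parameter (there only so the function can be exercised without a live server) are illustrative assumptions. Unlike the telnet session, this sketch stops after EHLO and NOOP and does not submit a test message.

```python
import smtplib


def probe_smtp(host, port=25, helo_name="probe.example.invalid",
               timeout=10.0, smtp_factory=smtplib.SMTP):
    """Connect, send EHLO and NOOP, and report whether the server answered.

    Returns (ok, detail). No message is sent.
    """
    try:
        # smtplib.SMTP is a context manager; leaving the block sends QUIT.
        with smtp_factory(host, port, timeout=timeout) as smtp:
            code, _banner = smtp.ehlo(helo_name)
            if not 200 <= code < 300:
                return False, f"EHLO rejected with code {code}"
            noop_code, _reply = smtp.noop()
            return noop_code == 250, f"EHLO {code}, NOOP {noop_code}"
    except (OSError, smtplib.SMTPException) as exc:
        # Covers DNS failures, refused connections, timeouts and protocol errors.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling probe_smtp("mail.example.com") from a machine outside your own network is the closest equivalent of the telnet test, since it follows the same path an outside sender would take. The hostname here is a placeholder.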

jubernal

I had this problem recently. I am thinking about moving the service to outsourcing.

zloeber

As technical professionals I don't think many of us realize just how well we do our jobs. The fact that within the span of only 10 years we have, as a whole, built up a global Internet that generally always stays up speaks to this matter. Worker expectations are so high for e-mail and other IT services because our expectations are so high. Face it, when an e-mail server goes down it is not just down for the employees but down for yourself as well. So what do we do? We work through all the Exchange 5.5 issues and voice our opinion of them online or directly to MS to get things like a recovery information store in the next version, or we team up virtual machines for redundancy. (Or run a properly configured sendmail/postfix/exim configuration!) Sure, in many ways the end-user drives us to do these things so we can enjoy other projects and not have to put out fires all day long. But their reaction to an outage is only as proportional as their expectations for up time. As we strive to keep things up for 99.999% of the time, people have come to rely on this as the standard and not the exception. It sounds odd, but when people go zonkers over tech glitches that interfere with their work day it makes them realize just how much they rely on technology that they don't understand, and it scares them. I sometimes almost want to take down our Exchange server to recover a corrupt information store in the middle of the day so as to raise awareness of the importance of IT (rather than doing it at midnight on a Saturday to avoid much fanfare). Information is the drug of our generation. Disconnect a person from that steady flow of data from TV, e-mail, the Internet, and countless other mechanisms and people start to panic. Just look at the "Great Blackberry Outage" as an example of this fact.

tim uk

Agreed, it's probably the most critical application for most users. If you were recruiting an email support (MS Exchange) specialist, what skills and qualifications would you look for to back up their specialist status?

mike

Fact 1: The U.S. and world economy functioned perfectly well for many years without e-mail. Fact 2: A company culture that would cause the technical staff to believe they would lose their jobs over a few hours' outage is psychotic and abusive. Short of blatant negligence on the part of the technical staff, or the staff's failure to actively engage and remain engaged in solving the problem, the technical staff need not worry about losing their jobs. Why? Because of Fact 3, and back to Fact 1: There are fax, phone, personal or alternate e-mail accounts accessed through cell phone, etc. Fact 4: Nasty people who think they know everything can always be offered the opportunity to fix the problem while the technical staff watches them make fools of themselves. Fact 5: There are people who expect computers and network infrastructure to be as easy to operate as a toaster. It never has been and never will be. Thus, the need for technical staff.

The Listed 'G MAN'

"A check of the Symantec web site reveals that a new patch had been released" So why did it fail in the first place? Because a patch was released???? The same system that was working the day before without this patch? Do you keep up to date with patches or just go there when something fails? Glad you resolved the problem - I know what you mean regarding e-mail outage & users!

tmalonemcse

It can be difficult to fix a problem with an anxious customer or manager hovering over you. I have developed a team strategy that works for me in emergency situations. Read the post for more details: http://blogs.techrepublic.com.com/techofalltrades/?p=137 A sign of a professional is how well you can communicate with the client during an emergency. A successful outcome may depend just as much on how well you manage the customer as on how quickly you can solve the problem. We all have our horror stories of clients or co-workers that came unglued when their favorite technology broke. Have you got an interesting one to share? Join the discussion.

tmalonemcse

Hi Meesha, I used to manage a shop where we ran a Lotus Domino server. It seems to be an East-Coast thing. Our parent company was based back East and switched us over to Domino. Our experience with uptime was comparable but it seemed to require much more maintenance than Exchange. Don't get me wrong about Exchange Server. I much prefer it to Domino Notes. In three years we have only had two outages. The first was due to exceeding the built-in 16GB limitation of the Standard edition. We soon upgraded to the Enterprise edition. Just to be clear - the problem was not with Exchange Server. It was with the Symantec SMTP filter on our gateway server. I am not nearly as happy with Symantec these days as I used to be but that's a different story. We also run a BlackBerry server as well as a Goodlink Server and ActiveSync on the Exchange Server. The downtime I was referring to in the post was only a momentary block of outside email. The Exchange Server was never down. All the email came through after the SMTP filter was repaired. Are you in sales by any chance?

jmarkovic32

Nothing wrong with that, but I prefer to be vendor agnostic. I use the best tools for the job. We use Exchange because it integrates well with our Microsoft environment. I've known large shops that have clustered Exchange environments and have virtually no downtime. It all depends on implementation and following best practices. Some places can't afford to follow best practices.

tmalonemcse

Hi Adam, I used to run Linux at a previous employer so I'm familiar with what you are referring to in your comment. However, in this case, the malformed emails never reached our email server. They were being held up in our SMTP filter - a third party front-end product from Symantec that runs on a gateway server. The Symantec product has a great GUI that allows the administrator to see what is in the queue. I was able to delete the bogus emails but they just kept coming and overwhelming the filter-hub component. Once we patched the filter-hub, all was well. I have used the telnet SMTP test many times. I got a kick out of your oldguys.com and hotties.net scenario. Very funny. Thanks. But again, this was not really an SMTP or a DNS issue. It was a filter issue. And yes, I like the way the Microsoft OS informed me of the problem immediately upon login. Will your Linux system do that? Probably can.

billbohlen@hallmarkchannl

Most Exchange implementations are more than just MS Exchange and Windows Server. A deep understanding of the network protocols involved in mail flow (SMTP, DNS, TCP/IP, etc.) is required. Mail into an organization usually flows through one or more Gateways, which are designed to filter the "noise" (spam, malware, profanity, etc.) and only deliver valid messages to the Exchange servers. These can be single multi-purpose systems, or separate systems from separate vendors. They can be servers or appliances, or they can even be outsourced completely, like Postini. Usually there is at least one (most likely two or more from different vendors) Anti-virus/Anti-spam software packages installed that are specific to e-mail. One is usually part of the Gateway, the other installed on the Exchange server as a final check before it is delivered. There are also one or more mobile messaging systems that interface with Exchange Server, such as BES (Blackberry Enterprise Server), and MCS (Microsoft Communications Server/ActiveSync). These systems are designed to sync mail with handheld devices. Finally, they also need to understand the front-end clients...whether it is the full Outlook client, Outlook Express, Outlook Web Access, and handheld devices such as Blackberry.

jmarkovic32

Always have management set up an external email address and then link it to Outlook. Our CEO uses his Yahoo email often so during the outage, he didn't miss a beat. I work for a non-profit where our IT mantra is "if it ain't broke, don't fix it even if it's about to break".

wdperry

Outage period basically equates to lunch break time. Is this business in a real-time, critical, life-threatening operation or just another example of heightened employee expectations? Have been observing computer / employee interactions since 1961.

tmalonemcse

Hi G-Man, Good questions. Thanks for reading the post. I wasn't very clear about that, was I? And thanks for pointing it out. No, I don't ordinarily apply every patch to my Symantec products. You can break things that way. I only apply patches that I feel are needed. No, the product didn't fail because a patch was released. We just started using the filter-hub last week to block bogus NDRs. I suspect that others had experienced this particular problem with the filter-hub. You can read more about our problems with spam, and particularly bogus NDR messages, in these two posts: http://blogs.techrepublic.com.com/techofalltrades/?p=134 and http://blogs.techrepublic.com.com/techofalltrades/?p=135

From the Symantec release notes for this particular patch:

    20838
    Component: Filter Hub
    Platforms: All
    Formerly, messages with recipients addresses that did not contain a domain were causing issues with the Filter Hub. This issue has been resolved.

Source: http://service1.symantec.com/SUPPORT/ent-gate.nsf/docid/2006102014592163

When I saw the error message on my server about filter-hub failure and this note about filter-hub problems being resolved by the patch, I figured it was worth a try. In my experience, every time we have called Symantec for tech support, they first ask if the product is up to current patch levels. So we patched before calling. It worked. I like the way Symantec describes the failure of their product as "issues." I guess they had not anticipated having to scan hundreds of bogus e-mails with malformed addresses. We have been getting hit with thousands of these pesky messages in the past week. This particular emergency was resolved in record time. If it had gone on any longer I can imagine the knives would be coming out. Happy and contented co-workers look awful when they are deprived of their e-mail fix.
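
The release note quoted above blames recipient addresses that lack a domain. As a rough illustration of what such a check might look like, here is a minimal Python sketch. It is not how the Filter Hub works internally, and the function names and the queue layout (a mapping of message IDs to recipient lists) are hypothetical.

```python
from typing import Dict, List


def recipient_has_domain(address: str) -> bool:
    """True if the address has a non-empty local part and a non-empty domain."""
    cleaned = address.strip().strip("<>")
    # rpartition splits on the last "@"; sep is empty when there is no "@" at all.
    local, sep, domain = cleaned.rpartition("@")
    return bool(sep and local and domain)


def suspect_messages(queue: Dict[str, List[str]]) -> List[str]:
    """Return the IDs of queued messages with any recipient lacking a domain."""
    return [
        message_id
        for message_id, recipients in queue.items()
        if any(not recipient_has_domain(rcpt) for rcpt in recipients)
    ]
```

A check like this only flags the one malformation named in the release note. It makes no attempt at full address validation.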

alex.a

I recently retired, but at my former firm I once arrived at the office only to be greeted by one of my staff who said that a particular SQL server had crashed. The server was one of three that hosted the various databases in our document management program. The server had indeed crashed with a hard drive failure -- "Operating system not found" error upon reboot. I loaded the latest dump files of the databases hosted by that server onto the remaining two servers, made the necessary configuration changes to SQL tables and to the DM program's .INI file, distributed the new .INI file to all workstations, and had the firm back in business within 45 minutes. I had planned to spend the remainder of the morning rebuilding the failed server and reloading the databases it hosted. WRONG! The assistant MIS director who managed the help desk, in conference with her staff, determined that the server had not failed after all, that the problem lay elsewhere, and that I had responded inappropriately to the crisis. So I had to spend several hours carefully and patiently drafting a memo to her explaining that a crash had indeed occurred and that my response was the best possible course of action under the circumstances. I concluded my memo with the sentence "I have neither the time nor the inclination to elaborate further." That remark passed into the lore of my department and is still used by my former staff even after I have been retired for almost a year now.

Meesha

However, we currently use Sophos and, as with all third party bolt-on products, they must be rigorously tested. We used Symantec previously but found that it just wasn't up to the task. SMTP at the best of times can be tricky so the best solution is to ensure you have the most qualified staff on your mission critical systems. Certification does not always translate to expertise. A seasoned certified professional is worth every penny. Our Domino Administrators (2) are certified and seasoned and have been truly invaluable to ensuring that the systems are always "up" and providing administration efficiencies. Our former Exchange administrators (3 1/2) were much the same. However, they were faced with far more challenges in keeping the systems "up", finding those same types of process efficiencies and keeping the costs down. Empirically, we did our analysis side by side and in this case, Domino was better suited, matched and won. Is this meant as a blanket for all IT shops? No, just proof that not everything is MS centric.

Meesha

Just know a good thing when I have it. 1) We are not a 100% MS shop. 2) MS doesn't like to play well with others, with the exception of Dot Net, which helps with some of the issues. 3) Empirical evidence to date has supported migration from Exchange to Domino. 4) Costs related to the MS "stuff" (and especially clustering) are clearly far higher in both acquisition and support/maintenance, i.e. servers need to be brought down to service - patches, fixes, security, etc. So, if the above means I'm an open source zealot, I'm wondering why that is a bad thing?

Photogenic Memory

Yeah, I think you can configure sendmail, postfix, qmail, or other open source mail servers to give up the answers when it comes to problems through monitoring software. I think most people try to have the server ports monitored through Nagios or other progs to see if it is struggling. You gotta love the phone calls though. You gotta love angry Sales people. Quite motivating!!! LOL! Thanks for the update. Glad that you found my humor in the situation. As far as the email filter that hosed everything, it's just another additive to the mechanism. You're lucky you didn't have to start from the bottom of the troubleshooting process (aggravating!) and find out sometime later it was the filter. Good old Windows for you! You gotta love MS OS's and apps that complain. With open source, you can end up in the dark until someone comes by and shines the light. It's sad but I still keep going back for more torture/enlightenment (I think), hahaha!
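
For readers who have not set up port monitoring, the idea mentioned above can be sketched in a few lines of Python. This is a simplified illustration and not an actual Nagios plugin. The function name check_tcp_port is hypothetical. The numeric status values follow the Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```python
import socket

# Nagios plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
OK, CRITICAL = 0, 2


def check_tcp_port(host, port, timeout=5.0):
    """Try to open a TCP connection and return (status, message)."""
    try:
        # create_connection resolves the host and connects within the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepted a TCP connection"
    except OSError as exc:
        # Refused connections, timeouts and name-resolution errors land here.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
```

Note that an open port 25 only shows that something is listening. In an outage like the one in the post, where a filter component behind the listener keeps crashing, a port check alone might still report OK, so it is worth pairing with a check on the service itself.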

tmalonemcse

In fact, you almost described my environment to the T. Thanks for providing that overview. Well done and much appreciated. It adds a lot to the original post.

tim uk

Most appreciated, that's very useful indeed.

WiseITOne

Let's face it, before it was the phone, or snail mail. Now we are content with non face-to-face interaction. Even IT relies on email to communicate. I just came into my new job and about a month ago the mail server crashed at our corporate office. I couldn't do much to help other than get some feedback from the corporate office...I had to call them to find out when it would be resolved and get updates. It would have been nice had they notified the remote offices. It took about an hour or two to resolve the issue.

tmalonemcse

I have been fighting IM for about as long as I can now. Over half the company is asking for it and the boss still says no. He is concerned about IM spam & viruses, lack of a trail for important communications (employee to client or employee to vendor) and general abuse & misuse. Have you looked at several alternatives and decided on an IM solution? If so, is it something inside that you can track or just using outside services?

billbohlen@hallmarkchannl

We have the same problem - e-mail and Blackberry are the most business-critical applications in our company. We are trying to implement better ways for our users to collaborate. We are putting enterprise IM in place to get rid of the "one sentence" e-mail conversations clogging our servers. We are also implementing Sharepoint portals so that departments can communicate and share without using e-mail.

HoagieBP

You are too right. When I first started with this company there was one e-mail address for all 24 employees. No one could see the need for e-mail. One of my tasks was to determine which fellow employee an incoming message was directed to, print it out (thereby negating one of the primary reasons for 'electronic' mail) and sneaker net it to them. Times have changed. Now, if Exchange hiccups the phones light up. We often hear the dreaded, yet highly technical phrase, "e-mail seems slower today", and thunder crashes should we announce a need to interrupt e-mail for even a quick second during the business day. My all time favorite management response to an e-mail issue came once when Exchange 5.5 locked up unexpectedly during the day. "Okay, if you need to reboot it do so. Just be sure to send out an e-mail first."
