I recently visited a
client site to perform some virtualization consulting work. My client relies on
in-house Exchange 2010 servers which are configured in a (supposedly)
fault-tolerant database availability group (DAG) across their two sites: the
primary and the secondary disaster recovery (DR) location. During my stay I got
a front row seat to some serious issues which occurred during a scheduled
Exchange switchover.

The best laid plans

The primary site was due to undergo an eight-hour power shutdown for electrical maintenance that weekend. In anticipation of this, my client opted to test their ability to “flip” email processing to the DR site so email services would continue to run during the outage. This process, known as a “datacenter switchover,” had been tested before. It involved taking the primary Exchange mailbox servers out of the DAG, then activating the mailbox and Client Access Server (CAS) systems in the DR site.
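
For readers unfamiliar with the mechanics, an Exchange 2010 datacenter switchover boils down to a short sequence of Exchange Management Shell commands. The sketch below is illustrative only: the DAG, site, and server names (DAG1, PrimarySite, DRSite, DR-MBX1) are invented, and a real runbook would also cover the CAS namespace and DNS changes.

    # Mark the primary site's DAG members as stopped
    # (add -ConfigurationOnly if those servers can no longer be contacted)
    Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite "PrimarySite"

    # Stop the cluster service on the surviving DR members so the
    # cluster can be re-formed around them
    Stop-Service clussvc

    # Activate the DAG using only the DR site's servers
    Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite "DRSite"

    # Confirm the database copies in the DR site are mounted and healthy
    Get-MailboxDatabaseCopyStatus -Server DR-MBX1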

The switchover was expected to take 30 minutes of work, and users had been informed to expect a 3-5 minute window of email downtime. The system administrator, whom I’ll call Rick, started the process during the lunch hour to minimize the impact on users. While this type of work seems better suited to late at night, this company’s policy was to perform such activities during the day, while staff are present, so any issues can be addressed as quickly as possible. As it turned out, issues appeared immediately.

The “Uh-Oh” moment

Rick kicked off the
switchover by running an Exchange PowerShell command to take his primary
Exchange servers out of the DAG. He then ran a command to activate the Exchange
mailbox server in his DR site – and immediately hit a brick wall. Exchange was
happy enough to pull out the primary Exchange servers but refused to bring up
the DR Exchange server. Errors indicated the mailbox databases couldn’t be
mounted on the DR server since they were supposedly already mounted. After
trying a few more commands, Rick found his DAG completely stuck in the mud. He couldn’t bring up the Exchange databases in either site, nor could he re-add his primary Exchange servers to the DAG, since he received errors that they were already part of a cluster. Email was dead in the water, and almost an hour evaporated in frantic event log reviews and Google searches.

Rick did the smart thing
and contacted Microsoft support. They resolved the issue – sort of – by
forcibly mounting the Exchange databases in the DR site and using the Failover
Cluster Manager application to get the primary site Exchange servers back into
the DAG. After three hours of downtime that got email up and running – or did
it?
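
Rick didn’t have a transcript of the support session, but the recovery he described maps roughly onto standard cmdlets like these; the database and server names are placeholders, and the cluster cleanup itself was done through the Failover Cluster Manager GUI rather than the shell.

    # Force a database copy in the DR site to mount even though Exchange
    # believes the database is mounted elsewhere
    Mount-Database -Identity "MailDB01" -Force

    # Once the stale cluster membership is cleaned up, re-incorporate a
    # primary-site member into the DAG
    Start-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer EX-MBX1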

The post-mortem

I followed up with Rick
later to ask how things went. He was understandably frazzled and not a little
cynical about the experience.

“We found out we
couldn’t run email in our DR site that night,” he told me. “We tested
this by powering down the Exchange servers in the primary site, and nobody
could open Outlook – even though our mailbox and CAS systems were supposedly
working fine in the DR site. That problem lasted until we powered up the
primary site servers again.”

“So, what happened
next?” I asked.

“Well, we still had
the scheduled power shutdown to deal with, and we HAD to make sure email worked
in the DR site or the company would suffer. We brought in dedicated Exchange 2010
consultants to try to figure out what went wrong with the DAG. It seemed to be okay, but we didn’t find any ‘smoking gun’ that explained why the initial switchover bombed – even after combing through endless event logs and reviewing the Exchange Best Practices Analyzer output. Our best guess is there was a synchronization problem among the mailbox databases which confused the DAG, but the databases seemed okay again too. So, we were down to two choices: try that
same datacenter switchover process over again or come up with something new.”

I could tell by his tone
that they had gone with the latter option. “Why were you reluctant to
follow the same process?” I asked.

“Because we had no
way of knowing it would work,” Rick said. “I already took enough heat
for the three hours of downtime – without being sure that the problem was fixed
I flat-out refused to go down that road again. We did find out why Exchange failed
when our primary site servers went down. We have three servers in our primary
site and two in our DR site. The DAG uses a concept called ‘quorum’ whereby
each server has a vote as to where services are live if something bad happens –
and the majority of votes in either site makes the call. Since we shut down the
site with the majority of votes the DAG was DOA.”
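
To put numbers on that: a five-member DAG needs a majority of three votes to keep running. The primary site held three of the five votes, so powering it down left only two, and the cluster stopped. A quick way to check the membership from the Exchange Management Shell (DAG1 is a stand-in name):

    # Five members total: quorum requires a majority, i.e. three votes.
    # Primary site = 3 votes, DR site = 2 votes, so losing the primary site
    # takes the whole DAG down with it.
    Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
        Format-List Name,Servers,OperationalServers,WitnessServer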

“Sounds needlessly
complex,” I observed.

“It gets more so. Instead
of doing the datacenter switchover our consultants recommended we install a
dummy Exchange mailbox server in our DR site so we could successfully activate
our mailbox databases there.”

“On the dummy
mailbox server?” I asked.

“No, on the actual production
mailbox server, but we needed the additional mailbox system so Exchange could establish a ‘quorum’ of three servers in the DR site and allow email to work. I
was skeptical – extremely skeptical – that this would succeed, since it seems
to me after 10+ years of supporting Exchange that it waits for any available
opportunity to break down and stab you in the back.”
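
Rick didn’t share the consultants’ exact commands, but mechanically that kind of change is small; something along these lines would add the extra member and shift the active databases to the DR site (again, the names are invented):

    # Join the additional ("dummy") mailbox server to the existing DAG
    Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer DR-MBX3

    # Move the active copy of each database to a DR-site member
    Move-ActiveMailboxDatabase -Identity "MailDB01" -ActivateOnServer DR-MBX1

    # Verify the copy is mounted where expected
    Get-MailboxDatabaseCopyStatus "MailDB01\DR-MBX1"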

Let’s analyze that for a
second. Rick had gotten so disenchanted with Exchange that he doubted whether
documented procedures and the Exchange software itself would work as expected. If
a system administrator can’t rely on his own systems I’d say that represents a
significant crisis of faith.

“And did that work?”
I asked.

“Happily, yes. We
got through the power outage after activating our mailbox databases in the DR
site. As far as I’m concerned, though, it’s time to look at other solutions. Weirdly enough, just the week before, I had rejected talk about moving email to a hosting service since I wanted to keep it in-house. Then this happened.”

“Seems like it all
worked out in the end,” I said.

“Well, after three
hours of email downtime and having to pay outside consultants, sure,” Rick
replied. “However, look how much we lost. People were sitting around
unable to read or reply to urgent customer messages. My group suffered a
reputation blow since we’d announced only five minutes of downtime. Furthermore,
the irony here is that we built an expensive, fault-tolerant Exchange environment that blew up the second time we tried to test a DR scenario!”

“I guess you have
to be a full-time Exchange guy to manage this stuff effectively,” I noted.

“That’s what it
boils down to. Oh, I probably could have found there was a problem before that
datacenter switchover if I’d been more careful – I admit that. But I’ve got a
bunch of other stuff going on – Citrix, monitoring, security, you name it. I
don’t have time to babysit Exchange. I could have been working on rolling out a
new app virtualization project, but instead put thirty hours into this – not to
mention all the ‘Outlook is slow’, ‘I can’t connect to email from outside the
company,’ ‘I lost my PST file’ stuff that has eaten up so much of my time. I’ve put up with about 48 hours of email downtime in my career – believe me, there’s nothing less fun. I’m thinking now it’s better to just move this out of the data center and be done with it. As far as I’m concerned there’s no future in
Exchange.”

“I suppose there is
for hosting providers,” I mentioned.

“I mean for my
career. I used to be leery of anything that seemed like a threat to IT staff. They
came out with outsourcing and we all thought we’d lose our jobs. They developed
virtualization and we all thought that would reduce IT headcount. Now there are hosted email systems, and if anything I think they would free me up to do the more meaningful things I just discussed. There is always going to be work for
an IT pro; the question is whether it’s worthwhile or not. Sure, there will
still be some measure of downtime no matter where your data and services are –
that goes with the territory – but the next time it happens I don’t want to be
the guy in the trenches when the dedicated experts ought to be there instead. My
sanity bank account is overdrawn.”

Rick also related other
concerns about the reliability of Active Directory and Windows Server 2008, and
expressed doubt that Microsoft has a clear understanding of how to produce forward-thinking software of genuine value.

Current statistics and future trends

I did some digging after
talking to Rick and found out some interesting statistics which vindicate his
perspective. Technology research organization Gartner predicts that by the end of 2014, “at least 10 percent of enterprise email will be based on a cloud or software-as-a-service model,” a figure it expects to rise to at least 33 percent by the end of 2017.

According to a whitepaper from Rackspace.com titled “The Case for Hosted Exchange,” the following (Figure A) shows the monthly TCO (Total Cost of Ownership) figures for on-premises versus hosted Exchange:

Figure A

Google Apps for Business pricing is even simpler (Figure B).

Figure B

Personally I prefer Google Apps hosted email solutions over Microsoft’s, not because I write for the “Google in the Enterprise” blog but because I agree with some of Rick’s concerns about the ongoing relevance of Microsoft and I find the Google pricing scheme more attractive.

In both cases there is
no server hardware, in-house software, data center cost or backup expense. Users
access their data from any location using an array of devices. A scheduled
power outage would have meant little to nothing if Rick’s company had hosted
email in place.

Making the call

Rick’s change in perspective was probably based on both emotion and logic, but I think he operated from a rational basis in both categories. Email maintenance is increasingly perceived as work which is more “custodial” and less “innovative.”
IT professionals are engaged in an ongoing evolution of bringing value to the
business. When it comes down to it, what improves a structure more – mopping
the floor or building an addition?

Hosted email isn’t a magic solution completely free of drawbacks. There are still significant concerns such as data migration, access configuration, user training, security, compliance, and SLAs to contend with. The pros and cons must be carefully weighed by all decision makers, whether in IT, Finance, HR, or other relevant departments. When properly planned and executed, however, it’s not just the data and services which get shifted out of the organization, but the headaches and distractions as well – and perhaps even the rekindling of a system administrator’s creative love for technology.

In this case the facts – and not marketing propaganda or website advertisements – convinced Rick of the reality of the situation at hand and of the next step in his approach to the future. It will be interesting to see how it pans out for him as well as for organizations elsewhere.