A weekend is not a weekend if you have to spend it in a vain attempt to reconstruct a set of Exchange databases that were corrupted when a power outage took the server down.
Life is particularly cruel when you’ve gone to the trouble of using an uninterruptible power supply (UPS) and have tape backups of the data that goes into the databases—only to have both of those safeguards fail you.
In this week’s From the Trenches column, we will follow Peter through an e-mail server catastrophe that taught him a few hard lessons about good intentions and limited time.
Get insights From the Trenches
You can learn quite a bit by reading about the methods other administrators and engineers use to resolve challenging technology issues. Our hope is that this column will provide you with unique solutions and valuable techniques that can help you become a better IT professional. If you have an experience that would be a good candidate for a future From the Trenches column, please e-mail us. All administrators and their companies remain anonymous in this column so that no sensitive company or network information is revealed.
The unexpected happens
You may be thinking that an experienced Exchange administrator like Peter, with many years of field experience running Exchange servers, would be able to foresee and prevent any kind of data loss from a catastrophic server failure. But being aware of the potential for loss and having the time and funding to prevent every possible disaster are two different things.
Peter took the precautions he thought were appropriate for his environment. He works for a Web-based business in the southern United States, relatively free of the potential for natural disasters that plague other areas of the country.
“Most places have some sort of natural occurrence to contend with,” Peter said. “Typically, in the U.S., Canada, and most of Europe, the infrastructure is built to withstand and respond to these.”
Peter felt that having a UPS to allow graceful Exchange shutdowns and making daily tape backups of his Exchange 5.5 databases and transaction logs would see him through any unexpected problems.
Then, one Friday, he found out how fragile his system was. At 9:51 A.M., the power went out unexpectedly in his company’s building. Through the darkened hallways, Peter went to the server room where he found the UPS was doing its thing. Before he could get to the server and initiate its shutdown sequence, the power came back on in the building.
The return of power was short-lived; after about a minute, the building went dark again. The UPS's 15-minute capacity should have given the Exchange server plenty of time to shut down, but the battery had already spent more than half of that capacity carrying the server through the first outage.
Depending on the number of open transactions and the version of the server software, it can take an Exchange server anywhere from five to 25 minutes to close all open transactions. This time, there wasn't enough charge left in Peter's UPS to see the server through a proper shutdown.
There was a chance that the Exchange server had finished its critical business before it lost power, and all would be okay when it was powered back up. Peter would have to wait another two hours to find out.
Trashed, hosed, and other accurate descriptions
When power was restored, Peter was not convinced that all would be well with his Exchange server. Microsoft has always been adamant about never pulling the plug on an Exchange server.
Because it might take some time to get the server up and running normally, the IT department began queuing incoming e-mail messages in the company’s Trend Virus Wall server. The department had done this before for routine maintenance on the Exchange server and during virus attacks.
The news was bad: the Exchange server would not start up the way it should have.
“CHKDSK /R on all disks is the first thing to do if there appears to be corruption when the machine doesn’t start properly,” Peter said.
When the server still would not start properly after CHKDSK repaired the disks, Peter checked that his Exchange database files were still on the machine. He looked in the Exchsrvr\mdbdata directory for priv.edb and pub.edb, and both files were there.
Peter’s next step was to try to recover his corrupted databases. To do this, he went to Microsoft’s Knowledge Base and found the detailed article XADM: How to Recover from Information Store Corruption.
Peter checked both databases using the commands described in the article to see whether they were in a consistent state and, thus, not corrupted. They were not consistent. He then ran a soft recovery to replay the transaction log files into the databases and followed it with an offline defragmentation.
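The Knowledge Base article walks through these steps with Exchange's eseutil utility. A rough sketch of the sequence, reconstructed from the general 5.5-era syntax rather than Peter's exact session (verify the switches against the article and your Exchange version before running anything):

```
rem Run from the Exchsrvr\bin directory on the Exchange 5.5 server.

rem 1. Dump each database header and look for "State: Consistent".
eseutil /mh ..\mdbdata\priv.edb
eseutil /mh ..\mdbdata\pub.edb

rem 2. Soft recovery: replay outstanding transaction logs into the stores.
eseutil /r /is

rem 3. Offline defragmentation of the private and public stores.
eseutil /d /ispriv
eseutil /d /ispub

rem 4. LAST RESORT - /p repairs by discarding damaged pages. It is
rem    destructive; Microsoft says to restore from backup instead if
rem    you possibly can.
rem eseutil /p /ispriv
```

The commented-out /p command is the "/p point" Peter refers to below: once you run it, you have given up on getting every transaction back.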
When the program reported that the databases were still inconsistent, he considered the next procedure in the article, which included a caution that the steps would be destructive. Microsoft warns that they should not be performed on a live Exchange server.
So before proceeding, he called Microsoft to find out what he should do next. Peter was told to restore his databases from his backup tapes.
“You hope you get your databases back up before you hit the /p point,” he said, referring to a command that begins a destructive recovery. But he hadn’t.
Just when you thought things couldn’t get worse
When you restore from a backup tape, you first bring back the databases as they existed at the last full backup and then apply every incremental backup taken between that full backup and the moment the service failed.
If this were successful, Peter would be missing only about two hours of data, or at worst, a morning’s worth of data, from his databases. This is where Peter begins to lose his weekend.
The method he used for performing backups was to have a full database backup performed on Sunday nights. From Monday through Thursday, his backup program would do partial, or incremental, backups.
On the Saturday after the disaster, Peter loaded up the previous Sunday’s tape and prepared to wait about six hours for the tape to restore the 34 gigs of data. But an odd thing happened.
The restore would get through somewhere between 13 and 20 gigs of data, and then the tape would fail. He tried several times, at three to four hours per attempt, to get the tape to read successfully, but it failed at a different location each time.
On Sunday, Peter returned to the server room to try to work around the bad backup tape. During a full backup, the program not only reproduces the entire database, but it continues to record the transaction logs that are being created during the process—or so he believed. During the incremental backups, only the transaction logs are backed up, but that’s okay because the logs contain all the information that goes into the database.
At this point, Peter thought that if he could go two weeks back, to the last good, full database backup, he could simply string together all the transaction logs between then and the power outage. It didn’t work.
While his backup program had been making a full backup of the database, it wasn’t recording the continuous stream of transaction logs being produced during the time the backup was occurring.
Without an unbroken sequence of transaction logs, Peter discovered, he could not recover anything past the last consecutive log. He had a gap of about 60 transaction logs, and that was enough to halt the process.
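Exchange names its transaction logs with sequential hexadecimal generation numbers, and log replay stops at the first missing generation. The gap Peter hit would have looked something like this (file names are illustrative, not taken from his server):

```
C:\exchsrvr\mdbdata> dir /b edb*.log
edb00a1c.log
edb00a1d.log
edb00a1e.log      <- last consecutive log; replay can go no further
edb00a5b.log      <- roughly 60 generations missing in between
edb00a5c.log
```

Every log after the gap is useless for recovery, no matter how intact it is, because each log depends on the state the previous one left behind.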
“Where we got hosed was the full backup being on a bad tape,” Peter said.
What could he do? He spent the rest of Sunday and the wee hours of Monday restoring his Exchange server from the last good full backup, two weeks old, and applying the incremental backups that followed it until he reached the point of the failed full backup, where the transaction-log gap forced him to stop. He then stopped queuing e-mail and let the 50,000 or so waiting messages flow into the Exchange server so that the company would be ready for business on Monday.
On Monday morning when the staff returned, their e-mail was working again, but they were missing 10 days of e-mail and other Outlook items created during that period.
Lessons learned and looking to the future
It was a bitter lesson for Peter to learn that his Exchange server wasn’t protected as well as he had thought. After so many years of Exchange administration and never losing anything, one power outage had embarrassed him. It won’t happen again, if he can prevent it.
He would like to have a backup power generator in case the power ever goes out for an extended period of time again. However, he doesn’t see that in the immediate budget.
Meanwhile, he thinks he can do a few other things to protect his Exchange databases. First, he is now doing full database backups every night. This eliminates his dependence on long chains of transaction logs and ensures that, at worst, a bad tape costs him only a day of data.
He also wants to get a beefier UPS that will last longer than 15 minutes. He thinks it will be cost-effective and easy to justify.
“When you consider the value of your data, the $300 price tag is not much in comparison,” Peter said.
Finally, he plans to take the time to write a script that will automatically start the shutdown sequence for his Exchange server. Most UPS equipment comes with a USB or serial connection so that it can send a signal to the server when a power outage is detected. He can use that signal to initiate the script.
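A minimal sketch of such a script for Exchange 5.5, assuming the UPS monitoring software can be configured to run a batch file when it detects a power failure. The service names are the standard Exchange 5.5 services; the shutdown utility here is the one from the NT Resource Kit, and the exact switches should be checked against your tools:

```
@echo off
rem Triggered by the UPS monitoring software on power failure.
rem Stop the Exchange services in dependency order; the /y switch
rem answers the "stop dependent services?" prompt automatically.
net stop MSExchangeIMC /y
net stop MSExchangeMTA /y
net stop MSExchangeIS /y
net stop MSExchangeDS /y
net stop MSExchangeSA /y
rem With the stores dismounted cleanly, power down the operating system.
shutdown /l /t:10 /y /c
```

Stopping the information store (MSExchangeIS) cleanly is the step that matters most: it commits the open transactions and marks the databases consistent, which is exactly what the interrupted power-down failed to do.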
“It’s hard to plan for natural disasters because they are so unpredictable,” Peter said. So now he’s working to make the unpredictability irrelevant.
Have you learned IT lessons the hard way?