If a Windows Server 2003 domain controller goes down, can you pull it back from the brink? What if restoring AD via backup fails, the server holds all the FSMO roles, and transferring them to another DC fails, too? Here's how one network manager prevailed despite numerous setbacks.
I recently had to sort out an issue with a failed mirror set (i.e., RAID 1) on a Windows Server 2003 domain controller. No problem, I thought. Well, not quite. The mirror had to be deleted, taking everything from both drives with it. Restoring Active Directory through backup failed. To make a bad situation worse, the DC was the holder of all the Flexible Single Master Operations (FSMO) roles in this (single) domain. Transferring the roles failed; seizing them was problematic. Disaster recovery? Indeed! This article will show you how to get such a DC, and the whole domain, back from the brink. As you'll see, a disaster recovery plan is about more than generalities.
You have your disaster recovery plan all neatly set out. Then disaster strikes: A Windows Server 2003 domain controller goes down. Okay, not a train smash; you've got up-to-date backups. But restoring Active Directory via backup fails. Now what? Well, you can still reinstall Server 2003 and restore user data from backup. (The latter works—you've checked.)
There's only one problem: This server was the holder of all the FSMO roles. So you're starting to sweat a little, but not too profusely. You know about transferring FSMO roles to another domain controller. But what if that fails? Yes, you can try seizing them. At this stage, you're looking at the stuff disasters are made of, because now your whole domain teeters on the brink. (I'll explain why in a moment.)
Admittedly, this is a very particular (and very unfortunate) scenario. But then, the nature of a disaster is its unpredictability. And there are a couple of general lessons to be learned from this specific incident. Here's what I did and what I learned along the way.
In this situation, the failed mirror could not be rebuilt in a nondestructive way (I won't go into the whys and wherefores here), making loss of all data on both drives inevitable. I tried restoring AD from backup. It failed, presumably because the backup software that was used (an old version for NT) didn't back up the system state data. Trying to restore with Server 2003's own backup utility (ntbackup.exe) didn't work either. It didn't recognize the backup format of the legacy software.
Lose the roles and you're lost
Next, I tried transferring the FSMO roles that were held by this DC to another DC in the domain. It failed. Then I attempted seizing the roles, but the error messages I got (Figure A) did not look promising. I nevertheless attempted seizing every role, and strangely enough, after completing the whole procedure described below, I saw that the roles had been seized successfully. (Don't ask me, ask Microsoft.)
Figure A: The attempt at seizing the roles resulted in these errors. (Note: the domain name, DC/server name, and CN name have been edited out for security reasons.)
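For your own recovery notes, the seizure can be written down as the sequence of commands you would type into ntdsutil. The following is only a sketch: it builds the list of commands and does not run anything. The role command names are the ones I would expect on the fsmo maintenance menu of the Windows Server 2003 ntdsutil, and the function name and the server name DC2 are hypothetical, so check everything against your own system before relying on it.

```python
# Sketch only: builds the text you would type into ntdsutil; it runs nothing.
# The role names assume the Windows Server 2003 "fsmo maintenance" menu.
SEIZE_COMMANDS = [
    "seize schema master",
    "seize domain naming master",
    "seize RID master",
    "seize PDC",
    "seize infrastructure master",
]


def build_seize_sequence(target_dc):
    """Return the ntdsutil commands, in order, to seize all five roles onto target_dc."""
    if not target_dc:
        raise ValueError("target_dc must name a working domain controller")
    sequence = [
        "roles",
        "connections",
        f"connect to server {target_dc}",
        "quit",  # leave the connections menu, back to fsmo maintenance
    ]
    sequence.extend(SEIZE_COMMANDS)
    sequence.extend(["quit", "quit"])  # leave fsmo maintenance, then ntdsutil
    return sequence
```

Calling build_seize_sequence("DC2") and printing the result one command per line gives you a checklist to paste into your disaster recovery documentation.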
But what happens if you do lose the FSMO roles? Let's just say that losing some of them can have bone-chilling implications. For example, without the RID Master, if you have more than one domain, you immediately lose the ability to move security principals from one domain to another. You also won't be able to add new users, groups, and computers to the domain. You won't experience the latter problem immediately, as each DC in the domain holds its own pool of RIDs (allocated in blocks of 500 by default). But once that pool is used up, you're dead in the water. Now you're faced with the prospect of rebuilding the whole domain.
Replication to the rescue
So what are your options (apart from re-creating the domain)? Reinstalling and replicating. If you have a big AD (and maybe slow WAN links), replication is not an attractive option, but it might be your only choice. If you have another DC in the same site as the failed DC, you're in luck, because replication will be much faster.
Tip: If it will speed things up, take the DC you're reinstalling to the same location as the one you intend to replicate from. In my case, the two DCs in the same site were separated by a wireless link that would have slowed replication down, so I carried one machine across to the other.
Reinstall Windows Server 2003 on the failed machine, make it a DC (run DCPromo), and install and restore whatever other services there were on the machine, like DHCP, WINS, DNS, and IIS. When you're finished, start replicating. Now you're ready to restore your data.
First, clean up
Before you reinstall Windows Server 2003 on the failed machine and make it a DC, there's an important job to do: a metadata cleanup. This entails removing the dead DC from AD (more technically speaking, removing the ntdsDSA object). You have to be an Enterprise Administrator to perform this task.
A word of caution: Be absolutely sure this is the route you want to take before you do the metadata cleanup. There's no turning back (at least none that I'm aware of).
How you perform the cleanup will differ depending on whether you want to name your new DC the same as the old (failed) one. I suggest retaining the old name, as it simplifies matters a lot (for example, with shares). However, if you always wanted to rename that DC, now is the time.
Let's start with the steps to follow if you want to give the new DC the same name. In this case, you'll have to remove the old DC's ntdsDSA object.
The commands differ slightly depending on whether the DC in question has Service Pack 1 (SP1) installed. If SP1 is installed, metadata cleanup also removes File Replication Service (FRS) connections and, as part of the process, tries to transfer or seize any operations master roles that the retired DC holds.
- Type ntdsutil at the command prompt.
- At the ntdsutil: prompt, type metadata cleanup and press [Enter].
- If SP1 is installed, type remove selected server ServerName and press [Enter] (see Figure B), then skip to the last step. If SP1 is not installed (that is, you're using the version of Ntdsutil.exe that's included with Windows Server 2003 with no service pack), first connect to the existing domain controller (in our case, the one in the same site as the failed DC) on which you want to remove the failed DC's ntdsDSA object. To do this, type connections at the metadata cleanup prompt and press [Enter].
- Type connect to server <servername>, where <servername> is the DC that will be used to clean the metadata, and press [Enter]. It can be any working DC in the same domain, but we'll use one in the same site. Figure C shows this step on a DC that does not have SP1 installed.
- Type quit and press [Enter].
- Type select operation target and press [Enter].
- Type list domains and press [Enter]. All domains in the forest will be listed.
- Type select domain <number> and press [Enter].
- Type list sites and press [Enter].
- Type select site <number> (the number of the site in which the DC was a member) and press [Enter].
- Type list servers in site and press [Enter].
- Type select server <number>, where <number> is that of the DC to be removed, and press [Enter].
- Type quit and press [Enter].
- Type remove selected server and press [Enter].
- Type quit and press [Enter] until you're back at the command prompt.
Figure B: Starting the metadata cleanup process using ntdsutil on a DC with SP1 installed
Figure C: Starting the metadata cleanup process using ntdsutil on a DC without SP1 installed
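The steps above for a DC without SP1 can be captured as an ordered checklist. The sketch below only builds the list of commands; it does not run ntdsutil. Bear in mind that the domain, site, and server numbers are only known once you have seen the output of the list commands, so in practice you fill them in as you go. The function name and the example values are hypothetical.

```python
def build_metadata_cleanup(working_dc, domain_no, site_no, server_no):
    """Return the ntdsutil commands (no SP1) in the order of the steps above.

    working_dc is the healthy DC you connect to; the three numbers are the
    ones ntdsutil shows in its "list" output for the failed DC.
    """
    numbers = (("domain_no", domain_no), ("site_no", site_no), ("server_no", server_no))
    for name, value in numbers:
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer from the listing")
    return [
        "metadata cleanup",
        "connections",
        f"connect to server {working_dc}",
        "quit",                      # back to metadata cleanup
        "select operation target",
        "list domains",
        f"select domain {domain_no}",
        "list sites",
        f"select site {site_no}",
        "list servers in site",
        f"select server {server_no}",
        "quit",                      # back to metadata cleanup
        "remove selected server",    # the irreversible step
        "quit",
        "quit",                      # back at the command prompt
    ]
```

For example, build_metadata_cleanup("DC2", 0, 0, 1) yields the fifteen commands for removing the second server listed in the first site of the first domain, using DC2 as the working DC.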
If you're going to take the plunge and give the DC a new name, you'll have to remove the failed server from the Sites & Services and Users & Computers snap-ins. NB: Don't do this if the new DC will have the same name as the failed one.
- Open the Sites & Services snap-in.
- Select the relevant site.
- Delete the server object representing the failed DC.
- Open the Users & Computers snap-in.
- Select the Domain Controllers container.
- Delete the computer object associated with the failed DC.
Here are some things you should know, check, and do before disaster strikes:
- This might seem pretty obvious (but how many of us actually do it?): Plan for what-if (worst-case) scenarios. That's what's meant by "disaster," right? Don't count on anything, including working backups.
- Outline procedures to recover from disasters like these. Put a fair amount of detail in your disaster recovery documentation. You need more than generalities. Have the procedures for tasks like seizing FSMO roles set out clearly as part of your disaster recovery plan. It will speed up recovery considerably in a crisis.
- Even better, test your procedures in the calm environment of a test lab.
- Regularly check that you have what it takes to recover from a disaster. For instance, how up-to-date is the backup of your system state data? When it comes to system state data, age matters. If your system state backup is older than the tombstone lifetime, you're in for trouble. The default tombstone lifetime is 60 days (180 days in forests created with Windows Server 2003 SP1). (A tombstone keeps tabs on objects deleted but not yet completely removed from AD.) To prevent inconsistencies in AD, you're prevented from restoring data older than the tombstone lifetime.
- Prepare to speed up recovery (and take pressure off yourself) by making separate backups of DNS and DHCP and all server drivers.
- Ensure that your disaster recovery procedure is set out clearly and systematically, listing the steps to follow and the order in which things should be done.
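The tombstone check mentioned above is easy to turn into a routine test. Here is a minimal sketch, assuming you know the date of your last system state backup; the function name is hypothetical, and the 60-day default is simply the figure quoted in this article, so confirm the actual value for your forest.

```python
from datetime import date, timedelta

DEFAULT_TOMBSTONE_DAYS = 60  # figure quoted in the article; confirm for your forest


def backup_is_restorable(backup_date, today, tombstone_days=DEFAULT_TOMBSTONE_DAYS):
    """Return True if the system state backup is younger than the tombstone lifetime.

    A backup whose age equals the lifetime is treated as too old, to stay
    on the safe side.
    """
    age = today - backup_date
    if age < timedelta(0):
        raise ValueError("backup_date is in the future")
    return age < timedelta(days=tombstone_days)
```

For instance, backup_is_restorable(date(2006, 1, 1), date(2006, 2, 1)) checks a 31-day-old backup against the 60-day default. Running a check like this on a schedule tells you about a stale backup before you need to restore from it.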
Install the relevant service pack(s) and critical updates immediately after reinstallation. Remember to check shares and permissions. I also had to restore mapped drives. Also, remember to set up the time service again if you had to follow the recovery route described above. And just to add to the fun: If you apply Server 2003's SP1, you might run into a problem with the time server service not starting. You'll find the solution here.