I recently had to sort out an issue with a failed mirror set
(i.e., RAID 1) on a Windows Server 2003 domain controller. No problem, I
thought. Well, not quite. The mirror had to be deleted, taking everything from
both drives with it. Restoring Active Directory through backup failed. To make
a bad situation worse, the DC was the holder of all the Flexible Single Master
Operations (FSMO) roles in this (single) domain. Transferring the roles
failed; seizing them was problematic. Disaster recovery? Indeed! This article
will show you how to get such a DC, and the whole domain, back from the brink.
As you’ll see, a disaster recovery plan is about more than generalities.
Disaster scenario
You have your disaster recovery plan all neatly set out.
Then disaster strikes: A Windows Server 2003 domain controller goes down. Okay,
not a train smash; you’ve got up-to-date backups. But restoring Active
Directory via backup fails. Now what? Well, you can still reinstall Server 2003
and restore user data from backup. (The latter works; you’ve checked.)
There’s only one problem: This server was the holder of all
the FSMO roles. So you’re starting to sweat a little, but not too profusely. You
know about transferring FSMO roles to another domain controller. But what if
that fails? Yes, you can try seizing them. At this stage, you’re looking at the
stuff disasters are made of, because now your whole domain teeters on the brink.
(I’ll explain why in a moment.)
Admittedly, this is a very particular (and very unfortunate)
scenario. But then, the nature of a disaster is its unpredictability. And there
are a couple of general lessons to be learned from this specific incident. Here’s
what I did and what I learned along the way.
In this situation, the failed mirror could not be rebuilt in
a nondestructive way (I won’t go into the whys and wherefores here), making
loss of all data on both drives inevitable. I tried restoring AD from backup.
It failed, presumably because the backup software that was used (an old version
for NT) didn’t back up the system state data. Trying to restore with Server
2003’s own backup utility (ntbackup.exe) didn’t work either. It didn’t
recognize the backup format of the legacy software.
Lose the roles and you’re lost
Next, I tried transferring the FSMO roles that were held by
this DC to another DC in the domain. It failed. Then I attempted seizing the
roles, but the error messages I got (Figure
A) did not look promising. I nevertheless attempted seizing every role, and
strangely enough, after completing the whole procedure described below, I saw
that the roles had been seized successfully. (Don’t ask me, ask Microsoft.)
Figure A: The attempt at seizing the roles resulted in these errors. (Note: the domain name, DC/server name, and CN name have been edited out for security reasons.)
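For a disaster recovery document, the seizure procedure can be written down as an explicit command list. The following is a minimal Python sketch, not the procedure used in this incident: it only builds the sequence of commands you would type at the ntdsutil prompts to seize all five roles onto a surviving DC. The function name seize_commands and the server name DC2 are made up for illustration, and the role names are those used by the Windows Server 2003 version of ntdsutil, so check them against your own version before relying on them.

```python
# The five FSMO roles, spelled the way the Windows Server 2003 ntdsutil
# "roles" (fsmo maintenance) menu expects them.
FSMO_ROLES = [
    "schema master",
    "domain naming master",
    "RID master",
    "PDC",
    "infrastructure master",
]


def seize_commands(target_dc):
    """Return the ntdsutil commands that seize every FSMO role onto target_dc.

    target_dc is the surviving domain controller that should hold the roles
    (a hypothetical name is used in the example below).
    """
    commands = [
        "roles",                            # enter fsmo maintenance
        "connections",                      # enter server connections
        f"connect to server {target_dc}",   # bind to the surviving DC
        "quit",                             # back to fsmo maintenance
    ]
    commands += [f"seize {role}" for role in FSMO_ROLES]
    commands += ["quit", "quit"]            # leave fsmo maintenance, then ntdsutil
    return commands


if __name__ == "__main__":
    # Print a checklist to paste into the recovery documentation.
    for step, command in enumerate(seize_commands("DC2"), start=1):
        print(f"{step:2d}. {command}")
```

The script does not run ntdsutil itself. It just produces a checklist, so the person doing the recovery still types (and confirms) each seizure by hand.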
But what happens if you do lose the FSMO roles? Let’s just
say that losing some of them can have bone-chilling implications. For example,
without the RID Master, if you have more than one domain, you won’t (with
immediate effect) be able to move security principals from one domain to
another. You also won’t be able to add new users, groups, and computers to the
domain. You won’t experience the latter problem immediately, as each DC in the
domain has its own pool of RIDs (allocated in blocks of 500 by default). But
once that pool runs out, you’re dead in the water. Now
you’re faced with the prospect of rebuilding the whole domain.
Replication to the rescue
So what are your options (apart from re-creating the
domain)? Reinstalling and replicating. If you have a big AD (and maybe slow WAN
links), replication is not an attractive option, but it might be your only
choice. If you have another DC in the same site as the failed DC, you’re in
luck, because replication will be much faster.
Tip: If it will
speed up things, take the DC you’re reinstalling to the same location as the
one you intend replicating from. In my case, the two DCs in the same site were
separated by a wireless link that would have slowed replication down, so I took
the one across.
Reinstall Windows Server 2003 on the failed machine, make it
a DC (run DCPromo), and install and restore whatever other services there were
on the machine, like DHCP, WINS, DNS, and IIS. When you’re finished, start
replicating. Now you’re ready to restore your data.
First, clean up
Before you reinstall Windows Server 2003 on the failed machine
and make it a DC, there’s an important job to do: a metadata cleanup. This
entails removing the dead DC from AD (more technically speaking, removing the
ntdsDSA object). You have to be an Enterprise Administrator to perform this
task.
A word of caution: Be absolutely sure this is the route you
want to take before you do the metadata cleanup. There’s no turning back (at
least none that I’m aware of).
How you perform the cleanup will differ depending on whether
you want to name your new DC the same as the old (failed) one. I suggest retaining
the old name, as it simplifies matters a lot (for example, with shares).
However, if you always wanted to rename that DC, now is the time.
Let’s start with the steps to follow if you want to give the
new DC the same name. In this case, you’ll have to remove the old DC’s ntdsDSA
object.
The commands differ slightly depending on whether the DC in
question has Service Pack 1 (SP1) installed. If SP1 is installed, metadata
cleanup also removes File Replication Service (FRS) connections and as part of
the process, tries to transfer or seize any operations master roles that the
retired DC holds.
- Type ntdsutil at the command prompt.
- At the ntdsutil: prompt, type metadata cleanup and press [Enter].
- If SP1 is installed, type remove selected server ServerName. (See Figure B.) If SP1 is not installed and you’re using the version of Ntdsutil.exe that’s included with Windows Server 2003 with no service pack, connect to the existing domain controller (in our case, the one in the same site as the failed DC) on which you want to remove the failed DC’s ntdsDSA object. To do this, type connections at the metadata cleanup prompt and press [Enter].
- Type connect to server <servername>, where <servername> is the DC that will be used to clean the metadata, and press [Enter]. It can be any working DC in the same domain, but we’ll use one in the same site. Figure C shows this step on a DC that does not have SP1 installed.
- Type quit and press [Enter].
- Type select operation target and press [Enter].
- Type list domains and press [Enter]. All domains in the forest will be listed.
- Type select domain <number> and press [Enter].
- Type list sites and press [Enter].
- Type select site <number> (the number of the site in which the DC was a member) and press [Enter].
- Type list servers in site and press [Enter].
- Type select server <number>, where <number> is that of the DC to be removed, and press [Enter].
- Type quit and press [Enter].
- Type remove selected server and press [Enter].
- Type quit and press [Enter] until you’re back at the command prompt.
Figure B: Starting the metadata cleanup process using ntdsutil on a DC with SP1 installed
Figure C: Starting the metadata cleanup process using ntdsutil on a DC without SP1 installed
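The connect-and-select path through the steps above (the one described for a DC without SP1) can be captured as a reusable checklist. This is a rough Python sketch under stated assumptions: the function name metadata_cleanup_commands and the server name DC2 are hypothetical, and the domain, site, and server numbers must be read from the output of the list commands in your own session. They are passed in as parameters because they cannot be known in advance.

```python
def metadata_cleanup_commands(working_dc, domain_no, site_no, server_no):
    """Return, in order, the commands typed after starting ntdsutil.

    working_dc -- any healthy DC in the domain (hypothetical name below)
    domain_no  -- number shown by "list domains" for the affected domain
    site_no    -- number shown by "list sites" for the failed DC's site
    server_no  -- number shown by "list servers in site" for the failed DC
    """
    return [
        "metadata cleanup",
        "connections",
        f"connect to server {working_dc}",
        "quit",                          # back to the metadata cleanup prompt
        "select operation target",
        "list domains",
        f"select domain {domain_no}",
        "list sites",
        f"select site {site_no}",
        "list servers in site",
        f"select server {server_no}",
        "quit",                          # back to the metadata cleanup prompt
        "remove selected server",        # the irreversible step
        "quit",                          # back to the ntdsutil prompt
        "quit",                          # back to the command prompt
    ]


if __name__ == "__main__":
    # Example values only; read the real numbers off the list output.
    for step, command in enumerate(metadata_cleanup_commands("DC2", 0, 0, 1), 1):
        print(f"{step:2d}. {command}")
```

Keeping the numbers as parameters is deliberate: the list commands have to be run and read before anything is selected, which forces a pause before the step that cannot be undone.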
If you’re going to take the plunge and give the DC a new
name, you’ll have to remove the failed server from the Sites & Services and
Users & Computers snap-ins. NB: Don’t do this if the new DC will
have the same name as the failed one.
- Open the Sites & Services snap-in.
- Select the relevant site.
- Delete the server object representing the failed DC.
- Open the Users & Computers snap-in.
- Select the domain controllers container.
- Delete the computer object associated with the failed DC.
Lessons
Here are some things you should know, check, and do before
disaster strikes:
- This might seem pretty obvious (but how many of us do it…):
  Plan for what-if (worst-case) scenarios. That’s what’s meant by
  “disaster”, right? Don’t bargain on anything (backups working, etc.).
- Outline procedures to recover from disasters like these.
  Put a fair amount of detail in your disaster recovery documentation. You need
  more than generalities. Have the procedures for tasks like seizing FSMO roles
  set out clearly as part of your disaster recovery plan. It will speed up
  recovery considerably in case of a crisis.
- Even better, test your procedures in the calm environment
  of a test lab.
- Regularly check that you have what it takes to recover from
  a disaster. For instance, how up-to-date is the backup of your system state
  data? When it comes to system state data, age matters. If your system state
  backup is older than the tombstone lifetime, you’re in for trouble. The default
  tombstone lifetime is 60 days. (A tombstone keeps tabs on objects deleted but
  not yet completely removed from AD.) To prevent inconsistencies in AD, you’re
  prevented from restoring data older than the tombstone lifetime.
- Prepare to speed up recovery (and take pressure off
  yourself) by making separate backups of DNS and DHCP and all server drivers.
- Ensure that your disaster recovery procedure is set out
  clearly and systematically, listing the steps to follow and the order in which
  things should be done.
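The tombstone check mentioned above is easy to automate as part of a routine backup review. Here is a minimal Python sketch, assuming the 60-day default quoted above. The function name backup_is_restorable and the dates are illustrative only, and the real tombstone lifetime of your forest should be looked up rather than assumed.

```python
from datetime import date, timedelta

# Default quoted in this article; your forest may be configured differently.
DEFAULT_TOMBSTONE_DAYS = 60


def backup_is_restorable(backup_date, today, tombstone_days=DEFAULT_TOMBSTONE_DAYS):
    """Return True if a system state backup is younger than the tombstone lifetime.

    A backup that is exactly tombstone_days old is treated as too old,
    which errs on the side of caution.
    """
    age = today - backup_date
    return age < timedelta(days=tombstone_days)


if __name__ == "__main__":
    # Illustrative dates only.
    last_backup = date(2006, 1, 1)
    for check_day in (date(2006, 2, 1), date(2006, 3, 15)):
        status = "OK" if backup_is_restorable(last_backup, check_day) else "TOO OLD"
        print(f"{check_day}: backup from {last_backup} is {status}")
```

Run on a schedule against the date of the last successful system state backup, a check like this flags a stale backup long before you need to restore from it.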
Potential pitfalls
Install the relevant service pack(s) and critical updates
immediately after reinstallation. Remember to check shares and permissions. I
also had to restore mapped drives. Also, remember to set up the time service
again if you had to follow the recovery route described above. And just to add
to the fun: If you apply Server 2003’s SP1, you might run into a problem with
the time server service not starting. You’ll find the solution here.