We've all seen situations in which a server crashes and, for one reason or another, there is no usable backup of the data stored on that server. Needless to say, that is a bad position to be in, but what happens if the server that fails is one of the servers responsible for keeping Active Directory up and running? Unfortunately, these servers are just as prone to failure as file servers are. Although most of us dutifully back up our data each night, few administrators regularly make full system backups of infrastructure servers. This means that if one of these servers were to fail, you could be left with nothing but your wits to bring the server back online.
I know that right now, some people reading this are patting themselves on the back for having a full backup of each of their servers stashed in a vault somewhere. The problem is that Windows treats system state backups that are older than the Active Directory tombstone lifetime (60 days by default) as unusable, so unless those backups were made within the last two months, they could be useless. In this article, I will discuss various types of infrastructure server failures and what you can do about them.
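To make the two-month rule concrete, here is a small Python sketch that checks whether a system state backup is still younger than the tombstone lifetime. It assumes the 60-day default mentioned above. Your forest may be configured with a different value. The function name and dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: 60 days, the default tombstone lifetime cited in the article.
# A forest can be configured with a different value.
DEFAULT_TOMBSTONE_DAYS = 60


def backup_is_usable(backup_date: date, today: date,
                     tombstone_days: int = DEFAULT_TOMBSTONE_DAYS) -> bool:
    """Return True if the backup is strictly younger than the tombstone lifetime."""
    return (today - backup_date) < timedelta(days=tombstone_days)


if __name__ == "__main__":
    today = date(2024, 3, 1)  # hypothetical "today"
    for taken in (date(2024, 2, 1), date(2023, 12, 1)):
        verdict = "usable" if backup_is_usable(taken, today) else "too old"
        print(f"{taken.isoformat()}: {verdict}")
```

The tombstone_days parameter lets you substitute your forest's actual setting for the default.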
Domain Controller Failure
The first type of failure that I want to discuss is a domain controller failure. Normally, a domain controller failure is relatively harmless. That's because most networks have at least two domain controllers, and if one domain controller goes down, then the other domain controller takes over the failed domain controller's workload.
Sadly, there are exceptions to every rule. For example, if the only domain controller within a site were to fail, Active Directory probably wouldn't come to a screeching halt. You would likely notice your WAN link becoming congested with authentication traffic, though, because clients in that site would have to authenticate against domain controllers in other sites.
Another situation in which a failed domain controller is a major problem is an environment with only a single domain controller. In this situation, the one existing domain controller pretty much is Active Directory, so if that domain controller were to fail, Active Directory would basically be gone. Over the years, I have seen several small companies that ran a single Small Business Server, and that server ended up having a major problem for some reason. In most of those cases, no usable infrastructure-level backup existed, and the server basically had to be reconstructed from scratch.
If your office has more than a handful of users, then chances are that you have more than one domain controller. Even so, there are situations in which a domain controller failure can bring an organization to its knees. That's because not all domain controllers are created equal. For example, a domain controller might also act as a global catalog server or a DNS server, or it may hold one of Active Directory's operations master roles.
If a domain controller that performs one of these additional tasks fails, there are usually steps that you can take to bring your network back to a functional state. However, if the failed server was performing several of these tasks at once, such as acting as both a DNS server and a global catalog server, it can be a lot more difficult to return the network to a functional state. If possible, I recommend taking steps to minimize the impact of a server failure before a failure actually happens. For example, you could designate multiple servers as global catalog servers, or you could create a secondary DNS server. Of course, none of these steps is a good substitute for a current backup.
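One way to act on this advice is to keep a simple inventory of which servers perform which extra duties, and then look for any duty that only one server performs. The Python sketch below does this. The inventory, server names, and duty labels are made up for illustration and are not read from Active Directory.

```python
from collections import defaultdict

# Hypothetical inventory: server name -> extra duties it performs.
INVENTORY = {
    "DC1": {"global catalog", "DNS", "schema master", "PDC emulator"},
    "DC2": {"DNS"},
    "DC3": set(),
}


def single_points_of_failure(servers: dict[str, set[str]]) -> dict[str, str]:
    """Return {duty: server} for every duty that exactly one server performs."""
    holders: dict[str, list[str]] = defaultdict(list)
    for server, duties in servers.items():
        for duty in duties:
            holders[duty].append(server)
    return {duty: names[0] for duty, names in holders.items() if len(names) == 1}


if __name__ == "__main__":
    for duty, server in sorted(single_points_of_failure(INVENTORY).items()):
        print(f"{duty}: only {server}")
```

Operations master roles will always appear in this report, because each one is held by exactly one server at a time. For those roles, the value is in knowing where they live before something fails.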
DNS Server Failure
Thankfully, I have never had to recover from a situation in which an organization's one and only DNS server suffered a catastrophic failure. Unfortunately, that also means I can't give you any advice on this matter that's based on real-world experience. Microsoft's Knowledge Base and my MCSE training books don't seem to address the issue either.
What I can tell you is that Active Directory is completely dependent on DNS. If your DNS server fails, you basically don't have an Active Directory. The good news is that if you are using Active Directory-integrated zones, your DNS information is actually stored within Active Directory rather than in a zone file on the DNS server's hard drive. If the DNS server can't be recovered, the trick is to get another server to act as the DNS server for the domain.
Like I said, I have never tried this technique myself, but several Web sites indicate that one way to recover from such a problem is to install the DNS service on one of your remaining domain controllers. As you install the DNS service, you will have to manually create a forward lookup zone bearing the name of the domain that the domain controller belongs to. You must then configure the machine's TCP/IP settings so that its preferred DNS server is the machine's own IP address.
After you have done so, the server will supposedly begin populating the DNS server with the necessary records. Once the DNS server is up and running, you will have to point all of the other machines to the newly created DNS server. If you don't have an easy way of doing this, you can take advantage of the fact that a Windows network adapter can be assigned multiple IP addresses. Simply add your old DNS server's IP address to the new DNS server as an additional address, and all of the machines will automatically be pointing to the correct DNS IP address.
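If you do try this approach, one way to confirm that the records have appeared is to look up the SRV records that domain controllers register for the DC locator. The Python sketch below only builds the record names for a given domain and prints matching nslookup commands. The name example.com is a placeholder, and the list is a common subset of these records, not every record a domain controller registers.

```python
from typing import Optional


def dc_locator_records(domain: str, forest_root: Optional[str] = None) -> list[str]:
    """Build SRV record names that a domain controller registers in DNS.

    domain is the DNS name of the Active Directory domain. forest_root defaults
    to the same name, which is only correct when the domain is the forest root.
    """
    forest_root = forest_root or domain
    return [
        f"_ldap._tcp.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.pdc._msdcs.{domain}",
        f"_gc._tcp.{forest_root}",  # global catalog records use the forest root name
    ]


if __name__ == "__main__":
    # "example.com" is a placeholder domain name.
    for name in dc_locator_records("example.com"):
        print(f"nslookup -type=SRV {name}")
```

Run each printed command from a machine that points at the new DNS server. Any missing records suggest the server has not finished registering them.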
Keep in mind though that the information in this section is only based on research. The only part of this section that I know for sure works is the multiple IP address trick. The remainder of the article is based on personal experience and proven techniques though.
Global Catalog Server Failure
In an Active Directory environment, a global catalog failure is a serious problem, but it's a problem with an easy fix. When you create a domain in an Active Directory environment, the first domain controller to be placed in that domain is designated as a global catalog server. By default, Windows does not create any other global catalog servers for the domain. The problem is that if your global catalog server fails, then nobody (except for the Administrator) will be able to log on until the server is either brought back online or a new global catalog server is created.
If your global catalog server has experienced a serious failure with little chance of recovery, then your best option is to designate another domain controller to be a global catalog server. Active Directory doesn't care how many global catalog servers exist within a domain, so you don't have to worry about "stealing" the global catalog server function from another server. Instead, you will just tell an additional domain controller to act as a global catalog server.
To do so, go to a domain controller in the same domain as the failed server and open the Active Directory Sites and Services console. Navigate through the console tree to Active Directory Sites and Services | Sites | Default-First-Site-Name (or whichever site contains the server) | Servers | the server that you've chosen to act as a global catalog server | NTDS Settings. Right-click the NTDS Settings container and select the Properties command from the resulting shortcut menu. The NTDS Settings Properties sheet appears. Select the Global Catalog check box on the properties sheet's General tab, and click OK. After about five minutes, the server will begin to function as a global catalog server.
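Behind the scenes, the console path above corresponds to an NTDS Settings object in the configuration partition. The Global Catalog check box corresponds to bit 0x1 of that object's options attribute. The Python sketch below only builds the distinguished name and manipulates the flag value, to show what the check box represents. It does not connect to a directory, and the server, site, and forest names are hypothetical. Make the actual change through the console.

```python
# Bit 0x1 of the options attribute on an NTDS Settings object marks the
# domain controller as a global catalog server.
IS_GC_FLAG = 0x1


def ntds_settings_dn(server: str, site: str, forest_root: str) -> str:
    """Build the DN of a server's NTDS Settings object in the configuration partition."""
    dc_parts = ",".join(f"DC={label}" for label in forest_root.split("."))
    return (f"CN=NTDS Settings,CN={server},CN=Servers,CN={site},"
            f"CN=Sites,CN=Configuration,{dc_parts}")


def with_gc_enabled(options: int) -> int:
    """Return the options value with the global catalog bit set, other bits kept."""
    return options | IS_GC_FLAG


def is_global_catalog(options: int) -> bool:
    """Report whether the global catalog bit is set."""
    return bool(options & IS_GC_FLAG)


if __name__ == "__main__":
    # Hypothetical server, site, and forest root names.
    print(ntds_settings_dn("DC2", "Default-First-Site-Name", "example.com"))
    print(is_global_catalog(with_gc_enabled(0)))
```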
Operations Master Roles
Earlier I explained that a single domain controller failure can really cripple your network if the failed domain controller happens to hold one of Active Directory's operations master roles. In most cases, the effects of such a failure wouldn't be immediately noticeable, but eventually the failure would impact Active Directory's functionality.
In the sections below, I will briefly describe the various operations master roles and the impact you can expect if the server holding each role fails. As you read, keep in mind that some operations master roles exist at the domain level while others exist at the forest level. This means that a failure could impact either a single domain or the entire forest, depending on the role that the failed server was performing. Also bear in mind that a server holding one operations master role often holds several. By default, the first domain controller in a domain is assigned all of that domain's roles, and the first domain controller in the forest also holds the forest-wide roles.
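Here is a compact Python sketch of the scope of each role, which can help you work out how far a single failure will reach. The role labels are shorthand chosen for this sketch. The scopes follow standard Active Directory behavior: two forest-wide roles and three roles per domain.

```python
from enum import Enum


class Scope(Enum):
    FOREST = "forest"
    DOMAIN = "domain"


# One holder per forest for the first two roles, one per domain for the rest.
ROLE_SCOPE = {
    "schema master": Scope.FOREST,
    "domain naming master": Scope.FOREST,
    "PDC emulator": Scope.DOMAIN,
    "RID master": Scope.DOMAIN,
    "infrastructure master": Scope.DOMAIN,
}


def role_holder_count(num_domains: int) -> int:
    """Total number of role assignments in a forest with num_domains domains."""
    forest_roles = sum(1 for s in ROLE_SCOPE.values() if s is Scope.FOREST)
    domain_roles = sum(1 for s in ROLE_SCOPE.values() if s is Scope.DOMAIN)
    return forest_roles + domain_roles * num_domains


def failure_reach(roles_held: set[str]) -> dict[str, list[str]]:
    """Split a failed server's roles into forest-wide and single-domain impact."""
    reach: dict[str, list[str]] = {"forest": [], "domain": []}
    for role in sorted(roles_held):
        reach[ROLE_SCOPE[role].value].append(role)
    return reach


if __name__ == "__main__":
    print(role_holder_count(3))
    print(failure_reach({"schema master", "PDC emulator"}))
```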
Domain Naming Master
The domain naming master role is performed only at the forest level. The domain naming master's purpose is to act as an authoritative collection of domain names. When an administrator creates a new domain, it takes a moment for information about the new domain to be replicated across Active Directory. It's theoretically possible that someone else could attempt to create a new domain with the same name before the replication cycle completes. This would cause a major problem because two different domains would have the same name (remember that Windows defines a domain by its GUID, not by its name).
To avoid such problems, Windows uses the domain naming master role. Any time someone creates a new domain, Windows checks with the domain naming master to see whether a domain by that name already exists. If no such domain exists, the name is recorded on the domain naming master before the domain is actually created. Now, if someone else tried to create a domain with the same name before the new domain had replicated, Windows would check the domain naming master and would therefore know that the creation process had already begun.
Typically a domain naming master failure goes unnoticed unless someone tries to create a new domain or remove an existing domain. Such actions will generate an error message if the domain naming master isn't available to perform its task.
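The check-then-reserve logic described above can be sketched as a toy model in Python. This illustrates the idea only and does not reflect how Windows implements it. The class and exception names are hypothetical, and names are compared case-insensitively as a simplification.

```python
class DomainNamingMasterUnavailable(RuntimeError):
    """Raised when the simulated domain naming master cannot be reached."""


class DomainNamingMaster:
    """Toy model of a single forest-wide authority for domain names."""

    def __init__(self) -> None:
        self._names: set[str] = set()
        self.online = True

    def _require_online(self) -> None:
        if not self.online:
            raise DomainNamingMasterUnavailable("domain naming master is unreachable")

    def reserve(self, name: str) -> None:
        """Record a new domain name before the domain is created."""
        self._require_online()
        key = name.lower()  # simplification: treat names case-insensitively
        if key in self._names:
            raise ValueError(f"domain {name!r} already exists or is being created")
        self._names.add(key)

    def release(self, name: str) -> None:
        """Forget a domain name when the domain is removed."""
        self._require_online()
        self._names.discard(name.lower())


if __name__ == "__main__":
    dnm = DomainNamingMaster()
    dnm.reserve("sales.example.com")
    try:
        dnm.reserve("SALES.example.com")  # a second administrator, same name
    except ValueError as err:
        print("rejected:", err)
```

Setting online to False reproduces the failure case. Both creating and removing a domain raise an error, while every other operation is unaffected.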
Schema Master
Like the domain naming master, the schema master is a forest-wide role. The schema master's purpose is to maintain the Active Directory schema. Any time a change is made to the schema, the change is applied directly to this server. As with the domain naming master, schema master failures typically go unnoticed until someone (or an application) tries to update the schema.
PDC Emulator
The PDC emulator is a domain-specific role. The PDC emulator serves as the primary domain controller in a mixed mode environment. If the PDC emulator fails, the consequences depend on your network. If you have a mixture of Windows NT 4.0 and Windows 2000 / 2003 domain controllers, then a PDC emulator failure is basically the same as having the PDC fail in a Windows NT environment. You will likely have problems using tools such as User Manager for Domains and Server Manager, and you may have trouble resetting passwords. In a domain running only Windows 2000 / 2003 domain controllers, a PDC emulator failure is no more catastrophic than the failure of any other domain controller.
Relative Identifier Master
The relative identifier master role is a domain-specific role. The relative identifier (RID) master is responsible for distributing relative identifiers within the domain. If the RID master fails, the problem is usually noticed when an administrator (or an application) creates Active Directory objects.
An administrator can continue to create objects until the domain controller runs out of relative identifiers. When that happens, the domain controller must get more from the RID master. If the RID master is malfunctioning, that request fails. The administrator then won't be able to create any more security principals (users, groups, and computers) on that domain controller until the problem is fixed.
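The pool-based behavior can be illustrated with a toy Python model. The pool size (500 is the commonly cited default) and the starting RID are assumptions chosen for illustration, and the class names are hypothetical. Real domain controllers request a fresh pool before the current one is completely used up. The sketch simplifies this to match the description above.

```python
class RidMasterUnavailable(RuntimeError):
    """Raised when the simulated RID master cannot hand out a new pool."""


class RidMaster:
    """Toy RID master that hands out non-overlapping blocks of RIDs."""

    def __init__(self, pool_size: int = 500, first_rid: int = 1000) -> None:
        # Both defaults are illustrative assumptions.
        self.pool_size = pool_size
        self.next_rid = first_rid
        self.online = True

    def allocate_pool(self) -> range:
        if not self.online:
            raise RidMasterUnavailable("RID master is unreachable")
        pool = range(self.next_rid, self.next_rid + self.pool_size)
        self.next_rid += self.pool_size
        return pool


class DomainController:
    """Toy domain controller that draws RIDs from its local pool."""

    def __init__(self, rid_master: RidMaster) -> None:
        self.rid_master = rid_master
        self._pool = iter(())  # starts with an empty pool

    def create_security_principal(self) -> int:
        rid = next(self._pool, None)
        if rid is None:
            # Local pool exhausted: ask the RID master for another block.
            self._pool = iter(self.rid_master.allocate_pool())
            rid = next(self._pool)
        return rid


if __name__ == "__main__":
    master = RidMaster(pool_size=3)
    dc = DomainController(master)
    print([dc.create_security_principal() for _ in range(3)])
    master.online = False
    try:
        dc.create_security_principal()
    except RidMasterUnavailable as err:
        print("creation failed:", err)
```

Note that in this model, a second domain controller that still has RIDs left in its pool keeps working after the RID master goes offline. This is why the problem tends to surface gradually rather than all at once.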
Infrastructure Master
The infrastructure master role is a domain-specific role that is responsible for maintaining the consistency of objects within the domain and objects within the global catalog. Administrators usually notice infrastructure master failures when they aren't able to move or modify large numbers of objects.
Transferring Operations Master Roles
If you have had a catastrophic domain controller failure, and that domain controller holds an operations master role, then you will need to move the role from the failed server to a functional server. Transferring an operations master role isn't a task to be taken lightly though. Before you even think of trying to transfer an operations master role, you must verify that the failed server really was the server holding the role.
Unfortunately, there isn't one single location that you can check to see which roles are assigned to a server. Instead, you will have to check each role individually. The steps that I'm about to show you can be performed on any functional domain controller within the same domain as the failed domain controller.
You can identify the domain naming master by opening the Active Directory Domains and Trusts console. When the console opens, right-click the Active Directory Domains and Trusts node and select the Operations Master command from the resulting context menu. A dialog box then shows the name of the server that currently performs the domain naming master role.
To identify the schema master, you must first register the Active Directory Schema snap-in. To do so, log in as an administrator and open a Command Prompt window. When the window opens, enter the following command:
regsvr32 schmmgmt.dll
Now that you have registered the Active Directory Schema snap-in, enter the MMC command at the Run prompt. A Microsoft Management Console session opens. Select the Add/Remove Snap-in command from the File menu to display the Add/Remove Snap-in properties sheet. Click the Add button to display a list of all of the available snap-ins. Select Active Directory Schema from the list and click the Add button, followed by the Close and OK buttons. The Active Directory Schema snap-in now appears within the console.
To display the server that's serving as the schema master, right-click the Active Directory Schema node in the left pane and select the Operations Master command from the resulting context menu. A window then identifies the schema master.
You can identify the PDC emulator, RID master, and infrastructure master for a domain through the Active Directory Users and Computers console. To do so, open the Active Directory Users and Computers console. When the console opens, right-click the Active Directory Users and Computers node in the left pane, and select the All Tasks | Operations Masters commands from the resulting shortcut menus.
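If you would rather script the check than click through three consoles, the netdom query fsmo command prints each role holder on its own line. On Windows Server 2003, netdom is part of the Windows Support Tools. The Python sketch below parses that kind of output and reports which roles a failed server held. The sample output and server names are illustrative. The assumed layout (role name, two or more spaces, server name) should be compared against the output on your own system.

```python
import re

# Illustrative netdom query fsmo output. Server names are hypothetical and
# the column layout is an assumption.
SAMPLE_OUTPUT = """\
Schema master               dc1.example.com
Domain naming master        dc1.example.com
PDC                         dc2.example.com
RID pool manager            dc2.example.com
Infrastructure master       dc1.example.com
The command completed successfully.
"""

# Role name, at least two whitespace characters, then one server-name token.
ROLE_LINE = re.compile(r"^(?P<role>\S.*?)\s{2,}(?P<holder>\S+)\s*$")


def parse_role_holders(output: str) -> dict[str, str]:
    """Map each role label to its holder (lowercased)."""
    holders = {}
    for line in output.splitlines():
        match = ROLE_LINE.match(line)
        if match:
            holders[match.group("role")] = match.group("holder").lower()
    return holders


def roles_held_by(output: str, server: str) -> list[str]:
    """List the roles held by server, matching either the FQDN or the short name."""
    server = server.lower()
    return [role for role, holder in parse_role_holders(output).items()
            if server in (holder, holder.split(".")[0])]


if __name__ == "__main__":
    print(roles_held_by(SAMPLE_OUTPUT, "DC1"))
```

An empty list for the failed server suggests that no role needs to be moved.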
When you use one of the methods above to view which server holds a particular role, Windows gives you the option of transferring the role to a different server. However, you can't use the transfer option if the server that holds the role has failed. What you can do instead is seize the role from the failed server and assign it to a different server.
Before I show you how to do this, though, you need to understand one thing: seizing a role should be a last resort. Seize a role only if you can guarantee that the failed server will never come back online with its current Windows installation. (You can always install a fresh copy of Windows and reuse the hardware.) If you did manage to resurrect the server and brought it online after seizing its operations master role, it would cause some serious problems, because two domain controllers would each believe they held the same role. I should also warn you not to attempt to seize a role unless you have a functional DNS server and at least one functional global catalog server.
To seize a role, open a Command Prompt window and enter the following commands:
NTDSUTIL
ROLES
CONNECTIONS
CONNECT TO SERVER servername
(in this case, servername is the server
that you're going to move the role to)
QUIT
Now, enter the command that corresponds to the role you want to seize:
SEIZE INFRASTRUCTURE MASTER
SEIZE PDC
SEIZE RID MASTER
SEIZE SCHEMA MASTER
SEIZE DOMAIN NAMING MASTER
When you are finished, enter QUIT twice to exit NTDSUTIL.
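If you need to seize roles in more than one domain, it can help to build the NTDSUTIL command sequence in a script that also forces you to confirm the precautions described earlier. The Python sketch below only builds an argument list, relying on the fact that NTDSUTIL accepts its menu commands as command-line arguments. The function name is hypothetical. On Windows Server 2008 and later, the domain naming master command is SEIZE NAMING MASTER, so verify the keywords against your version before running anything.

```python
# Keywords follow the NTDSUTIL commands listed above (Windows Server 2003 syntax).
SEIZE_COMMAND = {
    "infrastructure master": "seize infrastructure master",
    "PDC emulator": "seize pdc",
    "RID master": "seize rid master",
    "schema master": "seize schema master",
    "domain naming master": "seize domain naming master",
}


def build_seize_args(role: str, target_server: str, *,
                     failed_server_retired: bool,
                     dns_working: bool,
                     global_catalog_working: bool) -> list[str]:
    """Return an NTDSUTIL argument list that seizes role onto target_server.

    Refuses to build the command unless the article's precautions are met.
    """
    if not failed_server_retired:
        raise ValueError("seize only if the failed server will never return "
                         "with its current Windows installation")
    if not (dns_working and global_catalog_working):
        raise ValueError("a functional DNS server and global catalog server are required first")
    return [
        "ntdsutil",
        "roles",
        "connections",
        f"connect to server {target_server}",
        "quit",               # leave server connections, back to fsmo maintenance
        SEIZE_COMMAND[role],
        "quit",               # leave fsmo maintenance
        "quit",               # leave ntdsutil
    ]


if __name__ == "__main__":
    args = build_seize_args("RID master", "DC2", failed_server_retired=True,
                            dns_working=True, global_catalog_working=True)
    print(" ".join(f'"{a}"' if " " in a else a for a in args))
```

On a functional domain controller, you could pass the list to subprocess.run. The sketch deliberately stops short of executing anything.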