Troubleshooting Active Directory replication problems
In Windows Server 2003, the replication process is responsible for keeping each domain controller updated with the latest Active Directory information. The replication process is also responsible for keeping DFS replicas synchronized. Replication is therefore a very important part of the Windows Server 2003 network operating system. So what do you do when replication fails? For that matter, how do you even know when a failure has occurred? This article answers those questions and explains how to fix the replication process.
How does replication work?
Before you can fix the replication process, you need to understand how it works. As I mentioned earlier, replication is used to keep both domain controllers and DFS replicas synchronized. There are a few other tasks that use replication as well. For the purposes of this article, I will focus my discussion on Active Directory replication that occurs between domain controllers.
If you have ever worked with Windows NT, then you are probably familiar with the PDC and BDC domain controller roles. In such an environment, if someone needs to make an update to the Security Accounts Manager, the update gets applied to the PDC. The PDC then alerts the BDCs to the update and the BDCs download the updates and use them to update their own copies of the Security Accounts Manager. This structure is known as single master replication.
In contrast, Windows 2000 and Windows 2003 use multi-master replication. In multi-master replication, there is no PDC or BDC. Every domain controller contains a writable copy of the Active Directory database. If an administrator makes an update to the Active Directory, the update is applied to the closest available domain controller. The domain controller then uses the replication process to apply the update to the other domain controllers.
Because of the multi-master replication model, the Active Directory must have a technique for resolving conflicts. For example, suppose that two different administrators are making changes to the same attribute of the same user account at the same time. Now, suppose that those changes get written to two different domain controllers. When the next replication cycle occurs, you will have two domain controllers attempting to write contradictory data to the other domain controllers.
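To get around this problem, Windows relies on a "most recent change wins" rule. Windows compares the timestamps of the two changes, and whichever change was made most recently takes precedence. The other change is overwritten. (Strictly speaking, Active Directory first compares each attribute's version number and uses the timestamp only to break a tie, but when two administrators each make a single edit to the same attribute, the effect is the same: the later change wins.)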
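The timestamp rule described above can be illustrated with a short Python sketch. The `Change` record and the `resolve_conflict` function are hypothetical names invented for this illustration. They model only the timestamp comparison discussed here, not the replication metadata that Active Directory actually stores.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """A hypothetical record of one write to one attribute."""
    domain_controller: str   # where the change was originally written
    value: str               # the new attribute value
    timestamp: datetime      # when the change was made


def resolve_conflict(a: Change, b: Change) -> Change:
    """Return the change that survives: the one made most recently."""
    # An exact tie goes to `a` purely for simplicity; this sketch does not
    # model how a real directory breaks a tie.
    return a if a.timestamp >= b.timestamp else b


# Two administrators edit the same attribute five seconds apart.
first = Change("DC1", "555-0100", datetime(2004, 5, 1, 9, 0, 0))
second = Change("DC2", "555-0199", datetime(2004, 5, 1, 9, 0, 5))

winner = resolve_conflict(first, second)
print(winner.domain_controller, winner.value)
```

In this example the change written to DC2 survives because it was made later, which is exactly the situation in which the first administrator's edit appears to have been undone.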
I mention this because I've seen situations in which two administrators try to apply updates to user accounts and can't figure out why some of their changes are undone. If you suspect that you might have a replication problem, do a little checking to make sure that two or more people are not trying to update the same information at the same time.
Another aspect of replication that I want to touch on is inter-site replication, which is domain controller replication across two or more sites.
The idea behind Active Directory sites is that you want to avoid congesting slow WAN links with excessive replication traffic. Imagine for a moment that you have a domain spanning two offices and that each of the two offices has ten domain controllers. Also, imagine that these two offices are separated by a slow WAN link.
In a situation like this, every time anyone makes a change to the Active Directory, the change is replicated to nineteen other domain controllers, so nineteen copies of the same data flow across your network. To make matters worse, ten of those copies flow across your WAN link.
Now, imagine that someone is performing an Active Directory-intensive process, such as creating a hundred new user accounts. This process would cause at least a thousand different update sequences to flow across your WAN link. It is very possible that all of this traffic could choke the link, preventing other, more important traffic from flowing across it.
The solution to this problem is to divide the domain into two sites. When you do, one domain controller in each site is designated as a bridgehead server. The bridgehead server is responsible for sending and receiving batches of Active Directory updates. To see how sites work, let's return to my example of the company with ten domain controllers in each office, separated by a WAN link.
In this situation, if someone in an office made an update to a domain controller, only nine updates would be sent out instead of nineteen. These updates go to the other domain controllers in the local site. Remember, however, that one of these domain controllers is acting as the bridgehead server for the site. The bridgehead server receives the update and then sends a single copy of it across the WAN link to the remote bridgehead server. The remote bridgehead server receives the update and then distributes it to the domain controllers in the remote site.
As you can see, only a single copy of the update was transmitted across the WAN link instead of ten separate copies. When implemented correctly, sites can drastically reduce replication-related network traffic.
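The arithmetic behind this example is easy to check with a few lines of Python. The `wan_copies` function is a hypothetical helper written for this illustration. It uses the same simplified model as the text: one copy per update per remote domain controller without sites, and one copy per update with sites. It ignores any batching or compression that a real deployment might apply.

```python
def wan_copies(remote_dcs: int, use_sites: bool, updates: int = 1) -> int:
    """Copies of update data that cross the WAN link in this simplified model."""
    # Without sites, every remote domain controller receives its own copy.
    # With sites, only the bridgehead-to-bridgehead copy crosses the link.
    per_update = 1 if use_sites else remote_dcs
    return per_update * updates


print(wan_copies(10, use_sites=False))               # 10
print(wan_copies(10, use_sites=True))                # 1
print(wan_copies(10, use_sites=False, updates=100))  # 1000
```

The last line reproduces the "hundred new user accounts" case: a hundred updates, each sent to ten remote domain controllers, yields a thousand transfers across the WAN link when sites are not in use.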
Anytime you make an Active Directory update and the update isn't visible to those accessing other domain controllers within a reasonable amount of time, there's a chance that you have a replication problem. For example, imagine that an administrator creates a new user account. The administrator then calls the user to say that the new user account should be ready to use within about 20 minutes (after the next replication cycle completes). After about half an hour, the user calls back and says that she can't log in because Windows is telling her that her account doesn't exist. The administrator checks and, sure enough, the account exists. In this case, the account exists on the domain controller that the administrator is connected to, but the account has yet to be replicated to the domain controller that is processing the user's login, thus giving the illusion that the account doesn't exist.
If the company only has a few domain controllers, the administrator can actually use the Active Directory Users And Computers console to see which domain controllers the account has been written to. To do so, simply right-click on the domain name and select the Connect To Domain Controller command from the resulting shortcut menu. In doing so, the administrator will be able to connect individually to various domain controllers and see if the new account has been replicated.
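The manual check described above amounts to asking each domain controller in turn whether it can see the new account. The following Python sketch shows only that loop. The function name `controllers_missing_account`, the `lookup` callback, and the sample data are all hypothetical; a real check would have to query each domain controller rather than read from an in-memory dictionary.

```python
from typing import Callable, Iterable, List


def controllers_missing_account(
    controllers: Iterable[str],
    account: str,
    has_account: Callable[[str, str], bool],
) -> List[str]:
    """Return the domain controllers on which `account` is not yet visible."""
    return [dc for dc in controllers if not has_account(dc, account)]


# Stand-in data for illustration only: which accounts each controller holds.
directory = {
    "DC1": {"jsmith", "newuser"},
    "DC2": {"jsmith", "newuser"},
    "DC3": {"jsmith"},
}


def lookup(dc: str, account: str) -> bool:
    """Hypothetical lookup; a real one would connect to the controller."""
    return account in directory[dc]


print(controllers_missing_account(directory, "newuser", lookup))
```

Here the new account has reached DC1 and DC2 but not DC3, so DC3 is the controller worth investigating.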
This technique works great for small organizations, but what if your domain has 200 domain controllers? You don't want to have to individually check each one. This is where a tool called the Replication Monitor comes in. The Replication Monitor is a tool that allows you to see exactly what is happening with the replication process. It allows you to view the status of Active Directory replication and force replication if necessary.
The Replication Monitor is one of the Windows 2003 Support Tools and, therefore, isn't installed automatically as part of the operating system. (This tool is also included in the Windows 2000 Support Tools.) To install the Windows 2003 Support Tools, insert your Windows Server 2003 CD. Now, open My Computer and browse the CD's contents. Navigate to the CD's \SUPPORT\TOOLS folder, and then run the SUPTOOLS.MSI file.
When installation completes, there will be an option for the Support Tools on the Start | All Programs menu, but the Replication Monitor is not listed on this menu. To open the Replication Monitor, you must go to the \PROGRAM FILES\SUPPORT TOOLS folder and run the REPLMON.EXE file.
When the Replication Monitor opens, you'll see a big, mostly empty screen. The console is divided into two columns: the column on the left is labeled Monitored Servers, and the column on the right is labeled Log. No servers are monitored by default, and for good reason: in a large organization, if all domain controllers were automatically monitored, there would be so much data displayed that it would be very difficult to make sense of it all.
The first time I ever used the Replication Monitor, I was slightly upset that I was unable to automatically monitor all of my domain controllers. After all, I wanted a tool that would tell me where replication was failing, not a tool that would make me guess which server was failing and would then tell me if my guess was right. In a way, though, the Replication Monitor does tell you which server is failing.
Let's go back to the situation in which the administrator creates a user account but the user can't access the account because it has never been replicated. In a situation like this, you can use the Replication Monitor in conjunction with the information that you already know to figure out which domain controllers are failing to receive replication updates.
For example, the administrator knows that the domain controller on which he created the account has a copy of the account. The administrator can even find out which domain controller he is connected to by using the Connect To Domain Controller option in the Active Directory Users And Computers console. Upon selecting this option, the console will tell you which domain controller you are currently connected to before asking you which domain controller you would like to connect to.
The other useful tidbit of information in this situation is the user's physical location. By looking at which building the user is located in, the Administrator can determine if the user is trying to authenticate through a domain controller in the same site as the administrator's domain controller or through a domain controller in a remote site. For the sake of argument, let's assume that both the user and the administrator are in the same building and are, therefore, accessing domain controllers in a common site.
In a situation like this, every domain controller in a site sends updates to every other domain controller in the site. The administrator knows that the domain controller he is attached to is functional, so he can tell the Replication Monitor to monitor that domain controller. He can then watch to see which domain controllers fail to be updated. If there is a failure replicating Active Directory information to all of the other domain controllers, then the administrator's domain controller is probably the one with the problem. If, however, only one domain controller fails to receive updates, then that's the domain controller with the problems.
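The reasoning in the previous paragraph can be written down as a small decision function. This is only a sketch of the rule of thumb given above. The `diagnose` function and its input format (a mapping from partner name to whether that partner received the update) are assumptions made for this illustration and are not something the Replication Monitor produces.

```python
from typing import Dict


def diagnose(source_dc: str, partner_ok: Dict[str, bool]) -> str:
    """Interpret which partners received an update from `source_dc`."""
    failed = [dc for dc, ok in partner_ok.items() if not ok]
    if not failed:
        return "No replication failure observed."
    # If every one of several partners failed, the common factor is the source.
    # With a single partner the result is ambiguous, so the partner is
    # reported and the source should be checked as well.
    if len(failed) == len(partner_ok) and len(partner_ok) > 1:
        return f"Suspect the source: {source_dc}"
    return "Suspect the partner(s): " + ", ".join(failed)


print(diagnose("DC1", {"DC2": True, "DC3": True, "Brien": False}))
print(diagnose("DC1", {"DC2": False, "DC3": False, "Brien": False}))
```

The first call points at the one partner that did not receive the update. The second points back at the monitored server, because every partner failed.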
To perform such an operation, right-click on the Monitored Servers container within the Replication Monitor and select the Add Monitored Server command from the resulting shortcut menu. This will cause Windows to display the Add Server To Monitor dialog box. You can either enter the server's name directly or you can select the server from a list. Upon entering the server name, Windows will display the Active Directory in tree form. You will notice in Figure A that multiple domains are listed.
Expand the desired domain and you will see the other domain controllers in this domain. If you look at Figure A, you will notice that there is a red X over the icon for server Brien. In this case, I have purposefully taken this server offline so that you can see what a replication failure looks like. If you select the failing server, you can see log information that gives you additional information about the failure.
In a situation like this, the first thing you would want to do is right-click on the failing server, and select the Synchronize With This Replication Partner command from the resulting shortcut menu. When you do, the Replication Monitor will attempt to force replication. Of course, in this case, forcing replication is impossible because the server is down.
Fixing a replication problem
Once you have identified the problem server, the next step is to fix the problem. In every real life replication failure that I have ever seen, the problem was one of three things: the server was down; the server was having trouble with network communications; or the server's hard disk was full.
Therefore, I recommend going to the server and checking out the basics. Make sure that the server has plenty of hard disk space. Next, make sure that you can ping the functional domain controllers. It's important that you be able to ping by both IP address and host name. If you find that you can ping by IP address but not by host name, then it's likely that the machine is having trouble communicating with a DNS server. Make sure that TCP/IP is configured correctly and that the server's designated DNS server is functional.
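The basic checks above can be partly scripted. The following Python sketch uses only the standard library and makes several assumptions: the function names are hypothetical, name resolution is tested with `socket.gethostbyname` as a stand-in for pinging by host name, and the reachability results passed to `classify` are supplied by you after running ping by hand.

```python
import shutil
import socket


def free_disk_fraction(path: str) -> float:
    """Fraction of the volume containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def can_resolve(host_name: str) -> bool:
    """True if the host name resolves to an IP address."""
    try:
        socket.gethostbyname(host_name)
        return True
    except OSError:
        return False


def classify(reachable_by_ip: bool, reachable_by_name: bool) -> str:
    """Apply the ping rule of thumb from the text."""
    if reachable_by_ip and reachable_by_name:
        return "Connectivity and name resolution look fine."
    if reachable_by_ip:
        return "Likely a DNS problem: check the server's DNS settings."
    return "Likely a network problem: check the TCP/IP configuration."


print(classify(reachable_by_ip=True, reachable_by_name=False))
```

On the problem server you might call `free_disk_fraction` on the drive that holds the Active Directory database and `can_resolve` with the name of a working domain controller, then feed your ping results into `classify`.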
If everything checks out on the server, but it still can't receive replication updates, you are not completely out of luck. There are quite a few less common problems that can cause replication troubles, especially if you are dealing with replication across a site link. For example, when replicating across a site link, your designated bridgehead server may be too busy to handle its bridgehead duties effectively. You can find a description of these less common problems and their solutions in Microsoft's TechNet.