SolutionBase: Troubleshooting the File Replication Service with Sonar

Fix File Replication Service problems with Sonar.

One of the most critical components in a Windows domain is the File Replication Service (FRS). The FRS allows domain controllers to replicate the contents of the system volume (SYSVOL), such as the Group Policy files and logon scripts that Active Directory depends on. It also allows servers within a Distributed File System (DFS) to replicate changes as users update files on individual DFS replicas. As with any complex operating system component, the FRS can, and sometimes does, break down. Microsoft makes tools that you can use to monitor the FRS so that you can spot potential problems and take corrective action when necessary. If the FRS has never malfunctioned, though, you may have had little reason to implement such tools.

If your FRS suddenly stops working one day and you don’t have any monitoring software in place, then there will be little forensic data to refer to for diagnostic purposes. In such a case, you might consider using Sonar to help find the problem. Sonar is a command line/GUI tool that can be used to help locate the source of an FRS failure.

Downloading Sonar
You can download Sonar directly from the Microsoft Web site. The Sonar Setup file is only 1 MB in size and should download in under a minute, even on a slow connection. Prior to installing Sonar, you must install the .NET Framework onto the system that will be running Sonar. The .NET Framework is simple to install, but the installation process can take quite a while to complete.

Installing Sonar
Sonar can be installed onto Windows 2000, Windows XP, or Windows Server 2003. The only catch is that if you install Sonar onto a computer that’s running Windows 2000 Professional or Windows XP, you will have to copy the file NTFRSAPI.DLL from a Windows 2000 Server. The file can be found in the server’s %SYSTEMROOT%\SYSTEM32 folder and should be copied to the %SYSTEMROOT%\SYSTEM32 folder on the Windows 2000 Professional or Windows XP machine.
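If you would rather script the copy than do it by hand, a minimal Python sketch follows. The server name FILESRV01 is hypothetical, and the sketch assumes that the server’s administrative ADMIN$ share (which normally maps to its %SYSTEMROOT% folder) is reachable with your credentials. Neither detail comes from the Sonar documentation, so treat this as one possible approach rather than a required procedure.

```python
import os
import shutil

DLL_NAME = "NTFRSAPI.DLL"


def dll_source_path(server):
    """Build the UNC path to the DLL on the server (assumes the ADMIN$ share)."""
    return "\\\\" + server + "\\ADMIN$\\System32\\" + DLL_NAME


def dll_destination_path(system_root):
    """Build the local destination path under the workstation's SYSTEM32 folder."""
    return system_root.rstrip("\\") + "\\System32\\" + DLL_NAME


def copy_dll(server):
    """Copy the DLL from the server to this workstation (must run on Windows)."""
    source = dll_source_path(server)
    destination = dll_destination_path(os.environ["SYSTEMROOT"])
    shutil.copy2(source, destination)
    return destination


if __name__ == "__main__":
    # FILESRV01 is a hypothetical Windows 2000 Server name.
    print(dll_source_path("FILESRV01"))
    print(dll_destination_path("C:\\WINNT"))
```

Calling copy_dll("FILESRV01") on the workstation performs the actual copy. The two path helpers only build strings, so you can check the paths before touching any files.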

After downloading the Sonar Setup file, double-click on it. The necessary files will be extracted and the installation process will begin. When the installation wizard launches, click Next to bypass the Welcome screen. Accept the end user license agreement, click Next, and click Install Now. This will install Sonar into the \Program Files\Resource Kit\ folder. Click Finish to complete the installation process.

Using Sonar
As you might have already guessed, Sonar is an older tool that was originally designed to work with Windows 2000 Servers. As such, Sonar is limited in its capacity. Sonar cannot correct problems that it finds. Instead, it passively reads data from the FRS. It’s up to you to figure out the best course of action to take based on that data. If you need a more full-featured FRS troubleshooting tool, then you would probably be interested in Ultrasound (also available for download from the Microsoft Web site).

When you launch Sonar, you will see the screen that’s shown in Figure A. This screen asks you which domain you are monitoring, the replica set for the domain, and the refresh rate. In this case, I am using the Domain System Volume (SYSVOL) as the replica set. The Domain System Volume holds files that Active Directory relies on, such as Group Policy files and logon scripts. Consequently, I am monitoring SYSVOL replication between my domain controllers.

Figure A
Select your replica set and domain.

You might have also noticed that the Refresh Rate is set by default to one hour. Yes, that seems a little long, but don’t worry about it too much. You can manually refresh the screen any time you want. The tool uses such long refresh intervals because it places a lot of traffic on your network during refreshes. The amount of traffic generated by Sonar also means that it’s a good idea not to have multiple copies of Sonar running simultaneously.

In Figure A, there is a check box that you can use to initially display only highly connected (hub) machines. Remember that the FRS depends on unidirectional connectors between machines. If you select this check box, then only servers with multiple connectors attached to them will be displayed. This option isn’t appropriate when examining the Active Directory, because of the way that domain controllers connect to one another.

This would also be a good time to point out that although it’s common to diagnose an FRS problem at the connector level, Sonar displays information at the server level rather than at the connector level. This accomplishes two things. First, it prevents information overload since each server could potentially have a dozen or more connectors. Second, it makes the information easier to digest. You can tell at a glance which servers are having problems and then drill down from there.
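To make the difference between the two levels concrete, here is a small Python sketch that rolls connector-level records up into one row per server. The server names and backlog figures are invented, and the way Sonar itself gathers and combines this data is not described here. The sketch only illustrates the idea of a server-level summary, borrowing the OutConnections, BacklogConnections, and BacklogFiles column names that Sonar displays.

```python
from collections import defaultdict

# Hypothetical connector-level records:
# (source server, destination server, backlogged files on that connector).
CONNECTIONS = [
    ("DC1", "DC2", 0),
    ("DC1", "DC3", 12),
    ("DC2", "DC1", 0),
    ("DC3", "DC1", 0),
]


def summarize_by_server(connections):
    """Collapse outbound connector records into one summary row per source server."""
    summary = defaultdict(
        lambda: {"OutConnections": 0, "BacklogConnections": 0, "BacklogFiles": 0}
    )
    for source, _destination, backlog in connections:
        row = summary[source]
        row["OutConnections"] += 1
        if backlog > 0:
            row["BacklogConnections"] += 1
        row["BacklogFiles"] += backlog
    return dict(summary)


if __name__ == "__main__":
    for server, row in sorted(summarize_by_server(CONNECTIONS).items()):
        print(server, row)
```

Four connector records become three server rows, and only DC1 stands out as needing a closer look.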

The last elements of the dialog box shown in Figure A are the View Results and Load Query buttons. Don’t worry about the Load Query button for now. Since you have not yet saved any queries, there is nothing to load. On the other hand, clicking the View Results button takes you to the main Sonar interface, shown in Figure B.

Figure B
This screen does not mean that everything is working correctly.

Don’t be deceived into thinking that everything is OK if you see the screen shown in Figure B. You will notice in the figure that the server names are listed in the Member column, but all other columns are blank. It’s easy to assume that this screen means that everything is OK because if you look at the Filters drop-down list, you will notice that All Rows is selected.

If you look at the Columns drop-down list, you will see that Error Conditions is selected. It’s easy to assume that this means that Sonar is showing all of the servers in the domain that belong to the previously selected replica set and is reporting any existing error conditions. Unfortunately though, you don’t see the true picture until you click the Refresh All button.

As you look at the Error Conditions, the first thing that you need to know is that not all errors point to a problem with your File Replication Service. For example, take a look at Figure C. In this figure, the DataCollectionState is showing Failed for two of my servers. This does not mean that the FRS has failed. It simply means that Sonar has failed to get enough information from the servers to be able to tell whether there’s a problem or not.

Figure C
If the DataCollectionState is showing a failure, it doesn’t necessarily mean that FRS has failed.

To determine why this failure has occurred, you need a little more information. Keep in mind that right now, Sonar is only showing information related to error conditions. If you select All Columns from the Columns drop-down list, you will see some specific information about the failure.

In this case, I am interested in the DataCollectionError column. This column is showing an SCM error for server Brien and an FRSSets error for server cartman. Obviously, these messages are pretty cryptic. The SCM error means that Sonar is having trouble communicating with the Service Control Manager on the remote machine. The FRSSets error indicates that there was a failure of the File Replication Service Sets Remote Procedure Call Interface on the remote machine.

Other errors that you might get in this column also refer to a problem with Sonar rather than a problem with FRS. These errors include:
  • PerfCtr: Failed to read performance counter. This is a known bug in Windows 2000 SP2 and Windows Server 2003. It can be fixed with a simple refresh.
  • Registry: Failed to read the remote registry
  • DS: Failed to query the Active Directory
  • Time Zone: Failed to read the time zone information from the remote machine
  • Proc: Failed to read process information from the remote machine
  • WMI: Failed to receive information from WMI
  • EventLog: Could not read information from the remote machine’s event logs
  • Sysvol: Couldn’t connect to the remote machine’s SYSVOL share
  • FRSVer: Failure of the FRS version RPC interface
  • FRSInlog: Failure of the FRS Inlog RPC Interface
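If you find yourself looking these codes up repeatedly, a small lookup table can help. The following Python sketch simply transcribes the codes and meanings given in this article, including the SCM and FRSSets codes discussed earlier. The explain function is a hypothetical helper of my own, not part of Sonar.

```python
# Codes and meanings transcribed from this article.
DATA_COLLECTION_ERRORS = {
    "SCM": "Trouble communicating with the Service Control Manager on the remote machine",
    "FRSSets": "Failure of the FRS Sets RPC interface on the remote machine",
    "PerfCtr": "Failed to read performance counter (a refresh may clear it)",
    "Registry": "Failed to read the remote registry",
    "DS": "Failed to query the Active Directory",
    "Time Zone": "Failed to read the time zone information from the remote machine",
    "Proc": "Failed to read process information from the remote machine",
    "WMI": "Failed to receive information from WMI",
    "EventLog": "Could not read information from the remote machine's event logs",
    "Sysvol": "Couldn't connect to the remote machine's SYSVOL share",
    "FRSVer": "Failure of the FRS version RPC interface",
    "FRSInlog": "Failure of the FRS Inlog RPC interface",
}


def explain(code):
    """Return the meaning of a DataCollectionError code, ignoring case and spaces."""
    key = code.replace(" ", "").lower()
    for name, meaning in DATA_COLLECTION_ERRORS.items():
        if name.replace(" ", "").lower() == key:
            return meaning
    return "Unknown code: " + code


if __name__ == "__main__":
    for code in ("SCM", "FRSSets", "timezone"):
        print(code, "->", explain(code))
```

Every entry in the table describes a failure in Sonar’s data collection, not a confirmed FRS failure.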

In case you are wondering, the SCM error ended up being caused by a failed NIC in one of my servers. After replacing the NIC, the error went away. The FRSSets problem was not so easy to fix and was unresolved by the deadline for this article.

Filters and Columns
So far you have had a brief overview of how Sonar works. Now, I want to take a closer look at the filters and columns. Up to this point, we have filtered All Rows using the Error Conditions columns and All Columns. As you have seen in the first few figures, when you select a set of columns, Sonar displays the columns with information related to that particular subset of information. For example, when I selected the Error Conditions columns, Sonar displayed about a dozen columns that were specifically related to error conditions on the various servers.

As odd as it sounds though, the Error Conditions columns aren’t really designed to help you track down errors. Instead, the error conditions merely report statistics related to errors that might have occurred on various servers. For example, on my test network, two of the three domain controllers in my Test domain were not reporting any errors, yet their data was reported in the error conditions columns.

If you really want to see which servers have errors, then you will want to use the Filters to figure out where the errors exist. Once you know where the errors exist, you can use the Columns drop-down list to display information on the condition causing the errors.

To get an idea of how this works, imagine that you suspected that some of your servers might be backlogged because of replication problems. You could find out which servers had the problems by selecting the Backlogged Members filter from the Filters drop-down list. If there were an excessive number of backlogged servers, you could select the Worst Backlogged Members option. When one of these filters is selected, only the servers that are backlogged will be displayed. You can then select the Backlog option from the Columns drop-down list to display information pertinent to backlogging.
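Conceptually, a filter chooses which rows (servers) you see, while a column set chooses which fields you see for those rows. The Python sketch below mimics that two-step process on invented data. The row values, including the OK state, are placeholders, and the rule used for the worst backlogged members (the largest BacklogFiles count) is my own assumption, since the exact criteria that Sonar applies are not covered here.

```python
# Hypothetical rows, one per server; every value is invented for illustration.
ROWS = [
    {"Member": "DC1", "Site": "HQ", "DataCollectionState": "OK",
     "OutConnections": 2, "BacklogConnections": 1, "BacklogFiles": 12,
     "FRSState": "Active"},
    {"Member": "DC2", "Site": "HQ", "DataCollectionState": "OK",
     "OutConnections": 1, "BacklogConnections": 0, "BacklogFiles": 0,
     "FRSState": "Active"},
    {"Member": "DC3", "Site": "Branch", "DataCollectionState": "OK",
     "OutConnections": 1, "BacklogConnections": 1, "BacklogFiles": 40,
     "FRSState": "Active"},
]

# A column set in the spirit of the Backlog option.
BACKLOG_COLUMNS = ["Member", "Site", "DataCollectionState",
                   "OutConnections", "BacklogConnections", "BacklogFiles"]


def backlogged_members(rows):
    """Filter step: keep only the servers that have backlogged files."""
    return [row for row in rows if row.get("BacklogFiles", 0) > 0]


def worst_backlogged_members(rows, limit=1):
    """Filter step: keep the servers with the most backlogged files (assumed rule)."""
    ranked = sorted(backlogged_members(rows),
                    key=lambda row: row["BacklogFiles"], reverse=True)
    return ranked[:limit]


def select_columns(rows, columns):
    """Column step: keep only the named fields for each remaining row."""
    return [{name: row.get(name) for name in columns} for row in rows]


if __name__ == "__main__":
    for row in select_columns(backlogged_members(ROWS), BACKLOG_COLUMNS):
        print(row)
```

If no server has a backlog, the filter step returns an empty list, which matches what an empty server list in Sonar means.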

If you look at Figure D, you will see that there are no servers on the list. This is because the Backlogged Members filter is selected and there are no servers on my network that are currently backlogged. While you are looking at Figure D, take a look at the various columns. The Member, Site, and DataCollectionState columns are used in each column set. However, columns for OutConnections, BacklogConnections, BacklogFiles, LongJoinCycles, SharingViolations and VVJoinsActiveOutbound now appear in place of columns related to errors.

Figure D
Select the error condition from the Filters list and then choose what columns you want to look at.

If there were a backlog problem, you could use the stats shown on this page to help you to figure out the problem. These statistics are only counters. For example, the BacklogFiles column would display a counter showing how many files are backlogged. It would not list the names of the backlogged files.

As I have explained, the filters allow you to filter the server list by error condition. The filter list is short and fairly self-explanatory, assuming that you have a good understanding of how the File Replication Service works. If you need to know more about the FRS, then check out my recent article on the File Replication Service.

Likewise, the various column groupings are fairly self-explanatory. What aren’t so obvious are the data columns themselves. Some of these columns have some pretty cryptic names. Furthermore, the columns are not fully expanded by default, meaning that most of the column names are partially hidden.

To make the contents of each column easier to understand, I have written a brief description of each one below. If you want to follow along, select All Columns.
  • Member: The name of the server that the row applies to
  • DNSSuffix: The DNS suffix that applies to the server that you are looking at
  • Domain: The name of the domain that the server in question belongs to
  • Site: The site that the server resides in
  • DataCollectionState: An indication of whether or not Sonar was able to collect data from the server (I have included a more detailed explanation earlier in this article.)
  • DataCollectionError: The reason why the DataCollectionState reports a failure
  • UpdateTime: The time when the data was most recently updated
  • SCMState: Displays whether or not the Windows Service Control Manager is running the FRS
  • FRSState: The State of the FRS (The state should normally be Active. If you see the FRSState set to Stopped, Error, or JRNL_WRAP_ERROR, then it means that you have a problem. The JRNL_WRAP_ERROR relates to corruption within the NTFS Change Journal.)
  • ReplicaPath: The path to the replica
  • StagingPath: The path to the server’s staging area
  • ServiceStartTime: The last time that FRS was restarted
  • OutConnections: The number of outbound connections to other replication partners (This counter will help you to determine whether or not the server is acting as a replication hub.)
  • InConnections: The number of inbound connections from replication partners to this server
  • OutJoinedConnections: The number of outbound connections currently capable of replicating
  • InJoinedConnections: The number of inbound connections on a server that are currently capable of replicating
  • LastOutJoinInterval: How long it has been since the last outbound partner joined and replicated with this server
  • VerCompiledOn: The date that the FRS code was compiled on
  • VerLatestChanges: The member’s FRS version string
  • BacklogConnections: Comparing this number to the OutConnections will tell you how many of the server’s outbound connections are backlogged
  • BacklogConnectionsDelta: The change in the number of backlogged connections
  • BacklogFiles: The number of backlogged files on the server awaiting replication
  • BacklogFilesDelta: The change in the number of backlogged files
  • USNJournalSize: The size of the USN change journal (The default setting is 0, which corresponds to a default value of either 32 MB [Windows 2000 SP2 and earlier] or 128 MB [Windows 2000 SP3 and later]. Microsoft recommends having 128 MB of change journal space for every 100,000 files in the replica set.)
  • Burflags: The number of backup and restore flags that are active on the server. (This counter indicates how many files have been authoritatively restored and need to be replicated.)
  • VVJoinsActiveOutbound: The number of machines that are performing an initial synchronization with replication partners (You could look at this counter for all servers if any of your servers have backlogs. An initial synchronization anywhere on the network can cause temporary backlogs for other servers.)
  • SharingViolations: The number of files that have been replicated to the server from other replication partners, but that have not yet been put into place because the files are in use locally
  • SYSVOLShared: Reports on whether or not the SYSVOL is shared (If the replica set is SYSVOL, then make sure that this column is not set to Not Shared or Not a Junction. Otherwise, there is a SYSVOL related problem.)
  • DiskSpaceReplicaRoot: The number of MB of free space on the replica root
  • DiskSpaceStagingRoot: The number of MB of free space in the server’s staging area
  • ExcessiveReplicationCycle: A counter of files that have been touched by a process, but not updated (This should be zero.)
  • LongJoinCycle: Indicates how long it took a server to join with replication partners (Normal values are two or three, so look for excessively high numbers in this column as a potential indication of a problem. Keep in mind though that it’s normal to have high numbers if a server is separated by a slow link.)
  • HugeFileCycle: Indicates the number of times in the last 24 hours that a file couldn’t be replicated because the file was bigger than the staging area (This counter should also be zero. If this is anything other than zero, then you’ll need to enlarge your staging area.)
  • StagingFullCycle: Indicates the number of files that could not be replicated in the last 24 hours because the staging area was already full (This counter should also be zero. If this counter is anything other than 0, you will want to consider enlarging the staging area.)
  • RefreshInterval: How often Sonar data is refreshed
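
Several of the descriptions above amount to simple rules of thumb: certain counters should be zero, FRSState should be Active, and the change journal should grow with the number of files. The Python sketch below collects those rules in one place. It assumes that you have copied a server’s values into a dictionary by hand (nothing here reads data from Sonar), and rounding the journal size up to whole 100,000-file blocks is my own assumption.

```python
import math


def recommended_journal_mb(file_count):
    """Apply the 128 MB per 100,000 files guideline, rounding up to whole blocks."""
    blocks = max(1, math.ceil(file_count / 100_000))
    return blocks * 128


def row_warnings(row):
    """Return a list of warnings for one server row, based on the rules above."""
    warnings = []
    if row.get("DataCollectionState") == "Failed":
        warnings.append("Sonar could not collect data: "
                        + str(row.get("DataCollectionError")))
    if row.get("FRSState", "Active") != "Active":
        warnings.append("FRSState is " + str(row["FRSState"]))
    if row.get("SYSVOLShared") in ("Not Shared", "Not a Junction"):
        warnings.append("SYSVOL problem: " + row["SYSVOLShared"])
    # These three counters should all be zero on a healthy server.
    for column in ("ExcessiveReplicationCycle", "HugeFileCycle", "StagingFullCycle"):
        if row.get(column, 0) != 0:
            warnings.append(column + " is " + str(row[column]) + " (expected 0)")
    return warnings


if __name__ == "__main__":
    # Hypothetical values typed in from the Sonar display.
    sample = {"Member": "DC3", "FRSState": "JRNL_WRAP_ERROR", "HugeFileCycle": 2}
    for warning in row_warnings(sample):
        print(sample["Member"] + ": " + warning)
    print("Journal for 250,000 files:", recommended_journal_mb(250_000), "MB")
```

A row that triggers no warnings is not guaranteed to be healthy. The sketch covers only the conditions spelled out in the column descriptions.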