What you need to know about how Windows Server 2003's File Replication Service works
One of the key components in any of the Windows Server operating systems is the File Replication Service (FRS). The File Replication Service has two main responsibilities: It keeps replicas within a Distributed File System (DFS) consistent, and it replicates Active Directory (AD) updates between domain controllers. In order to get the most benefit out of FRS and be able to troubleshoot it should problems arise, you need to understand how FRS works. Here's what you need to know.
Before I get into the hard-core inner workings of FRS, I want to go over a few basics of how FRS works. As you know, FRS is used to keep replicas up to date within a DFS tree and to keep the AD information within domain controllers synchronized. Technically, AD replication and file replication are two different mechanisms. However, they work so similarly that if you understand how one works, you'll basically understand the other. For the purposes of this article, I'll focus on file replication.
Like AD replication, file replication uses a multimaster model. This means that updates can occur independently on any server within the domain. There's no need for the updates to be made to a primary server and then be distributed out to other servers.
Two other important points are that FRS is multithreaded and that it works at the file level. Working at the file level means that individual changes within a file are not replicated: if even one byte within a file changes, the entire file must be replicated. Because the process is multithreaded, though, FRS can replicate changed files to multiple computers simultaneously. You never know what order changed files will arrive in, or which system the updated files will reach first. Although the replication cycle is scheduled, the update order is, for all practical purposes, random.
So what happens when a replicated file changes? If a user modifies a file that is configured to be replicated to one or more other servers, FRS waits until the user closes the file before it does anything. If a user still has a modified file open when a replication cycle occurs, the file will not be replicated because it's considered to be unmodified until it has been closed.
After the user closes the modified file, the NTFS file system makes a record in its change journal that the file has been modified. FRS relies on this change journal for its information. The nice thing about this is that even if the server goes down unexpectedly, replication of modified files is still possible when the server comes back up because FRS gets its file modification information from the change journal, which is a part of the file system.
FRS then waits for the next scheduled replication cycle. When that cycle occurs, FRS replicates the changed files to the rest of the servers in the replication set over TCP/IP. To keep the communication secure, FRS relies on authenticated Remote Procedure Calls (RPC) with Kerberos, and the files are encrypted before they are transmitted.
Whenever you're dealing with replica files, there's a chance for conflicts to occur. For example, suppose the same file exists on Server A and Server B. It's entirely possible that two users could modify the two different copies of the file at the same time. FRS does nothing to lock the other replicas when a replica is being modified. Furthermore, if two files are simultaneously modified, FRS does not attempt to merge the modifications. Instead, the most recent modification takes precedence.
FRS has a very interesting way of determining which file is the most recent. Suppose User 1 modified a file on Server A, and shortly thereafter, User 2 modified the same file on Server B. Obviously, User 2 made the most recent modification, and that's the one that should take precedence. However, because of replication latency, it's possible that User 2’s modification could reach Server C before User 1’s modification. You really don’t want Server C to consider User 1’s modification to be the most recent just because it arrived last, so FRS uses what’s known as the 30 Minute Rule.
The 30 Minute Rule basically states that if the replication cycle tries to overwrite a file on a replica with a modified version, and one version is at least 30 minutes older than the other, then the newer file takes precedence and the older file is discarded. If, on the other hand, the time stamps on the two files are within 30 minutes of each other, then FRS looks at the files’ version numbers to try to resolve the conflict.
The version number is incremented by one every time the file is modified. In the case of DFS, though, it’s entirely possible that two different versions of the same file could have the same version number. If this happens, FRS looks at the time stamp again but ignores the 30 Minute Rule, and the file with the most recent time stamp takes precedence.
By using this algorithm, the file modified by User 2 would overwrite the replica on Server C because it is newer. However, when the file modified by User 1 arrived, the time stamps and versioning would show that although it arrived later than the file modified by User 2, it was actually modified earlier. This means that the file modified by User 2 is newer, and the file modified by User 1 would be discarded.
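The decision sequence just described can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not FRS's actual code: the FileVersion type, the pick_winner function, and the sample times and version numbers are all invented for the example, and keeping the copy already on disk when the two versions tie exactly is my own assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The 30-minute window comes from the rule above; every name here is hypothetical.
THIRTY_MINUTES = timedelta(minutes=30)


@dataclass(frozen=True)
class FileVersion:
    origin: str           # label for where the change was made
    event_time: datetime  # time stamp of the modification
    version: int          # incremented by one on every modification


def pick_winner(current: FileVersion, incoming: FileVersion) -> FileVersion:
    """Return the copy that should survive on the receiving server."""
    # Step 1: time stamps at least 30 minutes apart, so the newer one wins.
    if abs(incoming.event_time - current.event_time) >= THIRTY_MINUTES:
        return max(current, incoming, key=lambda v: v.event_time)
    # Step 2: within 30 minutes, so the higher version number wins.
    if incoming.version != current.version:
        return max(current, incoming, key=lambda v: v.version)
    # Step 3: same version number, so the most recent time stamp wins.
    # Assumption: on an exact tie the copy already on disk is kept.
    return incoming if incoming.event_time > current.event_time else current


# The scenario from the text, with made-up times and version numbers.
original = FileVersion("original", datetime(2004, 1, 5, 9, 0), 4)
user1 = FileVersion("Server A", datetime(2004, 1, 5, 10, 0), 5)
user2 = FileVersion("Server B", datetime(2004, 1, 5, 10, 5), 5)

on_server_c = pick_winner(original, user2)   # User 2's change arrives first
on_server_c = pick_winner(on_server_c, user1)  # User 1's change arrives late
```

In this run, Server C still holds User 2's copy after both changes have arrived, even though User 1's copy was the last to show up.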
Before I move beyond the basics, I should point out that there is no user interface specifically for the FRS. AD replication is an automated system process. DFS file replication is controlled by the Distributed File System snap-in for the Microsoft Management Console.
FRS operation in detail
Now that you have a basic understanding of how FRS works, I want to discuss FRS operation in greater detail. Let's start with how FRS maintains a list of replication partners. This is one of the few areas where DFS and Active Directory differ. With AD, all domain controllers are automatically considered replication partners. The Knowledge Consistency Checker (KCC) runs periodically and checks the replication partners (domain controllers) to be sure that they're still online. If the KCC detects a failed connection or a domain controller that is down, it automatically adjusts the replication topology to the optimum configuration.
DFS, on the other hand, doesn’t use the KCC. Instead, you must define replication sets through the Distributed File System Snap-In. A replication set consists of computers and links. Replication links can transmit data in only one direction.
For example, if a replication link existed from Server A to Server B, data could flow only from A to B. If you wanted replication data to also be able to flow from B to A, you'd have to create a second link going in the opposite direction. Because replication links are unidirectional, Microsoft names the two ends of a link relative to each server. From the point of view of the server that’s sending the data, the link is outbound and the receiving server is its Outbound Partner. From the point of view of the server receiving the data, the link is inbound and the sending server is its Inbound Partner.
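A one-way link is easy to model as a (sender, receiver) pair. The Python sketch below is only a way to picture the idea; the Link type, the can_send function, and the server names are hypothetical and have nothing to do with how the snap-in stores its configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A hypothetical one-way replication link."""
    sender: str
    receiver: str


def can_send(links, source, target):
    """True only if a link exists in exactly this direction."""
    return Link(source, target) in links


# One link from Server A to Server B: data flows A -> B only.
links = {Link("ServerA", "ServerB")}
one_way = (can_send(links, "ServerA", "ServerB"),
           can_send(links, "ServerB", "ServerA"))

# Adding the reverse link lets data flow in both directions.
links.add(Link("ServerB", "ServerA"))
two_way = (can_send(links, "ServerA", "ServerB"),
           can_send(links, "ServerB", "ServerA"))
```

Because the pair is ordered, the link from A to B says nothing about B to A, which is why the second link has to be created explicitly.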
Now, let’s take a quick look at the overall replication process. When a user modifies a replica, the NTFS file system makes an entry in its change journal at the time when the file is closed. Meanwhile, FRS is monitoring the change journal. FRS makes a list of closed files and then filters the list so that it looks only at those files that exist within replicated shares.
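The filtering step can be pictured with a short Python sketch. It assumes the list of closed files is just a list of path strings and that each replicated share is identified by its root folder; the function name and the sample paths are invented for illustration, and the real service reads the NTFS change journal rather than a Python list.

```python
from pathlib import PureWindowsPath
from typing import Iterable, List


def filter_replicated(closed_files: Iterable[str],
                      replica_roots: Iterable[str]) -> List[str]:
    """Keep only the closed files that live under a replicated folder."""
    roots = [PureWindowsPath(root) for root in replica_roots]
    kept = []
    for name in closed_files:
        path = PureWindowsPath(name)
        # PureWindowsPath compares case-insensitively, like Windows paths do.
        if any(root in path.parents for root in roots):
            kept.append(name)
    return kept


closed = [
    r"D:\Shares\Docs\plan.doc",
    r"C:\Temp\scratch.tmp",
    r"d:\shares\docs\sub\notes.txt",
]
to_replicate = filter_replicated(closed, [r"D:\Shares\Docs"])
```

Only the two files under D:\Shares\Docs survive the filter; the temporary file on C: is ignored.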
Next, a mechanism called the aging cache comes into play. The aging cache is a three-second timer whose sole purpose is to keep FRS from being bogged down when a file is changing rapidly. It ensures that a rapidly changing file is staged for replication only once every three seconds.
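Here's a rough Python sketch of a three-second throttle of this kind. It is an assumption-laden illustration: the AgingCache class and its should_stage method are hypothetical, and I'm assuming the first change in a window goes through while later ones are held back, which is only one way such a timer could behave.

```python
AGING_SECONDS = 3.0  # the three-second window described above


class AgingCache:
    """Hypothetical throttle: lets each file through at most once per window."""

    def __init__(self, window=AGING_SECONDS):
        self.window = window
        self._last_staged = {}  # path -> time the file was last let through

    def should_stage(self, path, now):
        """Return True if `path` may be staged at time `now` (in seconds)."""
        last = self._last_staged.get(path)
        if last is not None and now - last < self.window:
            return False  # changed again too soon, so skip this one
        self._last_staged[path] = now
        return True


cache = AgingCache()
decisions = [
    cache.should_stage("report.doc", 0.0),  # first change
    cache.should_stage("report.doc", 1.0),  # one second later
    cache.should_stage("report.doc", 3.0),  # window has elapsed
    cache.should_stage("budget.xls", 3.5),  # a different file
]
```

The time is passed in rather than read from the clock so the behavior is easy to follow: the change at one second is suppressed, while the one at three seconds and the change to the other file go through.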
The server then writes an entry about the change into its inbound log, recording the filename and the date and time of the modification. The inbound log is normally used to keep track of modifications that have occurred on other replication partners so that those changes can be applied locally. In effect, it tells the server that a change exists.
Even though this change occurred on the local replica, it's still written to the server’s inbound log. An entry is also written to the ID table so that the system can recover if a crash occurs. I'll talk more about this table later on.
The server then writes a copy of the changed file to a staging directory. This directory is an area on the local server that is designed to temporarily store files until they can be replicated to the other servers. The reason that data is staged, rather than simply being replicated from its original location, is that the original file could be accessed (locked) by a user at any time. On the other hand, by transmitting a copy instead of the original file, Windows can guarantee that the copy is not in use. While in the staging area, Windows also encapsulates the file and replicates the NTFS attributes that go with the file.
Next, the server updates its outbound log. The outbound log is a log file containing a list of outbound replicas. Depending on the network topology, the items in the outbound log can be generated locally or by an inbound partner. An inbound partner would be able to place items in the server’s outbound log if the server were responsible for retransmitting the file to another replication partner. Finally, the server transmits a change notification message to another replication partner.
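To keep the order of these steps straight, here is a toy in-memory model in Python. Nothing in it reflects how FRS stores its data: the Server class, its attributes, and the dictionary used for a change entry are all hypothetical stand-ins for the logs, the ID table, and the staging directory.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for the sending server; every attribute is hypothetical."""
    name: str
    inbound_log: list = field(default_factory=list)
    id_table: dict = field(default_factory=dict)
    staging: dict = field(default_factory=dict)
    outbound_log: list = field(default_factory=list)
    notifications: list = field(default_factory=list)

    def process_local_change(self, filename, contents, event_time, partners):
        change = {"filename": filename, "event_time": event_time}
        self.inbound_log.append(change)           # 1. entry in the inbound log
        self.id_table[filename] = change          # 2. entry in the ID table
        self.staging[filename] = bytes(contents)  # 3. copy placed in staging
        self.outbound_log.append(change)          # 4. entry in the outbound log
        for partner in partners:                  # 5. change notification
            self.notifications.append((partner, filename))


server_a = Server("ServerA")
server_a.process_local_change(
    "plan.doc", b"updated text", "2004-01-05 10:00", partners=["ServerB"]
)
```

The staged copy is a separate bytes object, which mirrors the point made earlier: what gets transmitted is the copy, not the original file that a user might have open.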
The other server receives the change notification and uses an algorithm to decide whether the changed file is newer than its current version. Assuming that the changed file is newer, the server asks the computer containing the changed file for the file. When the file is received, the server copies the file to its own staging directory while it updates its outbound log file. The server uses a staging area so that users do not see the file as being locked while it is being downloaded from the other server. Finally, the received file is reconstructed within the staging area and then moved to its final location.
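The stage-then-move idea on the receiving side is a common pattern, and a small Python sketch shows why it helps. This is not FRS code: the install_via_staging function and the folder names are hypothetical, and the sketch assumes the staging folder and the final location are on the same volume so the move is a simple rename.

```python
import os
import tempfile


def install_via_staging(data, staging_dir, final_path):
    """Write the received bytes into staging, then move them into place."""
    fd, staged_path = tempfile.mkstemp(dir=staging_dir)
    try:
        # The slow part (receiving and writing) happens in the staging folder.
        with os.fdopen(fd, "wb") as staged:
            staged.write(data)
        # The final step is a single rename over the old copy.
        os.replace(staged_path, final_path)
    except BaseException:
        # Don't leave a half-written file behind in staging.
        if os.path.exists(staged_path):
            os.remove(staged_path)
        raise


with tempfile.TemporaryDirectory() as root:
    staging = os.path.join(root, "staging")
    os.mkdir(staging)
    final = os.path.join(root, "plan.doc")
    with open(final, "wb") as old_copy:
        old_copy.write(b"old contents")

    install_via_staging(b"new contents", staging, final)

    with open(final, "rb") as installed:
        installed_bytes = installed.read()
    leftover = os.listdir(staging)
```

Users working in the target folder only ever see the old file or the complete new one, never a partially downloaded copy.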
In the section above, I mentioned several log files and various tables. Understanding these tables is crucial to being able to use the various troubleshooting tools effectively. All of the various logs and tables are stored in Microsoft Jet Database format.
The default location of this database is %SYSTEMROOT%\NTFRS\JET\NTFRS.JDB. The JDB file is the actual Jet database file that contains the various tables. There are five tables in all, and each replication partner has its own independently maintained copy of these tables:
- Connection table
- Inbound log
- Outbound log
- Version vector table
- ID table
It has been my experience that most of the time when replication breaks down, it’s the result of a failed link or a server that’s down. However, when these simple causes don’t apply, the problem is almost always related to information found in one of these tables.
The first table in the database file is the Connection table. This is the table that keeps a record of all the inbound and outbound replication partners. Each link or partner connection uses a separate record within this table.
The next table is the inbound log, which contains all the change orders that have not yet been processed. This table’s records include the filename, the GUID of the change order, object ID, parent ID, event time, and version number.
The outbound log stores all of the change orders that are to be sent to other replication partners. The record structure of the outbound log is identical to that of the inbound log.
The fourth FRS table is the version vector table, which is used to determine how up to date each replica is. This table is updated every time an FRS context is replicated and whenever the outbound log fills up and wraps (the outbound log uses circular logging because it can grow to be very large if one of the replication partners is down).
The final table is the ID table, which maintains a list of all the files in the replica set. Records in the ID table include the filename, GUID, parent file ID, object ID, parent object ID, event time, and current version number.
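If it helps to see the record layouts side by side, the fields listed above can be written as two Python dataclasses. The field names follow the descriptions in this section, but the types are my guesses for illustration only; the real tables live in the Jet database and their column types aren't covered here.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from uuid import UUID, uuid4


@dataclass
class ChangeOrder:
    """One record in the inbound or outbound log (types are assumptions)."""
    filename: str
    change_order_guid: UUID
    object_id: UUID
    parent_id: UUID
    event_time: datetime
    version: int


@dataclass
class IdTableRecord:
    """One record in the ID table (types are assumptions)."""
    filename: str
    guid: UUID
    parent_file_id: int
    object_id: UUID
    parent_object_id: UUID
    event_time: datetime
    version: int


# Both logs hold the same kind of record, so one type serves for both.
order = ChangeOrder("plan.doc", uuid4(), uuid4(), uuid4(),
                    datetime(2004, 1, 5, 10, 0), 5)
inbound_log = [order]
outbound_log = [order]

CHANGE_ORDER_FIELDS = [f.name for f in fields(ChangeOrder)]
ID_TABLE_FIELDS = [f.name for f in fields(IdTableRecord)]
```

Laying the records out this way makes it easier to see which pieces of information, such as the event time and version number, feed the conflict resolution rules covered earlier.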
It's not as bad as it sounds
As you can see, the FRS is fairly complex. However, once you understand the information provided in this article, you should be able to use the various tools provided by Microsoft to troubleshoot FRS problems fairly easily.