
Identify and fix the unusual lost delayed-write data error

Just when you think you've stopped all data loss on your Windows 2000 server, it seems like another bug pops up. We'll show you how to identify and fix a potentially perplexing data loss problem, the lost delayed-write data error.


When you think of data loss, you probably think of the types of problems that occur when a hard disk goes bad or becomes corrupted. Data can also be lost when a power failure or a system crash strikes before you’ve saved that all-important document. However, there’s another type of data loss that’s much more mysterious, and potentially a lot more frustrating: the lost delayed-write data error.

How would you feel if your system were running perfectly, and then you received a message stating that the system had simply lost your data? No explanation, no apologies; just a message bluntly telling you that your data is gone for good and that you’d better have a backup. In this Daily Drill Down, I’ll discuss what a lost delayed-write data error is, and how to prevent it or recover from it.

Lost delayed-write data errors occur in the write phase
The first time that I ever saw a delayed-write data failure, I was baffled. After all, hard disks have been around a long time and are a fairly mature technology. I wondered how a system that’s working perfectly one moment could just suddenly start losing data. After doing some research, I learned that delayed-write data failures are usually related to the operating system or the network rather than to the hardware itself.

Most lost delayed-write data errors occur for the same reason—problems in the write phase. As you may already know, Windows NT, 2000, XP, and .NET all ride on top of a kernel. The kernel and the hardware abstraction layer control access to the system’s actual hardware. When an application needs to access a hardware device, such as the hard disk, Windows intercepts the hardware device call, thus preventing the application from accessing the hardware directly.

Meanwhile, Windows is receiving similar device calls from other applications and from the operating system itself. Because it’s being bombarded by all these requests, Windows must schedule device access. Although Windows uses several different methods to schedule disk time, it’s quite common to place data that an application has asked to be written to the hard disk in a memory cache until the system actually has time to write the cache contents to disk. When data is received through the network, it’s also common for Windows to place the data in a cache prior to writing it to the hard disk. Unfortunately, the cache represents a single point of failure.

When a local application needs to write data to the hard disk, Windows may place the data in the cache and tell the application that the data has been written. Windows itself handles disk I/O. The application assumes that Windows has done its job saving the data, so the application continues on its merry way. If anything were to happen to the data from the time that it enters the cache until the time Windows writes the data to disk, a lost delayed-write error could result.

A variety of things could cause the error. Some of the most common causes include:
  • The machine runs out of disk space prior to the cache being emptied.
  • The cache memory becomes corrupt.
  • A power failure causes the server to crash.
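To make the mechanism concrete, here is a minimal Python sketch of a write-behind cache. It is a toy model, not how the Windows cache manager is implemented: the class names (ToyDisk, WriteBehindCache) and the 8-byte disk are invented for illustration. It shows the first cause in the list above, where every write is acknowledged right away but the disk fills up before the cache is emptied.

```python
class DiskFullError(Exception):
    """Raised by the toy disk when it has no room for more data."""


class ToyDisk:
    """Hypothetical stand-in for a hard disk with a fixed capacity."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.stored = []  # chunks that actually reached the "disk"

    def used(self):
        return sum(len(chunk) for chunk in self.stored)

    def write(self, chunk):
        if self.used() + len(chunk) > self.capacity_bytes:
            raise DiskFullError("no space left on toy disk")
        self.stored.append(chunk)


class WriteBehindCache:
    """Acknowledges writes immediately and writes them to disk later."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = []  # acknowledged but not yet on disk
        self.lost = []     # acknowledged data that could never be written

    def write(self, chunk):
        # The caller is told the write succeeded as soon as the data is cached.
        self.pending.append(chunk)
        return True

    def flush(self):
        # Drain the cache; anything the disk rejects is a lost delayed write.
        while self.pending:
            chunk = self.pending.pop(0)
            try:
                self.disk.write(chunk)
            except DiskFullError:
                self.lost.append(chunk)
        return self.lost


disk = ToyDisk(capacity_bytes=8)
cache = WriteBehindCache(disk)
acknowledged = [cache.write(b"abcd"), cache.write(b"efgh"), cache.write(b"ijkl")]
lost = cache.flush()
print("acknowledged:", acknowledged)
print("on disk:", disk.stored)
print("lost:", lost)
```

The application in this sketch received a success result for all three writes, yet the third chunk never reached the disk. That gap between the acknowledgment and the actual write is where the error comes from.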

It’s also fairly common to have network-related delayed-write data errors. These errors tend to work exactly the same way, except that the data is coming from a network client to the cache rather than from a local application. Network-based errors provide an additional level of complexity since there’s the chance that the client generated the data incorrectly or that the data could have been corrupted during transit. However, CRC checks usually will catch data that was corrupted in transit, and the client can simply regenerate the corrupt packets.

SMB signature-related problems
One of the biggest causes of Windows 2000–related, network-caused delayed-write errors is that the client’s network redirector doesn’t calculate the SMB (server message block) signature correctly. There is a fix for the problem, but before applying the fix, it’s critical that you verify that this is exactly the problem and not a variant of another problem. Begin by verifying that the server is running Windows 2000 and Service Pack 2 or higher. At the time that this article was written, Service Pack 3 existed, but Service Pack 3 doesn’t directly fix the problem.

Once you’ve verified the service pack level, you must look at the server’s System log using Event Viewer. Begin by looking for an event with an ID number of 50 and a source of MrxSMB. If you find such an event, check the description for the following text:
{Delayed-Write Failed}
Windows was unable to save all the data for the file x.
The data has been lost.
This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.


At the bottom of the Event Properties window, click the Words radio button and check out the status code in the Data pane. The code should be C0000022 (this translates to STATUS_ACCESS_DENIED).

Before continuing, verify that the event contains all of the elements that I’ve described above. If only some of the elements exist, then you have a different problem, and this fix may make it worse.
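If you copy the event details into a script, the all-or-nothing check can be written down explicitly. The Python sketch below is only an illustration: the dictionary layout (event_id, source, description, data_words) is my own assumption, not a format that Event Viewer exports, and the sample event is made up. The status code is written as the 32-bit value 0xC0000022, which is STATUS_ACCESS_DENIED.

```python
EXPECTED_EVENT_ID = 50
EXPECTED_SOURCE = "mrxsmb"              # compared case-insensitively
EXPECTED_TEXT = "{Delayed-Write Failed}"
EXPECTED_STATUS = 0xC0000022            # STATUS_ACCESS_DENIED


def missing_elements(event):
    """Return the names of the checks that fail; an empty list is a full match."""
    failures = []
    if event.get("event_id") != EXPECTED_EVENT_ID:
        failures.append("event_id")
    if str(event.get("source", "")).lower() != EXPECTED_SOURCE:
        failures.append("source")
    if EXPECTED_TEXT not in event.get("description", ""):
        failures.append("description")
    if EXPECTED_STATUS not in event.get("data_words", []):
        failures.append("status_code")
    return failures


# Hypothetical event, typed in by hand from the Event Properties window.
sample = {
    "event_id": 50,
    "source": "MrxSMB",
    "description": "{Delayed-Write Failed} Windows was unable to save "
                   "all the data for the file x. The data has been lost.",
    "data_words": [0x00000000, 0xC0000022],
}
print(missing_elements(sample))
```

An empty list means that every element is present. Anything else names the element that doesn't match, which is your cue to stop and treat the event as a different problem.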

If you determine that this is indeed the same problem that you’re having, you have a couple of different options to deal with it. The first option is to call Microsoft’s product support service. Microsoft has developed a hot fix for this problem, but hasn’t included the hot fix in the most recent service pack because the fix is still being tested. You can acquire the fix by contacting Microsoft and asking for the following files:
  • MRXSMB.SYS version 5.0.2195.5754. The file should be 371,344 bytes in size and be date/time stamped as 11:10 on May 8, 2002.
  • RDBSS.SYS. This file is 131,984 bytes in size and was date/time stamped on April 4, 2002 at 16:47.

Microsoft product support
You can contact Microsoft product support for more help. There’s a charge for telephone-based support, but I’ve been told that Microsoft’s policy is to cancel the charges if you’re only asking for a patch rather than for actual assistance with a problem.

If you’re unable to get a copy of the hot fix, or if the hot fix causes problems with other things on your system, there’s a workaround. Once again, you should perform the workaround only if you’re experiencing the exact problem I discussed above. Furthermore, this workaround involves editing the registry. Modifying the registry incorrectly can destroy Windows and/or your applications. Therefore, make sure that you have a full system backup before attempting this procedure.

Open the Registry Editor by entering the REGEDIT command at the Run prompt. When the Registry Editor opens, navigate through the registry tree to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\lanmanserver\parameters. Now, double-click the EnableSecuritySignature value to open its associated dialog box. This value controls whether or not the server uses SMB signatures. Since SMB signatures are causing the problem, you can disable them. Simply replace the 1 in the Value Data field with a 0, click OK, and close the Registry Editor.

Extreme file system stress
Another form of the lost delayed-write data error can occur because the operating system’s file system can’t handle the extremely heavy workload being placed on it. What tends to happen is that during periods of very heavy disk activity, a server thread and a redirector thread may deadlock. This problem tends to be most common in Windows 2000 Datacenter Server environments, especially with the Enterprise Edition of SQL Server 2000. However, the error can (and sometimes does) occur in any version of Windows 2000.

When such an error occurs, you may see the following error message:
Lost Delayed-Write Data

The system was attempting to transfer file data from buffers to Filename.
The write operation failed, and only some of the data may have been written to the file.


Typically, this message is also accompanied by an event being recorded in the system log. Like the previous error I described, the Event ID is 50 and the Event Source is MRXSMB. The event’s description will say:
Description: {Lost Delayed-Write Data} The system was attempting to transfer file data from buffers to \Device\LanmanRedirector. The write operation failed, and only some of the data may have been written to the file.

Another thing to check when attempting to verify that you’re having this particular error is the event’s Data section. If you click the Words radio button in the Data section, you should look for the code C000020C.
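If you would rather not scan the Data pane by eye, you can paste its contents into a short script. The following Python sketch assumes that the Words view shows an offset followed by a colon and then groups of eight hex digits; that layout and the sample dump are my assumptions, so adjust the parsing to what you actually see. Status codes are 32-bit values, so the sketch searches for the eight-digit code C000020C.

```python
import re

LOST_DELAYED_WRITE_STATUS = 0xC000020C  # the code to look for in this event


def words_from_dump(text):
    """Extract every 8-hex-digit word from pasted Data-pane text."""
    words = []
    for line in text.splitlines():
        # Drop an optional "0000:" style offset prefix before scanning.
        _, _, rest = line.rpartition(":")
        for match in re.findall(r"\b[0-9A-Fa-f]{8}\b", rest):
            words.append(int(match, 16))
    return words


# Made-up dump for illustration only; real events will contain other values.
sample_dump = """\
0000: 00040000 00540001 00000000 c000020c
0008: 00000000 00000000 00000000 00000000
"""

words = words_from_dump(sample_dump)
print(LOST_DELAYED_WRITE_STATUS in words)
```

Because the pattern only accepts groups of exactly eight hex digits, a mistyped nine-digit value is ignored rather than matched.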

If you locate an event that contains all of (not just some of) the specifics that I’ve mentioned, you can fix the problem by acquiring a hot fix from Microsoft. Once again, the hot fix hasn’t been incorporated into the latest service pack, but you can get the fix (possibly for free) by contacting Microsoft customer support.

When you call Microsoft, you must tell them the specific patch that you need. This particular patch consists of six different files:
  • NTKRNLMP.EXE version 5.0.2195.3573. The file is 1,685,440 bytes in size and is dated May 4, 2001 at 11:48.
  • NTKRNLPA.EXE version 5.0.2195.3573. This 1,685,248-byte file carries a date/time stamp of May 4, 2001 at 11:49.
  • NTKRPAMP.EXE version 5.0.2195.3573. This file is 1,705,856 bytes in size and has a date/time stamp of May 4, 2001 at 11:49.
  • NTOSKRNL.EXE version 5.0.2195.3573. This 1,663,360-byte file is date/time stamped May 4, 2001 at 11:48.
  • SRV.SYS version 5.0.2195.3444. This 237,072-byte file was date/time stamped at 11:46 on April 2, 2001.
  • SRVSVC.DLL version 5.0.2195.3407. This 73,488-byte file was stamped at 14:25 on May 4, 2001.
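Once you have the files, a quick size comparison can confirm that you received the right ones. The Python sketch below is a rough check under stated assumptions: it compares only file sizes (not version numbers or date stamps), the byte counts are copied from the list above, and the function name check_sizes and the folder argument are hypothetical.

```python
import os

# File names and byte counts as listed above for this hot fix.
EXPECTED_SIZES = {
    "NTKRNLMP.EXE": 1685440,
    "NTKRNLPA.EXE": 1685248,
    "NTKRPAMP.EXE": 1705856,
    "NTOSKRNL.EXE": 1663360,
    "SRV.SYS": 237072,
    "SRVSVC.DLL": 73488,
}


def check_sizes(folder, expected=EXPECTED_SIZES):
    """Map each expected file name to 'ok', 'missing', or 'size mismatch'."""
    report = {}
    for name, size in expected.items():
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            report[name] = "missing"
        elif os.path.getsize(path) != size:
            report[name] = "size mismatch"
        else:
            report[name] = "ok"
    return report
```

Calling check_sizes on the folder where you unpacked the hot fix returns one entry per file. A matching size doesn't prove that the version is right, so treat an "ok" as a sanity check and still confirm the version number in each file's properties.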

Windows NT-related errors
Up to now I’ve described delayed-write errors in Windows 2000, but Windows NT is far more susceptible to these types of errors than Windows 2000 is. This is due to the way that Windows NT handles network file caching. In a Windows NT environment, when a machine needs to send a file to a network server, the file isn’t immediately transmitted. Instead, the file is placed into a network cache, which is later flushed to the redirector. If that flush fails, you’ll see one of these two messages:
Event ID 26:
Application popup: System process-lost Delayed-Write data: the system was attempting to transfer file data from buffers to <filename path>. The write operation failed and only some of the data may have been written to the file.


Or
Event ID 26:
Application popup: System process-lost Delayed-Write data: the system was attempting to transfer file data from buffers to <network share>. The write operation failed and only some of the data may have been written to the file.


You may also see some corresponding entries in the system log. Usually you’ll see one of these two events:
  • Event 3013: The redirector has timed out to <servername>.
  • Event 8007: NetWare redirector timed out its request to server <servername>.

You can fix these problems by modifying the registry in a manner that disables network redirector caching. If you decide to make this modification, there are three things that you must remember:
  • These fixes only apply to Windows NT.
  • These fixes may slow down a machine’s network performance since network data will no longer be cached.
  • Working with the registry is dangerous and can destroy Windows and/or your applications if modified incorrectly. Therefore, make a backup before you begin.

To edit the registry, open the Registry Editor by entering the REGEDT32 command at the Run prompt. When the Registry Editor opens, navigate to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Rdr\Parameters key. Now, select the Add Value command from the Edit menu. Create a REG_DWORD value named UseWriteBehind, and set its data to 0. Using 0 disables write-behind caching of write-only files for the redirector, while using a value of 1 enables caching.

Next, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Lanmanworkstation\parameters and select the Add Value command from the Edit menu. Create a value named UtilizeNTCaching. The data type should be REG_DWORD, and the data value should be 0. A value of 0 disables the cache manager for the workstation service, while a value of 1 enables it.
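If you have to make the same change on several machines, you may prefer to prepare a .reg file instead of typing the values by hand. The Python sketch below builds the text of such a file for the two values described above. It is an illustration under assumptions: the helper name build_reg_file is hypothetical, the output uses the REGEDIT4 text format, and you should try the resulting file on a test machine that has a current backup before importing it anywhere else.

```python
def build_reg_file(entries):
    """Render (key_path, value_name, dword) tuples as REGEDIT4 text."""
    lines = ["REGEDIT4", ""]
    for key_path, value_name, dword in entries:
        lines.append("[" + key_path + "]")
        # DWORD data is written as eight lowercase hex digits.
        lines.append('"%s"=dword:%08x' % (value_name, dword))
        lines.append("")
    return "\r\n".join(lines)


RDR = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Rdr\Parameters"
WKS = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
       r"\Lanmanworkstation\parameters")

# 0 turns caching off for both values; 1 would turn it back on.
disable_caching = build_reg_file([
    (RDR, "UseWriteBehind", 0),
    (WKS, "UtilizeNTCaching", 0),
])
enable_caching = build_reg_file([
    (RDR, "UseWriteBehind", 1),
    (WKS, "UtilizeNTCaching", 1),
])
print(disable_caching)
```

Generating the matching enable_caching text at the same time gives you a way to undo the change if network performance suffers too much.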

Don’t delay in fixing the problem
Delayed-write data errors can be confusing when they first occur. However, when you know what causes them and where to look to verify that they’re occurring, you can formulate a plan of action. With that plan, you can avoid any delay in fixing the problem.

 
