[The opinions expressed here are mine alone, and not those of Google, Inc., my current employer.]
I don't often write about my day-to-day work, but sometimes I run across a problem so intransigent that finally fixing it feels like a triumph. If you take an engineering job in the software industry, this is the kind of thing you might end up working on. If you find this column fun and interesting, then you might be a good candidate to become a network engineer. Even if you don't, I hope you'll appreciate the insane level of detail network engineers have to know on your behalf, to make something as simple as "saving a file" work seamlessly across operating systems.
One of the remedies imposed on Microsoft after they lost the European Union workgroup-server antitrust case was the requirement to publish the full specifications for third-party software to interoperate with their operating systems. They are still in the process of doing this, but there are now thousands of pages of documentation out there, in theory fully specifying the Server Message Block/Common Internet File System (SMB/CIFS) protocol that Samba and Windows file servers implement. So surely anyone and their auntie (assuming your auntie is a network engineer :-) can now write their own SMB/CIFS server by just reading this copious documentation. After all, now that it's all documented, how hard can it be?
A bug I fixed this week illustrates why I still think Samba is the leading choice for interoperability between Windows and Linux/UNIX systems. It concerns a strange tale of Microsoft Office and the "Offline Files" remote synchronization feature. "Offline Files" in Microsoft Windows allows a user to save a version of a file they're working on from a remote file system on their local laptop, and have it re-synchronized to a server when they get back online.
A user of Samba reported a bug that showed conclusively that trying to synchronize a Microsoft Office file against a Samba server wasn't working. The Windows client "Sync Center" application kept telling the user that the file on the remote Samba disk had been changed since it was saved, and he knew this wasn't the case.
It got stranger. It only happened with Vista, not with XP or Windows 2003. It only happened with Microsoft Office 2003 (all other versions of Office worked fine). It only reliably happened with Microsoft Excel, no other Microsoft Office application. Have I mentioned how much I hate Microsoft Excel? I quake in fear whenever I see an Excel interoperability bug logged against Samba. That application is perverse in the things it will do to a remote file server.
I looked at my nice new shiny downloaded Microsoft documentation. There was nothing related to this problem in there. The document describing the precise behavior of an NTFS filesystem as seen over the wire from an SMB/CIFS server is yet to be finished. They're still working on it. OK, so let me check what happens when you use this version of Excel to do the very same thing against Windows. Maybe it's a real bug that fails against a Microsoft file server too; stranger things have been known. No, it worked fine against a Windows 2003 server, which to be honest did not surprise me. Microsoft tests the hell out of Microsoft Office before shipping any software that interacts with it in any way.
Time to get out the big guns: a debug log from Samba at our highest logging level, and a network packet capture trace (using the open source software "Wireshark") of when the problem was happening. Looking at the log didn't show any obvious errors, other than the fact that Excel does an insane number of operations over the network to do something as simple as a "Save File" (if you've ever wondered why Excel is slow, look at what it does over a network). A brief glance at the network capture trace didn't help either. Everything looked fine, except that on the save operation to the Samba server, Excel strangely decided to abort halfway through.
This was getting more interesting. It seemed to be a generic failure of the "Save" operation, nothing to do with the "Sync" feature at all. So let's test saving an Excel file against a Samba share without the "Sync" feature turned on in the client. Surely this must work; after all, we never ship a version of Samba without testing against Microsoft Office. Yes indeed, a normal save worked fine. So it was something to do with the "Sync" feature. But what could it be?
The only thing to do was to take a second Wireshark trace, this time from the client to a Windows 2003 server, and then compare the two packet traces, the "bad" against the "good", packet by packet.
Except of course it's not that easy (nothing in Windows interoperability ever is :-). Due to the differences in response times between servers, slight differences in supported features, and of course the fact that the Samba architecture is completely different from that of the Windows CIFS server, the packet streams soon become very different. But after you've been doing this work for 17 years, you start to recognize the fingerprints of the broad actions that clients are trying to do, even with a protocol as chatty on the network as SMB/CIFS.
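To make the idea of comparing "broad actions" rather than raw packets concrete, here is a minimal sketch in Python. It is purely illustrative and is not the tooling used for this bug, where the comparison was done by eye. It assumes each trace has already been reduced by hand to a list of operation labels; the labels below are made up for the example and are not real SMB/CIFS command names.

```python
from difflib import SequenceMatcher


def trace_differences(good_ops, bad_ops):
    """List the stretches where two operation sequences do not line up.

    Each entry is (tag, slice_of_good, slice_of_bad), where tag is one of
    difflib's opcode tags: 'replace', 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(a=good_ops, b=bad_ops, autojunk=False)
    return [
        (tag, good_ops[i1:i2], bad_ops[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical, hand-labelled operation sequences for the two servers.
    good = ["create", "write", "set_create_time", "change_notify", "close"]
    bad = ["create", "write", "set_create_time", "close", "change_notify"]
    for tag, good_part, bad_part in trace_differences(good, bad):
        print(tag, good_part, bad_part)
```

A sequence diff like this only narrows down where to look. Deciding which differences are harmless noise and which one matters still takes a human who knows the protocol.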
It took a couple of weeks of staring at the packet traces, on and off, but I eventually narrowed it down to a difference once Excel had written a temporary file out to the remote disk. Things started to be very different (and obviously wrong) at that exact point. So I started to look at the packets very closely.
The client was trying to set a "created" time stamp, to make the temporary file pretend to have been created at exactly the same time as the original file. Now one of the interesting things in writing Samba is that it has to run on top of POSIX. A POSIX system is very different from Windows, so one of the challenges we have is to be able to emulate the different Windows features on top of standard POSIX.
A POSIX file system doesn't have a "create" time stamp, so when we're reporting back to Windows when a file was created, we have to look at all the available time stamps from the system, and just pick the earliest. This has always worked in the past, but maybe we'd finally run into a situation where we need that exact create time stamp as set by the client.
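The "pick the earliest" heuristic can be sketched in a few lines of Python. This is an illustration of the idea only, not Samba's actual code (Samba is written in C); the function name `emulated_create_time` is invented for the example, and it assumes the three time stamps POSIX `stat` provides (access, modification and status change) are the ones being compared.

```python
import os
import tempfile
from datetime import datetime, timezone


def emulated_create_time(path):
    """Return the earliest of the POSIX atime, mtime and ctime for path."""
    st = os.stat(path)
    # None of these is a true creation time; the earliest is only a best guess.
    return min(st.st_atime, st.st_mtime, st.st_ctime)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        # Push atime and mtime back to fixed moments in the past (epoch seconds).
        os.utime(f.name, (1_000_000_000, 1_000_000_500))
        guess = emulated_create_time(f.name)
        print(datetime.fromtimestamp(guess, tz=timezone.utc).isoformat())
```

Note the weakness this exposes: the reported value moves whenever any of the three time stamps becomes earlier, and a client has no way to set it to an exact value of its own choosing.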
So I spent part of a day adding a temporary "created" time stamp into Samba, only held in memory. If this worked and fixed the bug I'd then find somewhere to store this on disk (probably in an "extended attribute").
No, this still didn't fix it. This was starting to make me very angry as it made no sense. I stared at the packet traces again. Even more closely. Then something jumped out at me.
The SMB/CIFS protocol has a feature where a client can be notified when a change is made on a remote file or directory. It's called a "change notify". Normally it's used to allow a client to discover when another client is modifying the same file system (it's the reason Windows "Explorer" windows spontaneously refresh with new files if a work colleague modifies the directory you're looking at). But even if a client modifies the file itself, the server still must send "change notify" packets to let the client know a file it has just requested to be modified has actually been modified. At that point in the packet stream, just after the create time stamp change was requested, the Windows server was sending a "change notify" packet, but the Samba server was sending the "change notify" after the file was written to instead. It was exactly the same packet; surely that couldn't be the problem?
I looked at our code. As POSIX can't store a created time stamp, if the client requests it to be changed (and no other time stamps) we simply return a success code. But we weren't sending a "change notify" back after this request, as technically we weren't changing the time at this point. Instead we were sending it back after the file write, when we were changing the file. So I added code to send the "change notify" back after the time stamp change.
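The ordering difference is easier to see in a toy model. The Python sketch below is hypothetical throughout: `ToyServer`, its methods and the file name are invented for illustration and bear no relation to Samba's real C code. It also assumes the server keeps sending a notify after the write once the fix is in, which the description above does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical file server that records what it sends, in order."""
    notify_on_create_time: bool
    sent: list = field(default_factory=list)

    def set_create_time(self, name):
        # POSIX has nowhere to store the value, so nothing changes on disk,
        # but the request still reports success.
        self.sent.append(("reply", "set_create_time", name, "OK"))
        if self.notify_on_create_time:
            self.sent.append(("change_notify", name))

    def write(self, name):
        self.sent.append(("reply", "write", name, "OK"))
        self.sent.append(("change_notify", name))


def first_notify_index(server):
    """Position of the first change notify in the outgoing stream, or None."""
    for i, msg in enumerate(server.sent):
        if msg[0] == "change_notify":
            return i
    return None


def replay(notify_on_create_time):
    """Run the same two client requests against a fresh toy server."""
    server = ToyServer(notify_on_create_time)
    server.set_create_time("tmp.xls")
    server.write("tmp.xls")
    return server


if __name__ == "__main__":
    for fixed in (False, True):
        server = replay(fixed)
        print("fixed" if fixed else "old", first_notify_index(server), server.sent)
```

In the old behavior the first notify only appears after the write reply; with the change it follows the time stamp reply directly, which is where the Windows server sent it.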
And the bug disappeared!
I went into one of my colleagues' offices and kicked the hell out of one of the much loved Google beanbags, all the while screaming obscenities into the air for a good five minutes. He looked on with bemused amusement. I finally calmed down enough to explain the problem. One packet being returned at the wrong time. One single mis-timed packet caused a ripple effect in the Windows client file system software that was seen all the way up in the complex user interface of only that particular version of Excel, when interacting with the "Offline Files" feature, only on Windows Vista.
The remaining task was to add a regression test into our test suite, so that this specific bug is tested for before we release any new versions of Samba. The code isn't done until it's properly tested. But at least the user is now happy.
Interoperability with Windows is hard. But somebody has to do it. And if you're going to do something, you might as well try and do it well (and try and have some fun at the same time :-).
Stop the press. As I go to publish this, the user still occasionally reports the failure even with the patch, just not as often. Looks like there may be a secondary timing effect in play as well. Oh well, no one can say this job is dull.