Read binary files more efficiently using C#

While not used prominently, binary files are still used in legacy systems. Reading binary files requires a certain amount of manipulation that can be made easier with C#; it can automatically fix byte swapping problems throughout an entire structure.


It would be nice to think that every file transfer has gone to XML; to believe that every file format you encounter today is just another XML schema to understand. But that's not the case. There are still a large number of file formats that aren't XML, or even ASCII. Binary files are still flowing across networks, being stored on disks, and passing between applications, and they’re doing it more efficiently than text files.

In C and C++, reading a binary file was easy. Except for a few carriage return/line feed problems, every file that was read into C/C++ was a binary file. C/C++ really only knew about binary files and how to make a binary file look like a text file. As the languages that we worked with got more and more abstract, we ended up with languages that couldn’t directly and easily read the files those older programs created. Each language wanted to automate the process of streaming out data in its own unique way.

Defining the problem
In many areas of computer science, C and C++ are still used to store and retrieve data directly from structures of data. It is very simple to read and write to a file from structures in memory in C or C++. In C, all you do is hand fwrite() a pointer to your structure, tell it how long the structure is and how many there are, and give it the file handle. It writes the structures directly to the file in a binary format.

This put the structure into a file and meant that reading the file, if you knew the right structure, was easy as well. You passed fread() the file handle, the pointer to the structure, how many to read, and how long the structure was. The fread() function did everything else for you. Suddenly the structure was back in memory. There was no parsing, no object models—the file read directly into memory.

The two biggest problems to be addressed in the C/C++ days were structure alignment and byte swapping. Structure alignment meant that a compiler would sometimes insert unused bytes in the middle of a structure because it would be suboptimal for a processor to access a field at an unaligned address. So the compiler optimized for speed by padding the structure or, in some cases, rearranging its fields. Byte swapping, on the other hand, referred to the process required to rearrange bytes in a structure due to a potential difference in the way that processors ordered bytes.

Structure alignment
As processors have been able to process more information at one time (within a single clock cycle), they’ve begun to expect that the information they process be lined up in a certain way. Most Intel processors work best when integers (of the 32-bit variety) are aligned along a 4-byte boundary. They slow down when an integer sits at an address that isn’t a multiple of four, and some other processors refuse to access such an integer at all. Compilers know this. So when presented with a structure that would cause an integer to not be lined up on an address that is a multiple of four, compilers have three choices.

First, they can choose to add some unused padding bytes to the structure so that the starting address for the integer is a multiple of four. This is the most common implementation. Second, they can rearrange the fields so that the integers are all aligned on a multiple of four. Because this causes some other interesting problems, it's less frequently used. The third option is to allow an integer to sit in the structure at an address that is not a multiple of four and then put code in place to move the integer to and from a scratch space that is aligned on a multiple of four. This involves a little extra overhead with each reference, but can be useful when being compact is very important.

For the most part, these are compiler details that you don’t worry about. If you’re using the same compiler with the same options for both the program that writes the data and the program that reads the data, there should be no problems. The compiler will process the same structure the same way and all will be well. But when you’re involved in cross-platform file conversion, it’s important to align everything the right way so that information can be transferred. For that reason, some programmers learned how to get the compiler to leave their structures alone.

Byte swapping: Big endians versus little endians
Big and little endian refer to two different ways that an integer can be stored in a computer. Since an integer is typically more than one byte, the question becomes whether the most significant byte or the least significant byte is stored first. The least significant byte is the one that changes most frequently. That is, if you continually add one to an integer, the least significant byte changes 256 times as frequently as the next least significant byte.

Different kinds of processors store integers differently. Intel processors store integers in little endian format, in other words, little end first. Many other processor families have traditionally stored integers in big endian format. So when a binary file is written on one platform and read on another, there’s the possibility that you’ll have to flip the bytes around to get the correct order.

This was and still is particularly a problem on UNIX, where some variants run on a Sun SPARC processor, some on an HP processor, others on an IBM PowerPC, and some on Intel-based chips. Moving from one processor to another means learning when bytes must be swapped so they end up in the order that the processor of the local system expects them.
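To make the two byte orders concrete, here is a small sketch that writes the same 32-bit value both ways. It assumes a .NET version that includes System.Buffers.Binary.BinaryPrimitives; the EndianDemo class and its method names are made up for illustration.

```csharp
using System;
using System.Buffers.Binary;

public static class EndianDemo
{
    // Returns the four bytes of value with the most significant byte first.
    public static byte[] ToBigEndian(uint value)
    {
        byte[] buffer = new byte[4];
        BinaryPrimitives.WriteUInt32BigEndian(buffer, value);
        return buffer;
    }

    // Returns the four bytes of value with the least significant byte first.
    public static byte[] ToLittleEndian(uint value)
    {
        byte[] buffer = new byte[4];
        BinaryPrimitives.WriteUInt32LittleEndian(buffer, value);
        return buffer;
    }

    public static void Main()
    {
        uint value = 0x11223344;

        // Reports the byte order of the machine running this code.
        Console.WriteLine("Host is little endian: " + BitConverter.IsLittleEndian);

        Console.WriteLine(BitConverter.ToString(ToBigEndian(value)));    // 11-22-33-44
        Console.WriteLine(BitConverter.ToString(ToLittleEndian(value))); // 44-33-22-11
    }
}
```

A file written by a big endian system would hold the first sequence; an Intel machine reading those four bytes as a native integer would see 0x44332211 unless the bytes are swapped.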

Challenges with binary files in C#
There are two additional challenges with C# and binary files. The first is that all .NET languages are strongly typed, so you’ll have to convert a stream of bytes from the file into the data types that you want. The second is that some data types are more complex than they appear on the surface and may need some conversion.

Type breaking
Because .NET languages, including C#, are strongly typed, you can’t just arbitrarily read a number of bytes from a file and jam it into a structure. You’ll have to start by reading the number of bytes you need into an array of bytes and then copying them over to the structure while breaking the type casting rules.

Searching back in Usenet archives, you’ll find several postings in the microsoft.public.dotnet hierarchy with a set of routines that will allow you to convert any object into a series of bytes and back to the object again. They appear here in Listing A.
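As a rough sketch of what such routines can look like (this is not the Usenet code itself), the following uses the Marshal and GCHandle classes from System.Runtime.InteropServices. The RawSerializer class and the SampleHeader structure are hypothetical names chosen for the example, and the approach assumes the structure contains only fields the marshaler can lay out at a fixed size.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical record layout used only to exercise the helpers below.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct SampleHeader
{
    public ushort Version;
    public uint Length;
}

public static class RawSerializer
{
    // Copies the marshaled form of a structure into a new byte array.
    public static byte[] ToBytes<T>(T value) where T : struct
    {
        int size = Marshal.SizeOf<T>();
        byte[] buffer = new byte[size];
        IntPtr ptr = Marshal.AllocHGlobal(size);
        try
        {
            Marshal.StructureToPtr(value, ptr, false);
            Marshal.Copy(ptr, buffer, 0, size);
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
        return buffer;
    }

    // Rebuilds a structure from the first SizeOf<T>() bytes of the array.
    public static T FromBytes<T>(byte[] buffer) where T : struct
    {
        int size = Marshal.SizeOf<T>();
        if (buffer.Length < size)
            throw new ArgumentException("Buffer is smaller than the structure.", nameof(buffer));

        // Pin the array so the garbage collector cannot move it while we read from it.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            return Marshal.PtrToStructure<T>(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free();
        }
    }

    public static void Main()
    {
        var header = new SampleHeader { Version = 2, Length = 1024 };
        byte[] raw = ToBytes(header);
        SampleHeader copy = FromBytes<SampleHeader>(raw);
        Console.WriteLine(raw.Length);                          // 6
        Console.WriteLine(copy.Version + " " + copy.Length);    // 2 1024
    }
}
```

To read a record from a file, you would fill the byte array from a FileStream or BinaryReader first and then hand it to FromBytes.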

Complex data types
In C++, you know what is an object, what is an array, and what isn’t either an object or an array. But in C#, things aren't as simple as they seem. A string is an object; so is an array. Because there are no true inline arrays and because many objects have no fixed size, there are some complex data types that don’t fit neatly into fixed binary structures.

Fortunately, .NET offers a way to resolve this issue—you can tell C# how you want your strings and other types of arrays handled. This is accomplished via the MarshalAs attribute. In the case of a string in C#, the attribute should go immediately above the member to be controlled and should look like this:
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 50)]

The SizeConst parameter should be set to the length of the string as it will be stored in or retrieved from the binary file. This fixes the string length at some maximum.
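For illustration, here is the attribute in the context of a complete structure. The CustomerRecord type and its fields are hypothetical, and the sketch assumes single-byte (ANSI) characters in the file. With ByValTStr, the fixed-size slot includes the terminating null character, so a SizeConst of 50 leaves room for at most 49 characters of text.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical record: a 4-byte id followed by a 50-byte name field.
[StructLayout(LayoutKind.Sequential, Pack = 1, CharSet = CharSet.Ansi)]
public struct CustomerRecord
{
    public uint Id;

    // Stored inline as a fixed 50-byte character array rather than as a reference.
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 50)]
    public string Name;
}

public static class StringFieldDemo
{
    public static void Main()
    {
        // 4 bytes for Id plus 50 bytes for Name.
        Console.WriteLine(Marshal.SizeOf<CustomerRecord>()); // 54
    }
}
```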

Solving classic problems
Now that you know how the .NET-introduced problems are solved, it’s time to see how easily the classic binary file problems are solved.

Packing
Instead of using compiler options to control how structures are arranged, you can assign a StructLayout attribute to a structure to explicitly state how you want that structure arranged or packed. This is particularly useful when you need different structures to be packed differently. It’s much like packing your car. Using the StructLayout is like carefully deciding whether you want to pack everything tightly or if you want to just throw it in and hope it works out. The StructLayout attribute should look like this:
[StructLayout(LayoutKind.Sequential, Pack = 1)]

This causes the layout of the structure to ignore alignment boundaries and pack the structure as tightly as possible. This should correspond with any structure you’re reading from a binary file.
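The effect of Pack is easy to see by comparing two otherwise identical structures. This is a small sketch; the DefaultLayout and PackedLayout names are invented for the example.

```csharp
using System;
using System.Runtime.InteropServices;

// Default packing: the uint is aligned on a 4-byte boundary.
[StructLayout(LayoutKind.Sequential)]
public struct DefaultLayout
{
    public byte Flag;
    public uint Value;
}

// Pack = 1: no padding is inserted between fields.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedLayout
{
    public byte Flag;
    public uint Value;
}

public static class PackingDemo
{
    public static void Main()
    {
        Console.WriteLine(Marshal.SizeOf<DefaultLayout>());               // 8
        Console.WriteLine(Marshal.OffsetOf<DefaultLayout>("Value"));      // 4
        Console.WriteLine(Marshal.SizeOf<PackedLayout>());                // 5
        Console.WriteLine(Marshal.OffsetOf<PackedLayout>("Value"));       // 1
    }
}
```

If the file was written by a C program that used default alignment rather than tight packing, the first layout is the one that would match it.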

You may find that even adding this attribute doesn’t completely resolve the issues with your structure. In such cases, you’ll probably have to tediously work through the issues by trial and error. One of the reasons we’ve moved away from binary data structures, particularly for cross-platform use, is the subtle problems that can be caused by the way different computers and compilers handle things at the binary level. .NET is good at adapting to other binary files, but it’s not perfect.

Endian flipping
One of the classic problems with reading and writing binary files is that some computers store the least significant byte first (e.g., Intel) and others store the most significant byte first. In C/C++, you had to manually address this problem by flipping each field one by one. One of the great things about the .NET Framework is that the code has access to the metadata for types at runtime, so you can read that information and use it to automatically address the endian problem for every field in a structure. The code in Listing B is a basic example of how this can be done.

Once you get the object’s type, you can get the fields within the structure and then check each one to determine whether it’s an unsigned integer of 16 bits or 32 bits. In either of these cases, the bytes arrive swapped, so you can swap them back by masking off a byte at a time, shifting it to its new position, and then combining everything back together.
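The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the original listing: the EndianFlipper class and the PacketHeader structure are hypothetical, and only ushort and uint fields are handled.

```csharp
using System;
using System.Reflection;

// Hypothetical structure with one field of each kind discussed in the text.
public struct PacketHeader
{
    public ushort Kind;
    public uint Length;
    public string Name;
}

public static class EndianFlipper
{
    // Swaps the two bytes of a 16-bit value.
    public static ushort Swap(ushort v)
    {
        return unchecked((ushort)((v >> 8) | (v << 8)));
    }

    // Masks off each byte of a 32-bit value and shifts it to the mirrored position.
    public static uint Swap(uint v)
    {
        return ((v & 0x000000FFu) << 24) |
               ((v & 0x0000FF00u) << 8)  |
               ((v & 0x00FF0000u) >> 8)  |
               ((v & 0xFF000000u) >> 24);
    }

    // Returns a copy of the structure with every ushort and uint field byte-swapped.
    public static T FlipFields<T>(T value) where T : struct
    {
        // Box once so that FieldInfo.SetValue updates a single copy.
        object boxed = value;
        FieldInfo[] fields = typeof(T).GetFields(
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance);

        foreach (FieldInfo field in fields)
        {
            if (field.FieldType == typeof(ushort))
                field.SetValue(boxed, Swap((ushort)field.GetValue(boxed)));
            else if (field.FieldType == typeof(uint))
                field.SetValue(boxed, Swap((uint)field.GetValue(boxed)));
            // All other field types, including strings, are left as they are.
        }
        return (T)boxed;
    }

    public static void Main()
    {
        var header = new PacketHeader { Kind = 0x1234, Length = 0x11223344, Name = "abc" };
        PacketHeader flipped = FlipFields(header);
        Console.WriteLine(flipped.Kind.ToString("X4"));   // 3412
        Console.WriteLine(flipped.Length.ToString("X8")); // 44332211
        Console.WriteLine(flipped.Name);                  // abc
    }
}
```

Because the loop is driven by reflection, adding a field to the structure requires no change to the flipping code.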

Notice that you don’t do anything with strings. Strings aren't affected by the big endian/little endian discussion, so those fields are left untouched by the flipping code. You also flip only unsigned integers. This is because negative numbers aren't always represented the same way on every system. There is the ones' complement notation for negative numbers and the more popular two's complement. This makes fixing negative numbers across platforms slightly more difficult. Luckily, negative numbers are rarely communicated in binary files.

Just to make things interesting, floating point numbers are sometimes represented in nonstandard ways as well. Although most systems have settled on the IEEE format for floating point numbers, there are a few, particularly older systems, that use other formats.

Overcome resistance
You can make C# read binary files despite its initial resistance. In fact, C# can be a better language to read in binary files because of the way that it maintains accessible metadata about the objects it works with. Because of this, C# can automatically fix byte swapping problems throughout an entire structure.
