The architecture of a flexible .NET file processing system -- Part 1


Many companies rely on importing and exporting data via files as a communication mechanism with their partners. While this may seem like an old and outdated process in the age of Web services, there are, and probably always will be, companies that communicate this way.

Many times, these systems are one-off designs that aren't very flexible and pose significant hurdles in terms of maintainability and scalability. In this series of articles, I will lay the groundwork for an architecture that is both highly dynamic and scalable.

If you have a thorough understanding of C#, message queuing (MSMQ), XML, and serialization, you will have an easy time picking up the concepts presented here. If you are a little weak in one or two of these skills, you should still get a lot of good information from this series; however, you may have a harder time implementing the techniques presented.

Requirements

Before we start looking at our architecture, we need to define what this system is meant to do. Below are the requirements for this system:

  • Must be horizontally scalable
  • Must be able to load files of any type (fixed length, delimited, XML, etc.)
  • Must be able to transform these files from one format to another (e.g., unzip or decrypt files)
  • Must parse the files via configurable file layout specifications
  • Must insert the data from files into a database
  • Must be able to accept new processing modules without requiring the whole system to be reset or recompiled
  • Must load files immediately after they are placed on the server

These requirements are very general, and in your situation there will probably be more specific requirements that must be met. If you don't see something you need in the list above, keep reading. This architecture is meant to be highly flexible, and chances are, you will be able to plug in the features that are required in your situation.

The architecture

The diagram in Figure A shows a general overview of the architecture I am presenting in this series of articles:

Figure A -- Diagram 1

Note: The arrows represent information flow, not necessarily the flow of physical files. 

The following steps are presented in this diagram:

  • Files are dropped onto the system and "Listener" processes detect these files.
  • The Listener Process then sends a message (more on the message construction in the next article) to the Processing Queue to indicate that the file has arrived. (A minimal listener sketch follows this list.)
  • The Message Router then picks up the message from the Processing Queue and determines what needs to happen to the message.
  • After routing logic has been performed and a decision has been made on the destination of the message, the message is forwarded to its destination queue.
  • After the message arrives at its destination queue, task-specific processes read it. These task-specific processes are completely independent of one another and can do whatever they need to with the messages: unzip the uploaded file, parse the message's associated file into the database, or perform some other task on the file and send the message back to the Processing Queue for further processing.
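
To make the first two steps concrete, here is a minimal C# listener sketch. Treat it as a sketch only: it assumes MSMQ via System.Messaging (which requires a reference to System.Messaging.dll), a hypothetical private queue named FileProcessing, a hypothetical drop directory of C:\FileDrop, and a simple XML-serialized message body. The real message construction is covered in the next article.

using System;
using System.IO;
using System.Messaging;

// Hypothetical message body; the actual message layout is described in the next article.
public class FileArrivalMessage
{
    public string FilePath { get; set; }
    public DateTime DetectedUtc { get; set; }
}

public class ListenerProcess
{
    // Assumed queue path; adjust to your environment.
    private const string ProcessingQueuePath = @".\private$\FileProcessing";

    public static void Main()
    {
        // Watch the drop directory for newly created files.
        var watcher = new FileSystemWatcher(@"C:\FileDrop");

        watcher.Created += delegate(object sender, FileSystemEventArgs e)
        {
            // Send a lightweight message about the file, not the file itself.
            using (var queue = new MessageQueue(ProcessingQueuePath))
            {
                queue.Formatter = new XmlMessageFormatter(new[] { typeof(FileArrivalMessage) });
                queue.Send(new FileArrivalMessage
                {
                    FilePath = e.FullPath,
                    DetectedUtc = DateTime.UtcNow
                }, "File arrived");
            }
        };

        watcher.EnableRaisingEvents = true;
        Console.WriteLine("Listening for files. Press Enter to exit.");
        Console.ReadLine();
    }
}

In a production listener you would also pick up files that already exist at startup and guard against files that are still being written, but those details are left out of the sketch.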

Using this type of technique gives us several advantages in the way of scalability and maintainability:

Using message queues allows us to have the processing applications on separate computers and also enables us to have multiple processing applications pulling from the same queue, thereby spreading out the workload (horizontal scaling).

Instead of routing the file all over the place, we route lightweight messages that contain information about the file. This reduces network traffic between the servers involved in processing and keeps file movement to a minimum, which helps when debugging issues. Keep in mind that there are times when file movement is pretty much required, such as unzipping and decrypting.

Using a Message Router to determine where messages should go improves maintainability because your routing logic isn't tied up in the listener processes. This logic will be stored in a database or an XML file -- the layout of this logic will be covered in a future article.
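
Purely for illustration, here is a minimal sketch of what the Message Router's lookup could look like, assuming the routing rules live in a hypothetical routes.xml file that maps a file extension to a destination queue path. The actual layout of the routing logic is covered in a future article, so the element and queue names here are placeholders.

using System;
using System.IO;
using System.Xml;

public class MessageRouter
{
    // Hypothetical rule file, for example:
    // <routes>
    //   <route extension=".zip" queue=".\private$\Unzip" />
    //   <route extension=".txt" queue=".\private$\Import" />
    // </routes>
    public static string ResolveDestination(string filePath, string ruleFile)
    {
        var rules = new XmlDocument();
        rules.Load(ruleFile);

        string extension = Path.GetExtension(filePath).ToLowerInvariant();

        // Find the first rule whose extension matches the file's extension.
        XmlNode match = rules.SelectSingleNode(
            string.Format("/routes/route[@extension='{0}']", extension));

        // Fall back to an assumed dead-letter queue when no rule matches.
        return match != null
            ? match.Attributes["queue"].Value
            : @".\private$\DeadLetter";
    }
}

The router then forwards the message to whatever queue path this lookup returns. Because the rules live in a file (or a database table), adding a new route is a configuration change rather than a code change.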

The heart of this architecture is really the interaction between the Message Router and the task-specific queues. This approach allows you to easily plug in new task-specific processes to meet changing business requirements.

For example, assume that you're currently just importing flat files directly into the database. Chances are that you're detecting the new file, opening the file, parsing it, and inserting the data into the database. In that type of system, when a new requirement comes up (such as unzipping before parsing), chances are you'll have to either rewrite your current parser to do the unzipping or modify some other code and redeploy the whole solution.

With the architecture shown above, you would simply modify the routing logic settings (which aren't hard-coded -- this is critical), create a new "Unzipping" task-specific queue, and write an unzipping process to consume messages from that queue. You would not have to change any code in your existing processes or redeploy them. In a corporate environment, this can save a lot of time, as there are often several steps involved in moving code changes into production.
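
To show how little new code that involves, here is a rough sketch of such an unzipping process, reusing the hypothetical FileArrivalMessage class and queue names from the listener sketch above. It assumes the ZipFile class available in later versions of .NET (on older versions you would call a third-party zip library instead), and it sends a message for each extracted file back to the Processing Queue so the routing logic can decide what happens to it next.

using System;
using System.IO;
using System.IO.Compression; // ZipFile; swap in your own zip/decryption library if needed
using System.Messaging;

public class UnzipProcess
{
    // Assumed queue paths; they must match whatever the routing logic uses.
    private const string UnzipQueuePath = @".\private$\Unzip";
    private const string ProcessingQueuePath = @".\private$\FileProcessing";

    public static void Main()
    {
        using (var unzipQueue = new MessageQueue(UnzipQueuePath))
        using (var processingQueue = new MessageQueue(ProcessingQueuePath))
        {
            var formatter = new XmlMessageFormatter(new[] { typeof(FileArrivalMessage) });
            unzipQueue.Formatter = formatter;
            processingQueue.Formatter = formatter;

            while (true)
            {
                // Blocks until a message arrives; several copies of this process can
                // read from the same queue to spread the workload.
                var body = (FileArrivalMessage)unzipQueue.Receive().Body;

                // Extract the archive next to the original file.
                string targetDir = Path.Combine(
                    Path.GetDirectoryName(body.FilePath),
                    Path.GetFileNameWithoutExtension(body.FilePath));
                ZipFile.ExtractToDirectory(body.FilePath, targetDir);

                // Send each extracted file back to the Processing Queue so the
                // routing logic determines its next step.
                foreach (string extracted in Directory.GetFiles(targetDir))
                {
                    processingQueue.Send(new FileArrivalMessage
                    {
                        FilePath = extracted,
                        DetectedUtc = DateTime.UtcNow
                    }, "Extracted file");
                }
            }
        }
    }
}

Note that the process never decides what happens to the extracted files; it simply reports them to the Processing Queue, which is what makes the nested-zip scenario described below work without any code changes.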

Notes on the architecture

Another point worth making about this system: after a task-specific process finishes working with a message, it almost always sends a new message back to the Processing Queue. This lets the system figure out on its own (via the routing logic) what needs to happen to the file next, instead of having the task-specific process decide.

In my own experience, this type of functionality has saved me more than once. One instance was when our partners sent zip files that contained other zip files (we expected a zip file to contain only ordinary flat files). After our unzipping process decompresses an archive, it sends a message to the Processing Queue for each file it extracted. In this case, the extracted files were themselves zip files, so after being sent to the Processing Queue by the unzipping process, they were routed straight back to the unzipping process to be decompressed.

If the logic for where to send the results of unzipping had lived entirely inside the unzipping process, we would not have handled this situation correctly.

In the next article

In the next article, I will describe how the incoming listeners are implemented and explain what type of data will be contained in the messages flowing through the message queues.

17 comments
zs_box

Somehow I totally missed all these comments when they were originally posted. As far as speed is concerned, the production implementation of this architecture is faster than we even need it to be. In the case of a Zipped, Encrypted, multiple record file, the process imports that file in less than a couple seconds. That includes unzipping, decrypting, and importing each individual record to the database. Are there faster ways to do it? Sure!! But those implementations are also probably harder to support. I agree with the comments that this series can be applied to frameworks other than .NET. However, I am attempting to help people implement this system using .NET, which is why I mention the use of things like the XmlDocument object. Sorry it took so long to respond!

rwoodruf

When is the next article coming out?

TheGooch1

This sounds interesting so far. My company uses an enterprise task scheduler to perform each step on files (create a group of jobs that perform each step in the correct sequence and issue an alert if any step fails). But why .Net? This seems like a design-oriented article, describing a solution that could be written in any language.

tv_p

This article sounds very interesting. Nothing completely new, but waiting for the next article.

Tony Hopkinson

I did something like this in '96: fixed-format files, polled for new ones to process... But those are implementation details in a robust and scalable system.

Justin James

... makes very, very little sense to me. About a year ago, I wrote a simple file parser. The program was probably 10 - 20 lines of code. It would iterate through each line of a file. If the line started with one character, it would write that line to one file; if it started with a different character, it would write the line to a second file. I noticed that it ran fairly slowly. So I rewrote it in Perl. Nearly line for line the same code. It literally ran hundreds of times faster. Look at the monolithic architecture in this piece. If the system ran 100 times faster (and that is before these 3,000 layers), it would need 100 times less scalability. In other words, use the right tool for the job! Processing text? Use Perl. J.Ja

techrepublic.com

Bryan, do you still have these files? I would like to compare C# 4.0 against PERL 5.12. Thanks, Peter.

bsm

James, I think your suggested approach of generating data in memory and piping it to the other programs is a reasonable solution. My inner mad scientist says: Create a test harness with an inner and outer loop. The outer loop will generate a dataset beginning at 1MB and double in size each time through. The inner loop will run each test application in succession a specified number of times to validate the results. This way you can see how each of the languages behaves as the memory requirement goes up. -Bryan P.S. I did not receive an email notification regarding your last post. I wonder if 'slurp' is a spam word?

Justin James

Bryan - Thanks! I am still not sure exactly how to go about eliminating disk speed from the equation, except for piping input from a random line generator. And even then, the line generator may turn out to be slower than the code itself, also holding it up! On your 3 points... 1) No worries on that one! If Perl is something you find yourself working with in the future, my best suggestion is to check out the perlfaq sections of the documentation. It is a good, "why" based approach that explains why some code is better than other code, even when the end result and even the logic is the same. The piece I used came straight from there years ago, and I've been using it ever since. I tried doing it the way you did it, ages ago, since it "feels" natural. It is actually a GREAT example of how clever Perl is, when you think about it... "foreach" working on a *filehandle* and being smart enough to know that in that context, you wanted to read line by line! Any other language would have gagged. It implied a slurp of the whole file, and then a record split on the newline character. That is why it was slow too, BTW. But still cool to see Perl's "do what I mean, not what I said" philosophy in action. 2) That makes sense to me. Perl does support a case statement of sorts... it is a little odd though, and I prefer to avoid it many times. It is one of those odd cases where instead of distinctly building it in, you are shown in the docs how to use a completely different set of functionality to do the same thing. On the one hand, that decision feels kind of hack-ish to me, but it makes sense from the standpoint of elegance. Oh well. :) 3) Any suggestions on taking the speed out of it? I think if I had something pre-generate random text (or read it from disk first), store it in memory, then make the call and time the process while piping the data, that would take all of the disk considerations out of the loop. What do you think? J.Ja

bsm

Justin, I couldn't reply directly to your last post due to the maximum message limit. Feel free to post the code. Just a few comments: 1) As I said, I'm not really a PERL programmer so I just took from existing examples. None of the examples I found mentioned the subtleties you pointed out. My misuse of the language degraded its performance. This goes to highlight two points: any language can be made to under-perform; programming is more than just understanding the syntax. 2) My use of consecutive non-nested IF statements was intentional. I wanted the processing time to remain constant for each iteration to compensate for any discrepancies in the pseudo-randomly generated input files. My assumption was that a line could contain an 'A', a 'B', or something else unexpected. This could have been coded as a nested IF statement or a CASE statement. If-Then is ubiquitous and allowed the various programs to look almost identical. 3) I agree that on a sufficiently fast machine this ultimately becomes a test of disk throughput. My machine is a 2.6Ghz P4 with 1GB of RAM. Even at this speed, I believe disk IO had an impact on the timing. 4) For future comparisons, I should mention that I'm using the following runtimes/compilers: PERL - ActiveState version 5.8.3; C# - Visual Studio 2005 compiled to .Net Framework 2.0; VBS - Windows Script Host 5.6. -Bryan

Justin James

Bryan - Thanks for sending the code! I have looked at the Perl code so far, but with 2 changes (modifying 1 line, adding a second line) I dropped it to under 50% of its former run time, at least in eyeballing 5 or so runs of your code vs. 5 or so of mine (didn't divide the speed numbers yet). The changes were easy: * Change the loop from "foreach $line (<FH>)" to "while (<FH>)" * Add this line to be the first statement within the while loop: "$line = chomp;" What this does is prevent it from trying to slurp the file into an array and then iterate through the array, and shifts it to a one-way data reader mode. It actually highlights one thing I also liked about Perl; this reader pattern is in the documentation (perlfaq5), and the documentation shows why this is usually the best way to do this. Unlike a lot of other languages, the documentation goes beyond syntax, and into technique. Interestingly enough, replacing the regex with substr actually resulted in a performance *loss* (.10 seconds on my PC), which somewhat surprised me. Also interesting, in both cases, putting the second if statement in the else clause of the first (the idea was to minimize the number of calculations needed) also resulted in a small loss (again, about .1 seconds). My modifications also bring the Perl code much closer to the C# code, in terms of the approach to reading the data. The C# code on my system performed *worse* than my modified Perl code, but better than your original code; it turned in numbers between 0.114 seconds and 0.142 seconds, and there was a great deal of differentiation between multiple runs, for whatever reason. The Perl code was extremely consistent in its speeds, with only minor deviations from the average. I found this tidbit of data rather interesting. I also note, my system appears to be significantly faster than yours, judging by the numbers (my average with your Perl on the 100K file was around 0.225 seconds, and with my best modifications, around 0.095 seconds!), compared to the 0.812 average you were getting, and my 0.125 or so "eyeball average" on the C# vs. your 0.343. I am sure that at this point, I am getting very close (if not there) to hitting the roadblock of disk speeds. It helps that I am reading from a RAID 1 (RAID 1 can read 2 disks at once, VERY fast for read, but slow for writing) on SATA II with WD RAID Edition disks, and it is a Core 2 Duo E6300, with 2 GB of super fast RAM and 4 GB of ReadyBoost memory. Not bragging on machines (hey, I was running an Athlon 1900+ w/512 MB RAM until this winter...), but it is obvious to me that running this code on this PC is now getting close to hitting the limits of disk speed, which makes testing the code speed rather tough; I may consider rewriting the code for both to read from STDIN, and piping input to them in order to have the input pre-read to eliminate disk speed from the test. With your permission, I would like to publish your code on TechRepublic in my blog (I write in the "Programming and Development" space), and discuss some of these general findings. I find this topic fascinating, and I would like to be able to give full credit where it is due. If this is not OK with you, I will just reference this thread. Thanks again! It has been an illuminating night! J.Ja

Justin James

I am inclined to believe that much C# code is around that 90% mark. I think it drops significantly when you have a resource-intensive application and you are relying on the garbage collector, though. I worked on a test app to try out some multithreading tricks a while back, and the number one performance hit was on RAM. The garbage collector simply was not keeping up with a zillion short-lived threads and objects, and it spun too many wheels trying to clean up. Very messy. But it is things like this that caution me about using .Net in a high-volume situation. You are 100% right about those try/catch blocks. They are explicitly the reason why the major data types in .Net now have a "TryParse()" method! Before (in 1.X), you had to wrap an instantiation within try/catch to determine if incoming data could fit your variable (or go through a ton of manual coding hoops, like checking the length of a string, and if it met certain conditions, checking for decimals, the value as expressed in characters, and so on, just to see if it was an integer or not). Especially in Web development, where all input is a string, it is very easy to kill your performance by using try/catch instead of TryParse(). You can send the code over to me at j_james at mindspring dot com (why do email harvesters not bother to test for that pattern?). I would also be interested in seeing the C# and VBScript code; you may have written it much smarter than I did! On that note, you compared C# to Perl; I compared VB.Net to Perl. I am fairly convinced that the VB.Net compiler is not nearly as optimized as C#'s. For example, I saw a 100x (as in, 10,000%) code speed increase in VB.Net when I moved from:

For x = 0 to InstanceOfCollection.Count - 1
    'Do stuff
Next x

to...

Dim max as Integer
max = InstanceOfCollection.Count - 1
For x = 0 to max
    'Do stuff
Next x

I heard that C# is smart enough to not recalculate "InstanceOfCollection.Count - 1" on each iteration if the collection has not been touched during the iteration, but I have not tested this myself. If this is true, it would actually make C# the newbie language of choice, since this would mean that VB.Net developers need a lot more refactoring to achieve performance equal to what the C# compiler does with its own optimizations out of the box. J.Ja
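
For reference, the TryParse() pattern mentioned above looks roughly like this (a standalone sketch with a made-up input string, not code from this discussion):

using System;

public class TryParseDemo
{
    public static void Main()
    {
        string input = "not a number"; // e.g., raw text posted from a web form

        // TryParse reports failure through its return value instead of throwing,
        // so bad input stays cheap to handle.
        int value;
        if (int.TryParse(input, out value))
        {
            Console.WriteLine("Parsed: " + value);
        }
        else
        {
            Console.WriteLine("Not an integer.");
        }
    }
}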

Justin James

Yes, I too wish there was a good implementation of Perl for .Net. So far, the best I found is ActiveState's Perl.Net, which was truly miserable. J.Ja

Tony Hopkinson

... that this facility was something new, which is fallacious to the point of being a party sound bite. I wouldn't have any problem using .net to do this if I was working in a .net environment and had nothing in place that could already accomplish it. Certainly I'd take a dim view of someone adding perl to the required list for a system to work, for the sake of a tenth of a second in execution time. That is simply an ongoing maintenance trade-off; something like this is so trivial, it's not worth the effort. IronPerl is definitely high on my list of wants though.

bsm

I included VBScript as a comparison to another scripting language. It highlights just how efficient the PERL interpreter is in that it's 6x faster than VBScript. C# is managed code but that should not be confused with interpreted. The byte code is JIT compiled to optimized native instructions at runtime. The resulting native instructions call the managed runtime for resource management whereas "true" native code makes system calls. The penalty incurred on the first run is due to the JIT compilation. My understanding is that the speed increase on subsequent runs is due to caching of the compiled byte code. Microsoft claims that the performance of C# is within 90% of C++ compiled by Visual Studio. I have not tested nor verified that claim. I have seen C# perform miserably. One of our C# programmers complained our Oracle 10 database was too slow; his application was taking several minutes to process a few thousand records. The DB admin replicated the logic in PERL. His script was significantly FASTER than the C# application proving Oracle was not the bottleneck. Profiling the C# application showed that the majority of time was spent in exception handling. The code was using a try-catch block to handle nulls. We learned the hard way that raising exceptions is incredibly expensive. By explicitly testing for nulls prior to processing a record, the execution time was reduced from >3 minutes to

Justin James

Bryan - In the program I did the same thing with, I was comparing VB.Net to C#. It may be the case that C# is significantly faster than VB.Net, or has a particular compiler optimization that VB.Net is not using. It may also be a version difference; I did this on .Net 1.1, right before .Net 2.0 and VS 2005 came out; I am sure that Microsoft has worked hard to improve the speed since then. I may also point out that neither .Net nor VBScript is "native code". VBScript, like Perl, is an interpreted language. C#, VB.Net, and all other .Net languages compile to bytecode that runs within the .Net CLR environment, which is an interpreter/VM of its own. It's a minor quibble, but an important one. The Perl interpreter has a heck of a lot less overhead than the .Net Framework and the .Net CLR. As such, the Perl interpreter is theoretically faster than the .Net CLR, *all else being equal*. The .Net CLR takes time to be fired up and such. One thing that I have found is that for a one-shot run-through, .Net is VERY slow on the initial pass, but subsequent repetitions of the same code are much faster if the process stays in memory. That may also affect the numbers a bit. My test was done with a pure console application. On interactive tests, from within a constantly running app, I would definitely expect .Net to pick up some steam. On that note, if you are running the code multiple times, but restarting the app on each pass, .Net will also gain a boost, because Windows will be loading the Perl interpreter each time from scratch and interpreting the code from scratch each time, while Windows will happily cache the .Net assemblies, and the static nature of the .Net code will provide an advantage there. I would definitely be interested in seeing the code that you wrote; it would give some clues as to why your numbers are different from mine! J.Ja

bsm

Justin, It is inherently impossible for PERL to run 100 times faster than native code unless the native code contains some sort of flaw. I replicated your experiment in PERL, C#, and VBScript using a 4MB 100,000 line file as input. I ran each program 5 times to validate results. The results are as follows:

Language   Min      Max      Avg
---------------------------------
VBScript   5.671s   5.999s   5.768s
PERL       0.812s   0.874s   0.843s
C#         0.343s   1.468s   0.634s

In each case the input file was read sequentially one line at a time. The first char of each line was checked to see if it was either an A or a B. Lines beginning with 'A' were written to A.txt and lines beginning with 'B' to B.txt. C# proved to be 33% faster than PERL on average. Even comparing PERL to the worst case scenario, VBScript, it is only 6x faster; nowhere near a 100x performance increase. We use PERL, Python, C, C# and a plethora of other languages in our daily work. We find that for most tasks, C# is 10x faster than any of our favorite scripting languages. I will provide the code and data files to anyone who wants to replicate results. -Bryan
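
For anyone who wants to reproduce the comparison without the original files, a rough C# reconstruction of the test described above (not Bryan's actual code; the input and output file names are assumed):

using System;
using System.Diagnostics;
using System.IO;

public class SplitLines
{
    public static void Main()
    {
        var timer = Stopwatch.StartNew();

        using (var reader = new StreamReader("input.txt")) // assumed input file
        using (var aOut = new StreamWriter("A.txt"))
        using (var bOut = new StreamWriter("B.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;

                // Route each line by its first character.
                if (line[0] == 'A') aOut.WriteLine(line);
                else if (line[0] == 'B') bOut.WriteLine(line);
            }
        }

        timer.Stop();
        Console.WriteLine("Elapsed: " + timer.Elapsed.TotalSeconds + "s");
    }
}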