
The architecture of a flexible .NET file processing system -- Part 1

This article is also available as a TechRepublic download.

Many companies rely on importing and exporting data via files as a communication mechanism with their partners. While this may seem like an outdated process in the age of Web services, there are, and probably always will be, companies that communicate this way.

Many times, these systems are one-off designs that aren't very flexible and pose significant hurdles in terms of maintainability and scalability. In this series of articles, I will lay the groundwork for an architecture that is both highly dynamic and scalable.

If you have a thorough understanding of C#, message queuing (MSMQ), XML, and serialization, you will have a pretty easy time picking up on the concepts that will be presented. If you are a little weak in one or two of these skills, you should still get a lot of good information from this series. However, you may have a harder time with implementation of the techniques presented.

Requirements

Before we start looking at our architecture, we need to define what this system is meant to do. Below are the requirements for this system:

  • Must be horizontally scalable
  • Must be able to load files of any type (fixed length, delimited, XML, etc.)
  • Must be able to transform these files from one format to another (e.g., unzipping or decrypting files)
  • Must parse the files via configurable file layout specifications
  • Must insert the data from files into a database
  • Must be able to accept new processing modules without requiring the whole system to be reset or recompiled
  • Must load files immediately after they are placed on the server

These requirements are very general, and in your situation there will probably be more specific requirements that must be met. If you don't see something you need in the list above, keep reading. This architecture is meant to be highly flexible, and chances are, you will be able to plug in the features that are required in your situation.

The architecture

The diagram in Figure A shows a general overview of the architecture I am presenting in this series of articles:

Figure A — Diagram 1


Note: The arrows represent information flow, not necessarily the flow of physical files. 

The following steps are presented in this diagram:

  • Files are dropped onto the system and "Listener" processes detect these files.
  • The Listener Process then sends a message (more on the message construction in the next article) to the Processing Queue to indicate the file has arrived.
  • The Message Router then picks up the message from the Processing Queue and determines what needs to happen to the message.
  • After routing logic has been performed and a decision has been made on the destination of the message, the message is forwarded to its destination queue.
  • After the message arrives at its destination queue, task-specific processes will read it. These task-specific processes are completely independent of the other processes and can do anything they please with the messages. This includes using the message to unzip the uploaded file, parsing the message's associated file into the database, or performing some task on the file and sending the message back to the processing queue for further processing.
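
The steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation this series builds: it uses Python for brevity, in-memory `queue.Queue` objects stand in for MSMQ queues, and the extension-based `ROUTES` table, the queue names, and the message shape are all hypothetical.

```python
import queue
from pathlib import PurePath

# In-memory stand-ins for the message queues (the article uses MSMQ).
processing_queue = queue.Queue()
destination_queues = {"unzip": queue.Queue(), "parse": queue.Queue()}

# Hypothetical routing rule: choose a destination queue by file extension.
ROUTES = {".zip": "unzip", ".csv": "parse"}


def listener_detects(path):
    """Listener: announce a newly arrived file on the Processing Queue."""
    processing_queue.put({"path": path})


def route_one():
    """Message Router: take one message and forward it to its destination queue."""
    message = processing_queue.get()
    destination = ROUTES[PurePath(message["path"]).suffix.lower()]
    destination_queues[destination].put(message)
    return destination


listener_detects("/inbox/orders.zip")
listener_detects("/inbox/customers.csv")
routed = [route_one(), route_one()]
```

Note that the listener knows nothing about unzipping or parsing, and the router never touches the file itself; only the message moves.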

Using this type of technique gives us several advantages in the way of scalability and maintainability:

Using message queues allows us to have the processing applications on separate computers and also enables us to have multiple processing applications pulling from the same queue, thereby spreading out the workload (horizontal scaling).
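
As a rough illustration of several processing applications pulling from the same queue, the sketch below starts three workers against one shared queue. It is a simplification: threads in one Python process stand in for applications on separate computers, `queue.Queue` stands in for MSMQ, and the worker names and `STOP` sentinel are invented for the example.

```python
import queue
import threading

work_queue = queue.Queue()
processed = []                      # (worker name, file) pairs
processed_lock = threading.Lock()
STOP = object()                     # sentinel telling a worker to exit


def worker(name):
    """Pull messages from the shared queue until the stop sentinel arrives."""
    while True:
        message = work_queue.get()
        if message is STOP:
            break
        with processed_lock:
            processed.append((name, message))


workers = [threading.Thread(target=worker, args=(f"parser-{i}",)) for i in range(3)]
for t in workers:
    t.start()

for n in range(10):
    work_queue.put(f"file-{n}.txt")
for _ in workers:
    work_queue.put(STOP)            # one sentinel per worker
for t in workers:
    t.join()
```

Each message is handled by exactly one worker, so adding workers spreads the load without any change to the code that produces the messages.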

Rather than routing the file all over the place, we route lightweight messages that contain information about the file. This reduces network traffic between the servers involved in processing and keeps file movement to a minimum, which also makes debugging easier. Keep in mind that some tasks, such as unzipping and decrypting, pretty much require file movement.

Using a Message Router to determine where messages should go improves maintainability because your routing logic isn't tied up in the listener processes. Instead, this logic is stored in a database or an XML file; the layout of this logic will be covered in a future article.
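
To make "lightweight message" concrete, here is one possible shape for such a message, serialized to XML. The actual message contents are defined in the next article, so every field here (`path`, `partner`, `size_bytes`) and the element names are assumptions made purely for illustration, and Python is used only to keep the sketch short.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FileMessage:
    # Hypothetical fields; the series defines the real message later.
    path: str
    partner: str
    size_bytes: int

    def to_xml(self):
        """Serialize the message to a small XML string."""
        root = ET.Element("FileMessage")
        ET.SubElement(root, "Path").text = self.path
        ET.SubElement(root, "Partner").text = self.partner
        ET.SubElement(root, "SizeBytes").text = str(self.size_bytes)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text):
        """Rebuild a message from the XML produced by to_xml."""
        root = ET.fromstring(text)
        return cls(
            path=root.findtext("Path"),
            partner=root.findtext("Partner"),
            size_bytes=int(root.findtext("SizeBytes")),
        )


message = FileMessage(r"\\fileserver\inbox\orders.zip", "acme", 250_000_000)
wire = message.to_xml()
restored = FileMessage.from_xml(wire)
```

The message describing a 250 MB file is itself only a few hundred bytes at most, which is the point: the queues carry descriptions of files, not the files.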
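
A minimal sketch of routing logic kept outside the code might look like the following. The real layout is covered later in the series, so the `<Routes>` and `<Route>` elements, the extension-based matching, the queue names, and the `unrouted` fallback are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule layout; the series covers the real one in a later article.
ROUTING_XML = """
<Routes>
  <Route extension=".zip" queue="unzip" />
  <Route extension=".pgp" queue="decrypt" />
  <Route extension=".txt" queue="parse" />
</Routes>
"""


def load_routes(xml_text):
    """Build an extension -> queue name table from the XML rules."""
    root = ET.fromstring(xml_text.strip())
    return {r.get("extension"): r.get("queue") for r in root.findall("Route")}


def destination_for(file_name, routes, default="unrouted"):
    """Return the queue for a file, or a default queue when no rule matches."""
    for extension, queue_name in routes.items():
        if file_name.lower().endswith(extension):
            return queue_name
    return default


routes = load_routes(ROUTING_XML)
```

For example, `destination_for("ORDERS.ZIP", routes)` returns `"unzip"`. Supporting a new file type then means adding a `<Route>` line rather than changing and redeploying the router.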

The heart of this architecture is really in the interaction between the Message Router and the task-specific queues. Using this type of technique allows you to easily plug in new task-specific processes to meet changing business requirements.

For example, assume that you're currently just importing flat files directly into the database. Chances are that you're detecting the new file, opening the file, parsing it, and inserting the data into the database. In that type of system, when a new requirement comes up (such as unzipping before parsing), you'll probably have to either rewrite your current parser to do the unzipping or modify some other code and redeploy the whole solution.

With the architecture shown above, you would simply have to modify the routing logic settings (which, critically, aren't hard-coded), create a new "Unzipping" task-specific queue, and write an unzipping process to consume messages from that queue. You would not have to change any code in your existing processes or redeploy them. In a corporate environment, this can save a lot of time, as there are often several steps involved with moving code changes into production.

Notes on the architecture

One more point about this system: after a task-specific process finishes working with a message, it almost always sends a message back to the processing queue. This allows the system to figure out on its own (via the routing logic) what needs to happen to the file next, instead of having the task-specific process figure out what to do.

In my own experience, this type of functionality has saved me more than once. One instance was when our partners were sending zip files that contained other zip files. (We expected a zip file to contain only normal flat files.) After our unzipping process decompresses an archive, it sends a message to the processing queue for each file that it extracted. In this case, the extracted files were zip files, and after being sent to the processing queue from the unzipping process, they were routed directly back to the unzipping process to be decompressed.

If the unzipping process itself had decided where to send its results, we would not have handled this situation correctly.
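
The nested-zip case can be reproduced in a small sketch. It assumes an in-memory `queue.Queue` in place of MSMQ, messages that are simply file paths, and a router that picks the unzipping task by the `.zip` extension; none of these details come from the actual system, and Python is used only for compactness.

```python
import queue
import tempfile
import zipfile
from pathlib import Path

processing_queue = queue.Queue()
parsed = []  # names of flat files that reached the (stand-in) parser


def unzip_task(path):
    """Extract an archive, then report every extracted file to the processing queue."""
    target = path.with_suffix("")  # upload.zip is extracted into upload/
    with zipfile.ZipFile(path) as archive:
        archive.extractall(target)
    for extracted in sorted(p for p in target.rglob("*") if p.is_file()):
        processing_queue.put(extracted)  # the task does not decide the next step


def run_router():
    """Route messages until the processing queue is empty."""
    while not processing_queue.empty():
        path = processing_queue.get()
        if path.suffix.lower() == ".zip":
            unzip_task(path)
        else:
            parsed.append(path.name)


with tempfile.TemporaryDirectory() as inbox:
    inbox = Path(inbox)
    inner = inbox / "batch.zip"
    with zipfile.ZipFile(inner, "w") as archive:
        archive.writestr("orders.txt", "1,widget")
        archive.writestr("customers.txt", "7,acme")
    outer = inbox / "upload.zip"
    with zipfile.ZipFile(outer, "w") as archive:
        archive.write(inner, arcname="batch.zip")
    inner.unlink()  # only the outer archive is "uploaded"

    processing_queue.put(outer)
    run_router()
```

Because `unzip_task` only reports what it extracted, the inner `batch.zip` goes back through the router and is unzipped in turn, without the unzipping code ever being written to expect nested archives.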

In the next article

In the next article, I will describe how the incoming listeners are implemented and explain what type of data will be contained in the messages flowing through the message queues.
