The architecture of a flexible .NET file processing system -- Part 2

The file listener components

In Part 1 of this series, Zach Smith described the overall architecture of a dynamic and scalable file import system. In Part 2, he gets into the details of designing a highly flexible and scalable file processing system using .NET Framework technologies and shows you how the file listener components are implemented.

This blog entry is also available as a TechRepublic download in PDF form.

Before we can get into unzipping components, parsing libraries, or inserting data into the database, we need to figure out how we will detect when a file has been dropped onto the system. This is a critical portion of our import process simply because every file sent to the system must pass through this module. If the listener components miss a file or simply stop working, our whole system is useless. Of course, on the flip side of that, if our file listener components are fast and stable, they will enable the rest of the system to be fast and stable as well.

The job of the listener components in our architecture is simple: Detect when files are dropped and send a message to the processing queue to signal that a new file has arrived.

Listener component design

Depending on your requirements, there could be many things to consider when designing your listener components. For example, if you need to detect files being uploaded to an FTP server, you will have different considerations than if you were detecting files being placed onto a mapped network drive.

For this reason, we need our listeners to be modular and not dependent on the rest of the system. Our goal here is to be able to come up with new listeners to satisfy changing business requirements without having to change any of our existing code base or processes. To accomplish this, our listeners should be built as stand-alone processes. Windows services are perfect for this type of task, but in theory, any stand-alone process could act as a listener.

The diagram in Figure A gives you an overview of how our listeners fit into our architecture:

Figure A

Listener components design

As you can see, each listener stands alone and is connected to the rest of the system only through the file processing messages it sends to the processing queue (a Microsoft Message Queuing, or MSMQ, queue). If you're unsure what the processing queue is for, or just need a refresher, please take a look at Part 1 of this series.

This arrangement allows the listener components to act independently of one another. For instance, there is nothing stopping the FTP listener from scanning 100 FTP servers in its quest to discover files to import. The only thing that the listeners really must do is detect files and send a file processing message to the processing queue.
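To make that contract concrete, here is a minimal C# sketch of a listener that depends on nothing but a "send" operation. Every type name in it (FileArrivedMessage, IProcessingQueue, InMemoryProcessingQueue, FtpListener) is hypothetical and not part of the original design, and the message is deliberately trimmed down. The in-memory queue is only a stand-in so the sketch runs without MSMQ installed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down message used only for this sketch.
public class FileArrivedMessage
{
    public string FileName { get; set; }
    public string OriginationType { get; set; }
}

// The only thing a listener needs from the rest of the system.
public interface IProcessingQueue
{
    void Send(FileArrivedMessage message);
}

// In-memory stand-in (an assumption of this sketch). An MSMQ-backed
// implementation would instead wrap System.Messaging.MessageQueue.Send.
public class InMemoryProcessingQueue : IProcessingQueue
{
    private readonly Queue<FileArrivedMessage> _messages = new Queue<FileArrivedMessage>();

    public int Count { get { return _messages.Count; } }

    public void Send(FileArrivedMessage message) { _messages.Enqueue(message); }

    public FileArrivedMessage Receive() { return _messages.Dequeue(); }
}

// Hypothetical listener: how it detects files is its own business.
public class FtpListener
{
    private readonly IProcessingQueue _queue;

    public FtpListener(IProcessingQueue queue) { _queue = queue; }

    // Called once for each file the listener discovers.
    public void OnFileDetected(string fileName)
    {
        FileArrivedMessage message = new FileArrivedMessage();
        message.FileName = fileName;
        message.OriginationType = "FTP";
        _queue.Send(message);
    }
}
```

Because the listener only sees the interface, a new listener type could be added, or the queue implementation swapped, without touching existing listeners.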

Listener component implementation

There are a few questions you need to ask before deciding on how to implement your listener components:

  1. What kind of locations will files be picked up from? FTP? HTTP? Web services? Mapped drives?
  2. What are the speed requirements of the system? Can files sit for a couple of minutes before being imported, or do they need to be picked up immediately?
  3. What type of volume is expected for the system?

Answering the questions above will get you started on what kind of listener components you need to create. Remember, you can have several different listeners feed into the same import system. If you find yourself building a listener that does multiple tasks (e.g., listens to FTP and mapped drives), ask yourself if it would make sense to split that single listener up into two services. Splitting up functionality can aid in the scalability of your system.

One thing to consider if you're developing a listener for a mapped or local drive is the FileSystemWatcher component provided by the .NET Framework. You can instruct this component to "watch" a directory or group of directories, and any time a file is created within those directories, the component will fire an event. You then catch that event and begin processing the file. This is much better than repeatedly looping through a group of directories looking for new files, because the listener reacts as soon as a file appears instead of waiting for the next scan.
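As a rough illustration, the following C# sketch wires up a FileSystemWatcher for a single drop directory. The DropFolderWatcher class and its Watch method are hypothetical names chosen for this example. The FileSystemWatcher members used (the constructor, NotifyFilter, Created, and EnableRaisingEvents) are part of System.IO.

```csharp
using System;
using System.IO;

public static class DropFolderWatcher
{
    // Starts watching 'directory' and calls 'onFileCreated' with the full
    // path of each new file. Dispose the returned watcher to stop.
    // The constructor throws ArgumentException if the directory does not exist.
    public static FileSystemWatcher Watch(string directory, Action<string> onFileCreated)
    {
        FileSystemWatcher watcher = new FileSystemWatcher(directory);

        // Report file name changes only, so directory creation is ignored.
        watcher.NotifyFilter = NotifyFilters.FileName;

        // One watcher covers one directory; set IncludeSubdirectories = true
        // to cover its subtree, or create one watcher per unrelated directory.
        watcher.Created += delegate(object sender, FileSystemEventArgs e)
        {
            onFileCreated(e.FullPath);
        };

        // No events are raised until this is switched on.
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

A listener service might call DropFolderWatcher.Watch when it starts and build its file processing message inside the callback. One caveat: the Created event can fire before the sender has finished writing the file, so a production listener would probably need to wait until the file can be opened before announcing it.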

The file processing message

Instead of passing complete files around the system, this architecture passes lightweight messages through a series of message queues, which decreases network traffic and load on the processing servers. These messages are used throughout the system, but it is the listener component that creates the file processing message and "kicks off" processing of a file by sending that message to the processing queue.

Obviously the file processing message will need to have certain information regarding the file so that the import system will know what to do with it. Below is a listing of file properties that may be useful to have in the processing message:

  • File name -- Simply the name of the file
  • File size -- The size of the file
  • Origination Directory -- The directory that the file was detected in
  • Origination Type -- The type of location the file was detected at (FTP, HTTP, local drive, etc.)
  • File extension -- This can be extracted from file name
  • Anything else you need to process the file

The following are other properties you should consider including in the file processing message:

  • Message ID (GUID) -- Identifies the message
  • Parent ID (GUID) -- Identifies the parent of this message, if any. This would be used for files that are unzipped from other files.
  • Children (Array) -- An array of file processing message objects that represents all files extracted from the file that this file processing message represents
  • Actions (Array) -- An array of actions that have happened to this file
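The properties listed above could be carried by a class along the following lines. This is only a sketch: the class and member names are assumptions, List is used where the text says "array", and the Actions property is left out here because the Action object is described separately below.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class FileProcessingMessage
{
    public Guid MessageId { get; set; }
    public Guid? ParentId { get; set; }            // null when there is no parent
    public string FileName { get; set; }
    public long FileSize { get; set; }             // in bytes
    public string OriginationDirectory { get; set; }
    public string OriginationType { get; set; }    // e.g. "FTP", "HTTP", "LocalDrive"
    public List<FileProcessingMessage> Children { get; set; }

    public FileProcessingMessage()
    {
        MessageId = Guid.NewGuid();
        Children = new List<FileProcessingMessage>();
    }

    // Derived from the file name rather than stored separately.
    public string FileExtension
    {
        get { return Path.GetExtension(FileName); }
    }

    // Builds a message from a file on disk. Reading FileInfo.Length
    // throws FileNotFoundException if the file does not exist.
    public static FileProcessingMessage FromFile(string path, string originationType)
    {
        FileInfo info = new FileInfo(path);
        FileProcessingMessage message = new FileProcessingMessage();
        message.FileName = info.Name;
        message.FileSize = info.Length;
        message.OriginationDirectory = info.DirectoryName;
        message.OriginationType = originationType;
        return message;
    }

    // Links a message for a file extracted from this one (e.g. by unzipping).
    public void AddChild(FileProcessingMessage child)
    {
        child.ParentId = MessageId;
        Children.Add(child);
    }
}
```

Keeping the parent ID on the child and the child list on the parent lets either message be traced to the other once they travel through the queues separately.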

It is important to remember that the file processing message will be passed around to different components of the system until processing on the file is complete. Because so many components handle it, the message should carry its own history. To that end, we will define an Actions property in the file processing message, which will be an array of Action objects, one for each step performed on the file.

Properties of the Action object:

  • Time Stamp (DateTime) -- Time the action was created
  • Server Name (String) -- Server that performed the action
  • Command Line (String) -- Program that performed the action
  • Action (String) -- The action that was performed

It is the job of each task-specific process to add an Action to the file processing message to indicate what was done with the file. When the file has been completely processed, or an error occurs, the file processing message should be sent to a logging queue, where it will be inserted into the database. The logged actions then serve as an audit trail for each file, showing which server and program handled it, when, and what was done.
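A task-specific process could stamp its work onto the message with a small helper like the one sketched here. The names are assumptions: the class is called FileAction to avoid clashing with the System.Action delegate, ActionLog.Record is a hypothetical helper, and the use of UTC for the time stamp is a choice made for this example, not something the design prescribes.

```csharp
using System;
using System.Collections.Generic;

public class FileAction
{
    public DateTime TimeStamp { get; set; }   // time the action was created
    public string ServerName { get; set; }    // server that performed the action
    public string CommandLine { get; set; }   // program that performed the action
    public string Action { get; set; }        // what was done
}

public static class ActionLog
{
    // Appends an entry describing what the current process did to the file.
    // 'actions' stands for the Actions collection on the file processing message.
    public static FileAction Record(IList<FileAction> actions, string description)
    {
        FileAction entry = new FileAction();
        entry.TimeStamp = DateTime.UtcNow;              // UTC is an assumption of this sketch
        entry.ServerName = Environment.MachineName;
        entry.CommandLine = Environment.CommandLine;
        entry.Action = description;
        actions.Add(entry);
        return entry;
    }
}
```

An unzip process, for example, might call ActionLog.Record(message.Actions, "Unzipped 3 files") just before passing the message on, so that the entries arrive at the logging queue in the order the work was done.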

Next, message routing

After the file processing message is created and sent to the processing queue, our message router will pick it up and determine where it needs to go. In Part 3, I will discuss how this process works and what to consider while designing it.