
The architecture of a flexible .NET file processing system -- Part 4

The first three parts of this series concentrated on the overall architecture of the .NET file processing system and how the messages get created and routed. In this installment, Zach Smith explains the role of the task-specific processes and describes how to set up the system so that it is easily scalable.

Up to this point, the files have not been accessed in any way; they haven't been moved, parsed, or loaded into the database. The reason is that moving, parsing, and loading files are all tasks best handled by highly specialized processes. In other words, they are task-specific processes.

The diagram in Figure A shows how these processes fit into the architecture.

Figure A

Each process is responsible for only one task. For example, you should not have a single process that both unzips files and imports them into the database. These tasks should be split into separate processes because segregating responsibility simplifies scaling and helps with troubleshooting. If all of the logic were bundled into one process, a simple change in the unzipping logic would require a complete redeployment of the entire application.

Examples of task-specific processes

The task-specific processes contained within the system will vary depending on the system's requirements. For the sake of this article, I will assume the following task-specific processes are needed:

  • A mover process that moves files from the file drop location into the processing system directories.
  • An unzipping process that unzips files coming into the system.
  • A parsing process that parses the files and puts them into the database.
  • An archiving process that moves the files into an archive for auditing purposes.

With those processes in mind, here is how a file would flow through our system (for steps 1 through 3, refer to part 2 and part 3 of this article series; a sketch of the message that carries each file through these steps follows the list):

  1. The file gets uploaded to the FTP server.
  2. A file message that represents that file is sent to the processing queue from the incoming file listener.
  3. The router picks up the file message and determines what needs to happen to the file (in this case, the file needs to be moved).
  4. The router sends the file message to the mover process queue.
  5. The mover process picks up the file message, moves the file to an internal processing directory, and then posts a file message to the processing queue.
  6. The router picks up the file message, determines that the file needs to be unzipped, and sends a file message to the unzip queue.
  7. The unzip process picks up the file message from the unzip queue, unzips the file, and sends file messages for each extracted file to the processing queue.
  8. The router picks up those messages and determines that they are ready to import into the database. It then sends the file messages to the importing queue.
  9. The import process picks up the file messages and inserts the records into the database. After the import is complete, the import process sends a file message to the processing queue for each file that was imported.
  10. The router picks up the messages, determines that the files are ready to be archived, and sends the file messages to the archive queue.
  11. The archive process picks up the messages and archives the files.
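For reference, here is a minimal sketch of the message that travels through these queues. The real class is defined in parts 2 and 3 of this series; the member names used here are assumptions based on the "FileName" routing key and the "Action" entries described later in this article.

using System;
using System.Collections.Generic;

// A sketch of the message that represents a file in the queues. The real
// definition comes from parts 2 and 3; these member names are assumptions.
[Serializable]
public class FileMessage
{
    // Current location of the file on disk; updated as the file is moved.
    public string FileName;

    // A history of what has been done to the file ("Moved", "Unzipped",
    // "Imported", ...). The router inspects this to choose the next queue.
    public List<string> Actions = new List<string>();
}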

A file should be able to go through these steps in 2-5 seconds. This isn't the fastest way to import a file, but the flexibility and scalability you gain from this architecture more than make up for any performance concerns. Also, if you need to import files in less than five seconds, you should probably look at implementing the system via Web services.

Task-specific processing

Each process is designed for a specific task, yet all processes share a few common operations. The following operations are required for the system to work as a whole:

  • They will pull messages from their respective task-specific queues.
  • They will add an "Action" to the file message to indicate what has been done to the file.
  • After they process the file, they send a message back to the processing queue to indicate that the file is available for further processing, at which point the router will pick the message up and route it to the correct location. The only exception to this would be the end of processing for a file (in this article, it would be the archiving process). After the file is archived, no more processing needs to be done, so there is no reason to send the file message into the processing queue.

Because the processes share these common operations, it may be useful to develop a base library for all of your task-specific processes. This allows you to rapidly develop new processes without reimplementing common operations such as pulling a message from a queue, and it gives you one code base to maintain.
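As a rough illustration only, and assuming MSMQ (System.Messaging) and the FileMessage type sketched earlier, such a base class might look like this (the queue paths are placeholders):

using System.Messaging;

// A rough sketch of a base class for task-specific processes, assuming
// MSMQ via System.Messaging and the FileMessage type sketched earlier.
public abstract class TaskProcessBase
{
    private readonly MessageQueue taskQueue;       // this process's own queue
    private readonly MessageQueue processingQueue; // the router's input queue

    protected TaskProcessBase(string taskQueuePath, string processingQueuePath)
    {
        taskQueue = new MessageQueue(taskQueuePath);
        processingQueue = new MessageQueue(processingQueuePath);

        // Tell MSMQ how to serialize and deserialize the message body.
        taskQueue.Formatter = new XmlMessageFormatter(new[] { typeof(FileMessage) });
        processingQueue.Formatter = new XmlMessageFormatter(new[] { typeof(FileMessage) });
    }

    // Each concrete process implements exactly one task and returns the
    // "Action" it performed (for example, "Moved" or "Unzipped").
    protected abstract string ProcessFile(FileMessage message);

    // The archiver overrides this to return true; it ends the pipeline.
    protected virtual bool IsTerminal
    {
        get { return false; }
    }

    public void Run()
    {
        while (true)
        {
            // 1. Pull the next message from this process's queue (blocks).
            FileMessage message = (FileMessage)taskQueue.Receive().Body;

            // 2. Perform the single task and record what was done.
            message.Actions.Add(ProcessFile(message));

            // 3. Hand the message back to the router for further routing,
            //    unless this process is the last step in the pipeline.
            if (!IsTerminal)
            {
                processingQueue.Send(message);
            }
        }
    }
}

Each concrete process (mover, unzipper, importer, archiver) would then override ProcessFile with its single task, and the archiver would mark itself terminal.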

When developing these processes, you should abide by the following rules:

  1. Processes should only perform one task each.
  2. Processes should be as lightweight as possible to improve total system performance. If you have one task that is highly complex, consider breaking it down into smaller tasks.
  3. Processes should be optimized for the task for which they are responsible. A good example is importing records into the database: instead of inserting records one at a time, consider batching or threading the operations so that multiple records are inserted at once (see the sketch after this list). This type of optimization can greatly increase the performance of the system.
  4. Processes should never send a file message to any queue other than the processing queue. It is the router's job to determine where messages should be routed; if you send messages directly to another queue from a task-specific process, you're short-circuiting the system.
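On rule 3: one common way to batch inserts in ADO.NET is SqlBulkCopy. Here is a minimal sketch, assuming the parser has already loaded a file's records into a DataTable; the connection string and destination table name are placeholders.

using System.Data;
using System.Data.SqlClient;

public static class Importer
{
    // A minimal sketch of batched importing with SqlBulkCopy, as one
    // alternative to inserting records one at a time. The connection
    // string and destination table name below are placeholders.
    public static void BulkImport(DataTable records)
    {
        using (SqlConnection connection = new SqlConnection(
            "Data Source=.;Initial Catalog=FileData;Integrated Security=true"))
        {
            connection.Open();

            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "ImportedRecords"; // placeholder
                bulkCopy.BatchSize = 5000; // tune to your record volume
                bulkCopy.WriteToServer(records);
            }
        }
    }
}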

If you follow this approach, you will end up with a very flexible system for processing files. The architecture lets you place task-specific processes on any number of servers, which means you can scale the system based on volume requirements.

One of the most interesting features of this architecture is that task-specific processes can be added and removed without affecting the rest of the system. For instance, if business requirements change and you must now decrypt incoming files before unzipping them, you can simply develop a task-specific process that decrypts the files. After that process is installed, the only other change you need to make is to instruct the router to route encrypted files to the decrypt process.
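For example, using the routing configuration format shown later in this article, the new route might look something like this (the ".enc" extension, queue path, and route name are assumptions):

<RoutingLogic>
   <Route Name="Decrypt">
      <Destination>server1\decryptqueue</Destination>
      <Keys>
         <Field Name="FileName" Value="^[a-z0-9]*\.zip\.enc$" />
      </Keys>
   </Route>
</RoutingLogic>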

Scaling the system

Scaling this architecture is very easy; the components are all independent and are not required to be on the same server. I suggest the following configuration for a typical environment based on the components listed above:

  • One server to run the listener component, the router component, and the archiving component.
  • One server to run the unzipping and moving components.
  • One server to run the parsing/importing component.

This type of configuration should handle well over 150,000 files a day. If your implementation requires extra computing power, you simply add another server.

It is the router's job to load-balance the messages that are sent to the various task-specific process queues. For example, if you are running three instances of your unzip component on three different servers, it is the router's responsibility to send one message to Server 1, the next to Server 2, and the next to Server 3. To do this, you would modify the router's configuration to look something like this:

<RoutingLogic>
   <Route Name="Unzip">
      <Destination>server1\unzipqueue</Destination>
      <Destination>server2\unzipqueue</Destination>
      <Destination>server3\unzipqueue</Destination>
      <Keys>
         <Field Name="FileName" Value="^[a-z0-9]*\.zip$" />
      </Keys>
   </Route>
</RoutingLogic>

This configuration allows your router to have several different destination queues for each type of file.
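How the router cycles through those destinations is an implementation detail; a minimal round-robin sketch (the class and member names are illustrative, not part of the series' actual router) might look like this:

// A minimal sketch of round-robin selection over a route's destination
// queues. The router would call NextDestination() once per message.
public class RoundRobinRoute
{
    private readonly string[] destinations;
    private int next; // index of the destination for the next message

    public RoundRobinRoute(string[] destinationPaths)
    {
        destinations = destinationPaths;
    }

    // Returns the next destination queue path, cycling 1, 2, 3, 1, 2, ...
    public string NextDestination()
    {
        string destination = destinations[next];
        next = (next + 1) % destinations.Length;
        return destination;
    }
}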

One last note about scaling: If you expect your volume to be dynamic and you feel that extra processing power may be required, put all of the components on a single server and image that server's drive. This will allow you to quickly deploy another server based on the image to handle load. Since all of the components will already be installed, you can simply enable the ones that are needed and leave the others disabled.
