Tech & Work

Perl provides distributed processing punch

Although you may be well versed in using Perl to process forms and manipulate text, you may not think of it in terms of large-scale, distributed applications. This article will change your mind.


In addition to being a great tool for report processing and text filtering, Perl can be used for general-purpose programming. For instance, I used Perl to create a job server that allowed my company to reduce the processing time of several gigabytes of digital map data. Our original estimate to process the data was approximately one month. We processed all the data over a single weekend using 10 otherwise-idle Windows workstations.

The basics of parallel processing
The basic principle of parallel processing is simple: If a large task can be broken up into independent units of work, it can be distributed to several machines and processed at the same time. Parallel processing, if done properly, should yield nearly an n-times reduction in the processing time (where n is the number of machines used to process the task). For example, if you have 30 days' worth of work to do, and you properly distribute it to 10 machines, the task should be done in about three days.

The hardest part of parallel processing is deciding how to break up the large task. If you break it into units that are too small, your communications overhead becomes significant compared to the time to process the task. If you break the task into units that are too large, the gain from using multiple machines is lost. The ideal size results in the processing time for each subtask that is several orders of magnitude larger than the communications overhead but that also results in many subtasks per client.

Cut to the chase!
The job server is a single Perl script that can be run in different modes based on command line options. The tool will allow you to add or remove jobs, check the status of a queue, process jobs from the queue, or cause some or all of the machines processing a queue to exit. The server takes a message from the queue and passes it to the command shell. When the command finishes, the server checks the queue for another task.

You will find the code for the job server in Listing A. To use it, save the script to a file (I called mine jobserver.pl) and run it. This script requires the Cwd and Term modules, so if you do not already have them, you'll need to install them. Check with your nearest CPAN mirror to find these modules.

Each of the supported actions is in a separate function. There are also support functions to enqueue and dequeue a task, write to a shared log, return the current server status, and so on. Even if you do not use the job server itself, these support routines may be useful in your own projects.

The job server uses file base communication so that there is no need to install special communications drivers. Windows file sharing or any networked file sharing system will work.

Basic usage
Using the job server is simple. One or more machines act as servers to a queue. Jobs are added to the queue using the same tool. An individual machine may service more than one queue by loading the job server more than once, specifying a different queue name each time. A single machine may also act as more than one job server for the same queue, but this is generally useful only for testing. You can also use a single machine to add jobs to the queue as well as to service them.

Housekeeping
To save having to write Perl jobserver.pl every time you launch the job server program, it is handy to either create a batch file or compile the job server code. If you create a batch file, remember that you must "call" it from another batch file or it will never return.

From here on, we will assume that you have either created a batch file or compiled your script into an executable program.

Server setup
To have a machine service jobs from a queue, simply use the service command as follows:
jobserver service testque server1

The queue name is testque and the server name or process ID is server1. You can name the queue and the server anything that will help you identify them. Once you have a server up and going, add a job to the queue and watch it service the request.

Adding jobs
Adding a job to the queue is simple. Just launch jobserver and then tell it what queue you want to act on and the command to add:
jobserver add testque echo this is a test

In this case, we have added a simple echo command to the queue. The job server will run this command and then return to monitoring the queue. Since the server checks the queue only every five seconds, you may have a wait a few moments before it starts to process the job.

Results
Once processing is complete, you should have three files in the current directory: the testque file itself, a testque.log, and server1.out. The testque file contains the commands to be executed. The testque.log is a running log of everything that has happened to the queue. All servers for a given queue share the queue and log files.

Each server creates a unique output file. There should be a server output file named server1.out. This file contains whatever the command wrote on standard output. This file is overwritten with each command that is executed, so it shows a snapshot of the output from the last command executed to help with debugging if needed.

Killing the servers
To stop the server, you use the kill command. Kill can either stop all servers or a specific server. If you do not specify a process ID (server name), all servers will exit. If you specify a specific process ID, only that server will exit.

Monitoring servers and queues
The status command displays the current status of all active servers and a synopsis of the queue contents. For each server, the current task is shown, along with the start time of that task. If the server is not currently working on a job, ‘idle’ is shown. If there are more jobs in the queue than can reasonably be displayed, status shows the first and last five jobs and indicates how many were omitted.

To get a complete view of the queue, use the view command. View shows the entire contents of the queue in the order it will be executed. You may want to pipe the output through more or some other paging program.

Deleting jobs
You delete specific jobs from the queue using the del command. Just specify a queue name and the exact text of the job to be deleted. All jobs matching the specified text will be deleted if there is more than one.

Caveats
When using the job server, you need to keep a couple of things in mind. First, a server may hang while processing a job. You will have to check the log to see what job was being processed and add that job to the queue again.

Also, if the queue becomes very large, the process of removing a job from the queue can become substantial. In our case, we found that adding all 50,000 tasks at once made the queue processing too slow. We created a tool that watched the size of the queue file and added jobs only when the size was below about 10K.

Conclusion
Distributed processing provides a way to increase the processing power available for a given task. If properly segregated, a task can usually be completed in 1/n time, where n is the number of machines applied to the task. The job server presented here provides a simple mechanism for starting a job server, adding jobs to and removing jobs from the queue, and monitoring the queue and servers. Using the job server makes the housekeeping tasks of parallel processing very simple.

 

Editor's Picks