One of the big shifts since the beginning of the Internet era is that applications have moved to real-time data processing. Thanks to changes in education and in everyday business practice, many developers have never written anything in the batch processing paradigm. All the same, batch processing is still a valuable technique that has a place in a developer's toolkit.
What is batch processing?
Batch processing is when your code is designed to iterate over input that has accumulated since the last time it was run; it is also designed to be run on a regular basis. For example, a bank will record transaction data throughout the day, and then use batch processing at night to post the transactions to customers' accounts. This is in contrast to real-time processing, where the calculations happen as a direct response to each piece of input as it occurs.
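To make the bank example concrete, here is a minimal sketch in Python. The names `post_batch`, `balances`, and `pending` are hypothetical and chosen only for illustration; a real system would read the accumulated transactions from a database or a file rather than an in-memory list.

```python
from collections import defaultdict
from decimal import Decimal


def post_batch(balances, pending):
    """Post every transaction accumulated since the last run.

    balances: dict mapping account id -> Decimal balance (updated in place)
    pending:  list of (account_id, amount) tuples recorded during the day
    Returns the number of transactions posted.
    """
    # Net the day's activity per account first, so that each account
    # is updated once no matter how many transactions it had.
    totals = defaultdict(Decimal)
    for account_id, amount in pending:
        totals[account_id] += amount

    for account_id, delta in totals.items():
        balances[account_id] = balances.get(account_id, Decimal("0")) + delta

    posted = len(pending)
    pending.clear()  # the next run starts from an empty queue
    return posted
```

During the day, the user-facing code only appends a tuple to `pending`; the nightly run calls `post_batch(balances, pending)` once and does all of the account updates in a single pass.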
I first learned about batch processing in high school, when I was taught to write in COBOL. To this day, COBOL applications are still heavily used in many batch processing scenarios, including end-of-day processing for banks, insurance companies, and other similar big-data environments. But there is no reason for batch processing to be limited to mainframe-like applications; for example, a service that has a recurring billing component or perhaps a scheduled reporting system needs a batch processing system.
Advantages to using batch processing
Batch processing has a number of advantages over real-time processing, including the following:
- Transactions can be much faster since they do not have to be processed up front; instead, just enough data needs to be provided so the batch processing task can find the data it needs.
- If there is a period of the day when your company is less busy, your batches can be run during the slow time, leveraging otherwise idle infrastructure, and requiring less of it to handle peak load times.
- Batch processing lends itself very nicely to virtualized and cloud environments where resources can be scaled or transitioned to loads as needed; this can save serious amounts of money on hardware and power.
- Batch processing is great in situations where the application has unusual resource needs because the systems that interact with the users can be separated from the systems that do the processing; again, cloud and virtualized environments come to mind.
Why has it fallen out of favor?
If batch processing is so great, why is it slowly disappearing? Web development has a lot to do with it, but the changing face of business in general has a role to play as well. Here are the primary reasons:
- Web applications cannot start themselves; they need a trigger to start, which makes the idea of scheduling a batch non-intuitive.
- Web applications are expected to be deployable via file copy only, which nixes the idea of an OS-level scheduler.
- Web applications are supposed to perform small, discrete amounts of work and return results in a few seconds; one cannot easily use a user's action to trigger any kind of batch processing.
- Companies are doing business on a global basis, so there is little downtime and no real "overnight" anymore.
- Users expect instantaneous results and expect their transactions to be reflected immediately.
While I haven't programmed in COBOL in quite some time, I have written applications that used batch processing on a regular basis. When working on a Web application, I have tried two approaches.
The first path is to have a portion of the Web application perform the batch processing and to have a scheduled job retrieve that page as required. While this works, and it allows you to easily leverage the rest of your code base, I feel that there are too many moving parts in this scenario, especially when the batch can be a long-running task.
The other method I've tried is to write an entirely separate application that can be called from the system's scheduler. I like this approach much better, but it only works when the batch processing task can have access to the main application's logic (usually through a shared library) or does not need that access.
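As a rough illustration of the separate-application approach, here is a minimal Python sketch of a batch program that a system scheduler could call. Everything in it is an assumption made for the example: the record format (`customer_id,amount`, one per line), the function names, and `process_record`, which stands in for whatever logic you would share with the main application through a library.

```python
import argparse


def process_record(line):
    """Hypothetical stand-in for logic shared with the main application."""
    customer_id, amount = line.split(",")  # ValueError if not exactly 2 fields
    return customer_id.strip(), float(amount)  # ValueError if not a number


def run_batch(lines):
    """Process every non-blank line; return (processed, failed) counts."""
    processed = failed = 0
    for line in lines:
        if not line.strip():
            continue
        try:
            process_record(line)
            processed += 1
        except ValueError:
            # One bad record should not abort the rest of the batch.
            failed += 1
    return processed, failed


def main(argv=None):
    parser = argparse.ArgumentParser(description="Nightly batch job (sketch)")
    parser.add_argument("input_file", help="records accumulated since the last run")
    args = parser.parse_args(argv)
    with open(args.input_file, encoding="utf-8") as handle:
        processed, failed = run_batch(handle)
    print(f"processed={processed} failed={failed}")
    # A non-zero exit code lets the scheduler notice a problem.
    return 1 if failed else 0
```

To run it as a script, you would add the usual `if __name__ == "__main__": raise SystemExit(main())` guard at the bottom. On a Unix-like system, a cron entry along the lines of `0 2 * * * python3 /path/to/nightly_batch.py /path/to/records.csv` (both paths are placeholders) would then run it at 2:00 every night.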
There are other choices too; for instance, databases have the ability to schedule jobs, and if your processing can be fully contained within the database, that may be an excellent option for you. Some development frameworks and systems (like the OutSystems Agile Platform, which I have been using a lot lately) have the notion of batch tasks built in, which makes life easy. It's also possible to run your system continuously (as a daemon or a service) and use an application-level timer to trigger the processing as needed.
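The application-level timer option might look something like the following Python sketch. The class name `PeriodicBatch` and its methods are hypothetical, and this is only one of several ways to do it; it assumes the batch job is a plain callable and that running it on a background thread is acceptable for your application.

```python
import threading


class PeriodicBatch:
    """Run a job repeatedly on a background thread until stopped."""

    def __init__(self, interval_seconds, job):
        self._interval = interval_seconds
        self._job = job
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        """Ask the loop to end and wait for the current run to finish."""
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait() returns False when the interval elapses and True
        # once stop() has been called, so the loop ends promptly on stop.
        # Note: an exception raised by the job ends the loop in this sketch.
        while not self._stop.wait(self._interval):
            self._job()
```

A service would create something like `PeriodicBatch(3600, run_hourly_batch)` at startup (where `run_hourly_batch` is your own function), call `start()`, and call `stop()` during shutdown. Because the interval is measured from the end of one run to the start of the next, long-running jobs will drift; a scheduler that targets wall-clock times would need extra logic.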
However you go about it, though, you will need to make sure that your method can be scheduled the way you need it to be scheduled and that it works well with your deployment choices.
J.Ja
Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides; he has a contract with OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and articles; and he has a contract with OutSystems to write articles, sample code, etc.
Justin James is the Lead Architect for Conigent.