Batch processing's value to Web developers

Although batch processing isn't used much anymore, Justin James says it's still a valuable technique that has a place in any Web developer's toolkit.

One of the big shifts since the beginning of the Internet era is that applications have moved to real-time data processing. Thanks to changes in education and in what developers encounter in the business world, many have never written anything in the batch processing paradigm. All the same, batch processing is still a valuable technique that has a place in a developer's toolkit.

What is batch processing?

Batch processing is when your code is designed to iterate over input that has accumulated since the last time it was run; it is also designed to be run on a regular basis. For example, a bank will record transaction data throughout the day, and then use batch processing at night to post the transactions to customers' accounts. This is in contrast to real-time processing, where the calculations are a direct response to the input as it occurs.
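
As a minimal sketch of the bank example, assuming an in-memory list stands in for the day's recorded transactions: the `Transaction` class, the `post_batch` function, and the account names are hypothetical and exist only for illustration. A real system would read from and write to durable storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    account_id: str
    amount: Decimal  # positive for deposits, negative for withdrawals


def post_batch(balances, pending):
    """Post every accumulated transaction to its account, then clear the queue.

    Returns the number of transactions posted in this run.
    """
    posted = 0
    for txn in pending:
        balances[txn.account_id] += txn.amount
        posted += 1
    pending.clear()  # the next run only sees input recorded after this one
    return posted


# Transactions are merely recorded during the day...
balances = defaultdict(Decimal, {"alice": Decimal("100.00")})
pending = [
    Transaction("alice", Decimal("-25.50")),
    Transaction("bob", Decimal("40.00")),
]

# ...and posted in one pass when the scheduled job runs.
count = post_batch(balances, pending)
```

Running `post_batch` a second time with nothing new recorded posts nothing, which is the property that makes the job safe to run on a fixed schedule.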

I first learned about batch processing in high school, when I was taught to write in COBOL. To this day, COBOL applications are still heavily used in many batch processing scenarios, including end-of-day processing for banks, insurance companies, and other similar big-data environments. But there is no reason for batch processing to be limited to mainframe-like applications; for example, a service that has a recurring billing component or perhaps a scheduled reporting system needs a batch processing system.

Advantages to using batch processing

Batch processing has a number of advantages over real-time processing, including the following:

  • Transactions can be much faster since they do not have to be processed up front; instead, only enough data needs to be recorded for the batch processing task to find what it needs later.
  • If there is a period of the day when your company is less busy, your batches can be run during the slow time, leveraging otherwise idle infrastructure, and requiring less of it to handle peak load times.
  • Batch processing lends itself very nicely to virtualized and cloud environments where resources can be scaled or transitioned to loads as needed; this can save serious amounts of money on hardware and power.
  • Batch processing is great in situations where the application has unusual resource needs because the systems that interact with the users can be separated from the systems that do the processing; again, cloud and virtualized environments come to mind.

Why has it fallen out of favor?

If batch processing is so great, why is it slowly disappearing? Web development has a lot to do with it, but the changing face of business in general has a role to play as well. Here are the primary reasons:

  • Web applications cannot start themselves; they need a trigger to start, which makes the idea of scheduling a batch non-intuitive.
  • Web applications are expected to be deployable via file copy only, which nixes the idea of an OS-level scheduler.
  • Web applications are supposed to perform small, discrete amounts of work and return results in a few seconds; one cannot easily use a user's action to trigger any kind of batch processing.
  • Companies are doing business on a global basis, so there is no longer an "overnight" and there is little downtime.
  • Users expect instantaneous results or transactions to be reflected immediately.

Various approaches

While I haven't programmed in COBOL in quite some time, I have written applications that used batch processing on a regular basis. When working on a Web application, I have tried two approaches.

The first path is to have a portion of the Web application perform the batch processing and to have a scheduled job retrieve that page as required. While this works, and it allows you to easily leverage the rest of your code base, I feel like there are too many moving parts in this scenario, especially when it can be a long-running task.

The other method I've tried is to write an entirely separate application that can be called from the system's scheduler. I like this approach much better, but it only works when the batch processing task can have access to the main application's logic (usually through a shared library) or does not need that access.
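
A rough sketch of what such a standalone batch application might look like, under stated assumptions: `bill_customer` is a hypothetical stand-in for logic that would really be imported from the shared library, and the sample customer records are invented for illustration. The point it shows is that one bad record is logged and skipped, and the run reports failure to the scheduler through its exit status.

```python
import logging

log = logging.getLogger("nightly_batch")


def bill_customer(customer):
    """Hypothetical stand-in for logic imported from the shared library."""
    if customer["balance_due"] < 0:
        raise ValueError(f"negative balance for {customer['id']}")
    return customer["balance_due"]


def run_batch(customers):
    """Process every customer; keep going after a failure and count the errors."""
    billed, failed = 0, 0
    for customer in customers:
        try:
            bill_customer(customer)
            billed += 1
        except Exception:
            log.exception("billing failed for %s", customer["id"])
            failed += 1
    return billed, failed


def main(customers):
    logging.basicConfig(level=logging.INFO)
    billed, failed = run_batch(customers)
    log.info("billed=%d failed=%d", billed, failed)
    # A nonzero exit status lets the scheduler flag the run as failed.
    return 1 if failed else 0


# Sample data for illustration; a real job would load its own input.
sample = [{"id": "c1", "balance_due": 20}, {"id": "c2", "balance_due": -5}]
exit_code = main(sample)  # a real entry point would pass this to sys.exit()
```

On a Unix-like system, a crontab entry such as `0 2 * * * /usr/bin/python3 /path/to/run_batch.py` (the script path is a placeholder) would run a script like this at 2:00 every night.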

There are other choices too; for instance, databases have the ability to schedule jobs, and if your processing can be fully contained within the database, that may be an excellent option for you. Some development frameworks and systems (like the OutSystems Agile Platform, which I have been using a lot lately) have the notion of batch tasks built in, which makes life easy. It's also possible to run your system continuously (as a daemon or a service) and use an application-level timer to trigger the processing as needed.
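
The application-level timer could be sketched roughly as follows, assuming a long-running Python process. The `PeriodicBatch` class is a hypothetical name, and the short interval and the `runs` list are only there to make the demonstration quick; a real service would use a much longer interval and a real task.

```python
import threading
import time


class PeriodicBatch:
    """Run `task` every `interval` seconds on a background thread until stopped."""

    def __init__(self, interval, task):
        self.interval = interval
        self.task = task
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        # wait() returns False on timeout (time to run) and True once stop() is called.
        while not self._stopped.wait(self.interval):
            self.task()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped.set()
        self._thread.join()


# Demonstration: record each run, let the timer fire a few times, then shut down.
runs = []
batch = PeriodicBatch(0.05, lambda: runs.append(time.time()))
batch.start()
time.sleep(0.3)
batch.stop()
```

Waiting on an event instead of sleeping means `stop()` takes effect immediately, so the service does not have to sit out a full interval before shutting down.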

However you go about it, though, you will need to make sure that your method can be scheduled the way you need it to be scheduled and that it works well with your deployment choices.

J.Ja

Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides; he has a contract with OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and articles; and he has a contract with OutSystems to write articles, sample code, etc.

About

Justin James is the Lead Architect for Conigent.

19 comments
delaware_samurai

I still use batch processing. I work in federal govt and batch processing is used all the time; it is very much needed. There is no other way with very large files.

Saurondor

It is a very trivial thing to set up a batch monitor process at application startup. In Java (and I guess .NET has a similar implementation) all you need to do is set up a listener that will be invoked at startup and prior to handling the first user request. Said listener can start up a batch thread that runs in the background and handles one or more batch processes.

weber_christian

You make me feel younger (or less old)!! I am developing an aviation control system for a South American country and, of course, use a lot of real-time processing. But there are a lot of tasks which do not require user input and aren't time critical, like:

  • Daily checklist
  • Warning reports for future events (more than 24 hours ahead)
  • Database requests from other countries to consolidate flight information databases around the world

My core process is a C program running 24/7/365, feeding a MySQL database. Users have a web interface written in PHP. Batch processes are also written in PHP and cron-started. Error reporting goes to TXT logfiles, and critical errors result in e-mail and SMS messages, also cron-initiated. Batch processes are like a break in our (insane) all-instant way of life.

oldbaritone

A perfect batch application - running a nightly backup automatically, with little or no intervention required. Schedule it and just let it go; check the status the next morning and handle any errors.

shahzadafzal

Nice thought. I also use a separate process, totally independent of the web application, which performs the batch processing. Nice sharing.

nerds-central

I was once contracting for a large web grocer (who will remain nameless - but they are still in business). They started out with an architecture where everything was on demand and either web or message driven. Over the first few years of operation, most back-office processing went over to PL/SQL batch jobs and similar. The simple reason was the amount of data this approach could churn through. Batch processes have good reasons for being more efficient. Queries are better optimized, context switches are fewer, data is better localised in caches ... it goes on. COBOL batch processing is still a huge business and growing in some areas! Anyhow - thanks for the post. If you are interested in where COBOL has moved on to whilst you've been away - check out http://knol.google.com/k/alex-turner/micro-focus-managed-cobol/2246polgkyjfl/4

OSRDG

Web applications do batch processes in the background based on the business needs and application architecture. Nowadays we are working on cloud computing platforms, and most of the platforms support batch processing and scheduling.

Justin James

Where in your applications have you found a need for batch processing? And how did you implement it? J.Ja

Justin James

The majority of the developers that I have worked with would be fairly overwhelmed by the task. You are talking about multithreading, needing a way to inject new schedulers, working with the fairly abstract observer pattern (which is miserable to work with), and needing to create a whole scheduling system if you want anything even slightly complex. :( And you still aren't addressing things like batch failure handling (do you email someone? Write to an error log? Send a message via SNMP? etc.). Sure, the base idea is fairly simple, but it is a huge effort to produce a robust and reliable system. J.Ja

vyassh

In some web applications clients may upload data, for example payment instructions to a bank. You might have to validate the data before updating the database. We have a batch process to handle client data, while the client side takes care of encryption and upload.

BryanReyn

Needed to present FAX images in a browser. Initially, executed a conversion routine when the user selected a FAX to view. Moved that code out of the user application into a scheduled task - runs every 2 minutes. Good ol' COBOL - so many don't know how to write it.

george

I use batch processing quite regularly to link to legacy systems. Batch update from the legacy system into MySQL nightly. I then wrap a web interface around report creation to generate SQL reports per request. Batch processing can also be a good check and balance for daily real-time processing. Run the batch daily to validate the real-time numbers. I feel you should embrace the synergy of different techniques to build robust web apps.

sysop-dr

We do scientific and engineering calculations. Some of these are quite large and many need customised inputs. So we run our batches on a compute cluster and have a web front end to allow people to upload data and then queue the batch. They can then come in and check the status and, when it's done, collect their results.

gkar47

We use the O/S package Quartz and Spring integration to schedule batch processing. We have various daily and weekly clean up work that needs to be done. The above works nicely in that it is running from our web code base, and uses our web server in off peak times. We can use JMX via the app server to modify the scheduling, if needed.

jgrazuli

We process large quantities of records with customized criteria. We allow the user to kick off a batch process through a servlet using the Java Runtime and Process objects. We then monitor batch process updates to the database and report back on the status of step completion in real time.

Jim.Squires

We have many needs for batch processing. We get data files from various local governments (County assessor and treasurer data) as well as data keyed offshore that we receive in flat files. We used SQL Agent to call SSIS (SQL Server Integration Services) packages to read this data and update our database.

xmetal

I have a friend that has several clients with basic websites that list their products. They do not wish to pay extra for dynamic database access and the content is fairly static anyway, being listed in long tables on several specifically addressed (permanently named) pages. The client maintains their data in Excel, uploads that via FTP to a known location, then kicks off a process that does a transform to generate these static pages. (edit: nested this in the wrong place, why is there edit but no delete?)

four49

SSIS/DTS would be a great tool if it weren't so frickin unreliable! Who here has ever worked with this POS and not wanted to pull their hair out trying to resolve "meta-data" errors that pop up out of nowhere? And forget making a change; if you want to add a field to your SSIS import task, you're better off rebuilding the task altogether. Otherwise, you'll be in meta-data hell trying to get it to recognize the new field. SSIS would be a wonderful tool if only Microsoft would put some development effort into making it more robust. Given that a lot of people use this in mission-critical systems, it is unacceptable to have it be so error-prone.

vyassh

Normally you need not provide delete in such batches, because the batch wipes the earlier contents and puts new ones in place irrespective of changes. Adding new pages requires the client to approach the developer.