Master the art of big data job scheduling

Scheduling is a growing factor in big data optimization. Learn how big data job scheduling differs from its transaction processing counterpart, and why it's a task IT must do well.


Almost every department and business function in the enterprise has a big data application. Consequently, the need to centralize and harness big data assets has led more organizations to move big data responsibilities and assets to the corporate data center. This is a departure from many initial big data deployments that were characterized by distributed pockets of big data in departments throughout the business.

The current movement toward centralizing big data in the data center is predicated on the hope that IT can manage these big data assets, with end business users receiving maximum benefits. However, for asset optimization to work in a centralized scheme, job scheduling becomes a central concern -- and a task that IT must do well. Scheduling big data jobs is a multifaceted responsibility with technical, operational, and political aspects.

From the standpoint of a high performance computing (HPC) cluster, the goal of big data job scheduling is to process and complete as many jobs as possible. On the surface, this goal sounds similar to its transaction processing counterpart, but there are definite differences.

In big data and HPC, the cluster processing is done in parallel, and the technical goals are two-fold: (1) to have one "large" job running in the background while many shorter jobs are run (and completed) during the timeframe that the large job is running; and (2) to utilize upwards of 90% of the HPC CPU at all times, seldom encountering an idle moment.
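The two goals above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not how any particular HPC scheduler works: the `Job` class, the `simulate` function, the core counts, and the abstract time units are all hypothetical names and values chosen for illustration. It keeps one large job running in the background, greedily packs short jobs into the leftover cores, and reports the resulting utilization.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int      # cores the job occupies while it runs (hypothetical unit)
    duration: int   # run time in abstract time units


def simulate(total_cores, large_job, short_jobs):
    """Run one large job in the background and greedily start short jobs
    on whatever cores are left over.

    Returns (utilization, completed_short_jobs), both measured over the
    lifetime of the large job.
    """
    free = total_cores - large_job.cores
    pending = list(short_jobs)
    running = []  # (finish_time, cores) for each short job in flight
    busy_core_time = large_job.cores * large_job.duration
    completed = 0

    for t in range(large_job.duration):
        # Release the cores of short jobs that have finished by time t.
        still_running = []
        for finish, cores in running:
            if finish <= t:
                free += cores
                completed += 1
            else:
                still_running.append((finish, cores))
        running = still_running

        # Start every pending short job that fits in the free cores and
        # can finish before the large job ends.
        not_started = []
        for job in pending:
            if job.cores <= free and t + job.duration <= large_job.duration:
                free -= job.cores
                running.append((t + job.duration, job.cores))
                busy_core_time += job.cores * job.duration
            else:
                not_started.append(job)
        pending = not_started

    # Anything still in flight was only started because it finishes by the
    # time the large job ends, so it counts as completed.
    completed += len(running)
    utilization = busy_core_time / (total_cores * large_job.duration)
    return utilization, completed
```

For example, on a hypothetical 10-core cluster with a large job that holds 6 cores for 10 time units, three short jobs of 2 cores and 5 time units each bring utilization to 90%, and a fourth fills the cluster completely. Without the short jobs, the same cluster would sit at 60%.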

In traditional transaction processing, transactions must be processed in a serial fashion and not in parallel. The goal is to reduce each transaction's processing time so that as many transactions as possible can be processed in a given period. Since resource "wait" times can interrupt serial transaction processing, it is not unusual for CPU utilization to fall below 30% at different times of the day. Utilization this low would be unacceptable for any HPC process.

What does this mean for IT?

Different sets of metrics are needed for HPC/big data and for traditional transaction processing. Where return on investment (ROI) is concerned, transaction processing is often measured by how much revenue can be captured by processing transactions faster. With HPC, the ROI comes in the form of fully consuming the asset (i.e., high utilization).

Assuming that big data and HPC are optimized from the standpoint of job scheduling and throughput, the other half of the equation is assuring end business user satisfaction.

The primary metric that gauges user satisfaction is how quickly big data jobs are processed and delivered. But with many different big data jobs being scheduled and processed in parallel, IT must also spend time with these users to collaboratively decide which jobs run when, and at what priorities. This is the political aspect of big data job scheduling.

A good way to approach this is to meet with big data end users, ideally in a steering committee setting. The collective group of end user decision makers can review the types of big data jobs that the enterprise expects to run, and reach concurrence on what these jobs' relative business priorities should be. If end users and executive management buy off on this, and if there is a review process that occurs on an annual, semi-annual, or quarterly basis, IT can ensure that its big data job scheduling is synchronized with end business expectations.
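Once a steering committee has agreed on relative priorities, those decisions can be encoded directly in the scheduling queue. The following is a minimal sketch, assuming the committee's output is a simple ranking of job classes; the `PriorityJobQueue` class and the job class names are hypothetical and used only for illustration. Jobs are dispatched by agreed business priority, and jobs of equal priority run in submission order.

```python
import heapq
from itertools import count


class PriorityJobQueue:
    """Dispatch jobs by a business priority ranking (1 = most important).

    The ranking is assumed to come from the end user review process; it is
    plain data, so it can be replaced whenever priorities are reviewed.
    """

    def __init__(self, class_priority):
        self._priority = dict(class_priority)
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, job_name, job_class):
        rank = self._priority[job_class]  # KeyError for an unranked class
        heapq.heappush(self._heap, (rank, next(self._seq), job_name))

    def next_job(self):
        """Return the name of the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Keeping the ranking as data rather than hard-coding it in scheduling rules means the quarterly, semi-annual, or annual review only has to update one table.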

Watch for the "danger point"

The "danger point" for IT is when it gets preoccupied with data center operations (including the scheduling of jobs), and just makes its own scheduling rules. This approach can be very effective (and even preferred) in a traditional transaction processing environment, because everyone already knows that mission-critical transaction systems that run the business always come first. However, the approach doesn't work in a big data context, because big data (even though it can be processed and analyzed in near real time) is still fundamentally a "batch" process as it is processed in parallel through an HPC cluster.

Mastering the art of big data scheduling will grow in importance as IT better understands why big data job scheduling requires a fresh approach.