
The most ignored programming story of 2007

Justin James reveals what he considers the "sleeper" programming story of 2007.

In my previous meta post, I promised to start the year with an article about the "sleeper" story of 2007 and describe why I think it's slipping under a lot of radars. The story is, of course, about multithreading and parallelism.

To the vast majority of programmers, this topic has never been interesting, and it never made sense for them to bother with it. After all, with only one CPU core available, multithreading is done through timeslicing at the OS level, which is extremely inefficient. Using more than one thread per process (or process per application) only makes sense in the most trivial of cases (one thread to act as a worker and another to monitor the "cancel" button; a sketch of that pattern follows below) or for systems that spawn children to service network requests, like a Web server or a database server. Most programmers can handle the "cancel" button variety with a bit of effort, and only a select few programmers work on those network server projects.

In the last few years, the number of CPU cores on mainstream equipment (first logical, with hyperthreading, and then physical as well) suddenly started climbing past one, thanks to the plummeting price of multiple-socket server motherboards and to dual-core (and now quad-core) CPUs hitting both the desktop and the server room at bargain-bin prices. Sadly, this has been a response to the fact that current CPU materials will set themselves on fire if clock speeds are pushed much past where they are today. Packing more CPU cores into a physical chip keeps Moore's Law on track without increasing the clock speed. That is why I predicted that multithreading would become very important: the speed of single-threaded execution is not going up nearly as fast as it did in the past.
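For the record, here is a minimal sketch of that "cancel" button variety in Java (the class name, the flag, and the unit-of-work method are my own invention for illustration, not from any particular framework). A background thread grinds through the work while the event thread can set a shared flag to stop it:

    import java.util.concurrent.atomic.AtomicBoolean;

    public class CancellableWorker {
        // Shared flag that the UI/event thread sets when "cancel" is clicked
        private final AtomicBoolean cancelled = new AtomicBoolean(false);

        public void start() {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < 1_000_000; i++) {
                    if (cancelled.get()) {
                        System.out.println("Cancelled at step " + i);
                        return;
                    }
                    doUnitOfWork(i);
                }
                System.out.println("Finished.");
            });
            worker.start();
        }

        // Called from the UI/event thread when the user clicks "cancel"
        public void cancel() {
            cancelled.set(true);
        }

        private void doUnitOfWork(int i) {
            // placeholder for one slice of the real job
        }
    }

The only shared state is the one flag, which is exactly why this variety is manageable with a bit of effort: there is nothing to coordinate beyond "keep going or stop."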

As it turns out, very few programmers seem to care about multithreading. Even those who have thought about the topic rarely see a need for it in their projects, which are typically written in Java, VB.NET, C#, or PHP. Most programmers are writing applications that perform "data processing" as opposed to applications that perform "computations." The difference is quite important.

Data processing is about dealing with data as an aggregate set (maybe drilling down to a handful of records out of many) and performing trivial calculations at the row level that become quite a task at the data set level. Data integrity and accuracy are more important than speed, although speed matters as well. You can usually spot a data processing application because it does nothing that a Microsoft Access or Microsoft Excel application could not do, just at a much larger scale. Data processing apps tend to be I/O bound, not CPU or RAM bound.
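As a concrete example in Java (the file name and record format below are made up, but the shape is typical): the per-row arithmetic is trivial, and the running time is dominated by reading the data.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class OrderTotals {
        public static void main(String[] args) throws IOException {
            double total = 0;
            long rows = 0;
            // Hypothetical input: one order per line, "customerId,amount"
            try (BufferedReader in = Files.newBufferedReader(Paths.get("orders.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split(",");
                    total += Double.parseDouble(fields[1]); // trivial row-level math
                    rows++;
                }
            }
            // Nearly all of the elapsed time was spent on I/O, not arithmetic
            System.out.printf("%d rows, total %.2f%n", rows, total);
        }
    }

Throwing more threads at a loop like this buys almost nothing; the disk, not the CPU, is the bottleneck.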

Computational applications are typically centered around a single, relatively free-form chunk of data, such as an image, movie, sound bite, or even a text document. When they are working hard, the CPU and RAM are stressed to the max, but the I/O system might not have a single byte going through it until all is said and done. Speed is usually more important than robustness; no one expects to be able to recover if the power goes out in the middle of a lengthy processing run.
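This is where extra cores actually pay off. A minimal sketch in Java, with the "image" reduced to a big array of invented grayscale values: the work splits cleanly into one chunk per core, and not a byte of I/O happens until the end.

    public class BrightnessFilter {
        public static void main(String[] args) throws InterruptedException {
            // Hypothetical "image": a large array of grayscale pixel values
            final int[] pixels = new int[50_000_000];

            int cores = Runtime.getRuntime().availableProcessors();
            Thread[] threads = new Thread[cores];
            int chunk = pixels.length / cores;

            for (int t = 0; t < cores; t++) {
                final int start = t * chunk;
                final int end = (t == cores - 1) ? pixels.length : start + chunk;
                threads[t] = new Thread(() -> {
                    // CPU-bound loop: pure arithmetic, no I/O at all
                    for (int i = start; i < end; i++) {
                        pixels[i] = Math.min(255, pixels[i] + 40);
                    }
                });
                threads[t].start();
            }
            for (Thread worker : threads) {
                worker.join();
            }
            System.out.println("Done on " + cores + " cores.");
        }
    }

On a quad-core machine, a loop like this can run close to four times faster than its single-threaded version, which is precisely the payoff that data processing code rarely sees.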

Since most of us are working on data processing-style applications, the multithreading that needs to occur happens either in the parent process (like the Web server that spawned the process our code runs in) or in another process working on our behalf (such as the thread in the database server that is processing our request). Most of the work that the code we write does is input validation and output formatting. We rarely even worry about record locking at the application level; as long as we lock the rows in the database, we figure it is highly unlikely that two people will modify the same record at the same time anyway. We don't even bother to find out what our application server does about locking the Session system so that two calls to the same page with different parameters do not cause chaos.
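To be fair, relying on the database is usually reasonable. Here is roughly what "locking the rows in the database" looks like from our code; a sketch using JDBC and the standard SELECT ... FOR UPDATE syntax (the connection string and the accounts table are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RowLockExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "user", "pass")) {
                conn.setAutoCommit(false);

                // FOR UPDATE takes a row lock inside the database, so a second
                // transaction touching the same row simply waits its turn.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT balance FROM accounts WHERE id = ? FOR UPDATE")) {
                    ps.setInt(1, 42);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            double balance = rs.getDouble("balance");
                            try (PreparedStatement upd = conn.prepareStatement(
                                    "UPDATE accounts SET balance = ? WHERE id = ?")) {
                                upd.setDouble(1, balance - 10.0);
                                upd.setInt(2, 42);
                                upd.executeUpdate();
                            }
                        }
                    }
                }
                conn.commit(); // releases the row lock
            }
        }
    }

All of the actual concurrency control happens inside the database server; our code just brackets it with a transaction.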

Most programmers don't care (and don't need to care) about multithreading and parallelism, because 99% of the parts of the application that need that kind of concurrency are written by either the application server vendor or the database vendor. Our work is not computationally intense enough to justify either the CPU overhead or the additional development effort.

Is there something wrong with this? Not really, as long as programmers are getting a paycheck and enjoying their work. However, it speaks volumes about what tasks programmers get paid to do and about the companies that pay them to do those tasks. I wonder why so many companies still cannot see past data processing as the only task worth handing to their computers.



Justin James is the Lead Architect for Conigent.
