Although parallelism may be a new concept for many programmers, for some it is already part of their daily responsibilities.

Sun Microsystems has a long history with parallel environments, and we spoke to one of its engineers, Richard Smith, about how programmers can come to grips with the challenge of parallelism.

What are the challenges in programming in a parallel environment?

Clearly there are a number of challenges, because much of the software industry is not familiar with writing multithreaded applications.

We’ve seen applications that ran effectively on a small number of cores but haven’t scaled as they moved to boxes with a very large number of cores.

Serial bottlenecks and the use of synchronisation primitives — all of these can reduce the performance of an application when you move it to some of the very large parallel environments now starting to come out in the marketplace.
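A minimal Java sketch of how a synchronisation primitive can become the kind of serial bottleneck Smith describes; the class and method names here are illustrative, not from any real application:

```java
// Illustrative sketch: a single coarse-grained lock serialises otherwise
// parallel work. Every thread must enter the synchronized block one at a
// time, so adding cores adds contention rather than throughput.
public class CoarseLock {
    private final Object lock = new Object();
    private long total = 0;

    public void record(long value) {
        synchronized (lock) { // all threads queue up here, one at a time
            total += value;
        }
    }

    public long total() {
        synchronized (lock) {
            return total;
        }
    }
}
```

A profiler such as the Sun Studio performance analyser surfaces this as threads spending their time blocked on the lock rather than doing useful work.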

Because I run Solaris on my laptop and I spend most of my time analysing performance problems, I’m familiar with a lot of the tools that are available to assist in these sorts of analyses.

But of course it’s not a particularly scalable approach for me to solve all the performance problems that are out there — we need to find ways of improving people’s skills, knowledge and familiarity with the tools. That’s just something we need to keep working at.

I have spent a fair amount of time investigating problems where somebody has tried one of our new boxes for the first time and has been disappointed. But it’s not always the box’s fault, because what often happens is that multiple changes are made at the same time.

People suspect it might be the architecture of the box, but they don’t really know. When many things change at once, the cause could be any of a variety of factors, including software version changes or a failure to set basic tuning parameters suitable for the new platform.

Many applications suffer from fairly weak or non-existent service-level instrumentation, which complicates the picture when you have experienced relatively poor end-to-end performance and need to drill down to find where the bottleneck lies.

One of the tools that I make use of is Sun Studio 12, which is our collection of native compilers, and in particular I make use of the performance analyser. Using that, I can get a feel for where applications spend most of their time, identify serial bottlenecks and look at the use of synchronisation primitives. In the case of Java you sometimes need to pay attention to garbage collection, because garbage collection can be a serial bottleneck.

Unfortunately some ISV applications mandate the use of older JVMs, and these typically don’t scale as well on new hardware.

As an R&D company we spend a lot of money on research and development, and as part of that investment we’ve been looking at the use of transactional memory as a future technology that may help in improving the scalability of applications. There’s a prototype implementation that we’ve made available [so] that people can play with an API, which would allow you to look at how you could use transactional memory. At some point in the future we will have a hardware implementation that will support the API.

How does one start to deal with synchronisation in massively parallel environments?

It’s probably best tackled at the design stage, when you are thinking about the problem you are trying to solve. There are certain heuristics you can use for how you could decompose a problem into pieces that you can solve individually, and then combine the solutions from those pieces.

Some of the well-known techniques involve the use of loops and distributing portions of the loop across N processors — other times decomposition is impossible.
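That loop-distribution technique can be sketched in Java; `ParallelSum` and its parameters are hypothetical, assuming a simple associative reduction that can be split into ranges and recombined:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: decompose a summation loop into N independent ranges,
// solve each range on its own thread, then combine the partial results.
public class ParallelSum {
    public static long sum(long[] data, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            int chunk = (data.length + nThreads - 1) / nThreads;
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int lo = t * chunk;
                final int hi = Math.min(data.length, lo + chunk);
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = lo; i < hi; i++) s += data[i]; // private partial sum
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) total += part.get(); // combine the pieces
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```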

It’s certainly not something that can be readily dealt with by a “just follow this recipe” approach. Know that there is a range of strategies, and people are more likely to be successful if they’re familiar with some of the paradigms that are available and consider each of them to see if it makes sense for their problem.

So at the moment that makes debugging more of an art than a science?

I’d like to think that art and science aren’t very far apart, and there’s a bit of both involved. Certainly you need some creativity to think of possible solutions, and you need some science to be able to make sensible predictions about what might happen should you choose one of the possible alternatives.

Is a lot of the problem just getting developers’ heads around what they are doing?

Yes, it is true. If people have only ever written serial code then they will not be familiar with some of the problems that occur when you parallelise an application, such as serial bottlenecks and how you can unwittingly create one by using synchronisation. Worse still are the subtler problems such as data races or deadlocks, because you may not notice them during testing; they may occur rarely, but when they do occur they can be diabolical.
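The lost-update race Smith warns about can be sketched in Java. A plain `count++` on a shared field compiles to a read, an add and a write, so concurrent increments can overwrite one another; the hypothetical `SafeCounter` below avoids that with an atomic increment:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: a shared counter hit by many threads. With a plain
// `long` field, `count++` is not atomic and updates can be silently lost;
// AtomicLong makes each increment a single atomic operation.
public class SafeCounter {
    private final AtomicLong count = new AtomicLong();

    public void increment() {
        count.incrementAndGet(); // atomic read-modify-write
    }

    public static long run(int nThreads, int perThread) throws InterruptedException {
        SafeCounter counter = new SafeCounter();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) counter.increment();
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join(); // wait for all increments to finish
        return counter.count.get();
    }
}
```

With a racy counter the returned total could fall anywhere below the expected value, and only under load; the atomic version is deterministic.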

Is another problem you see that developers who come from serial programming do not properly understand threading?

My experience has been that a large number of customers are not writing their own applications; they are using off-the-shelf applications, and to them it is sometimes all a black box. It performs the way it does, and if they’re not happy with it they want to know why, or what they can do about it. The people we perhaps need to reach are the ISVs, who are responsible for much of the code being used out there.

Having said that, though, we know that there are companies investing in developing applications around frameworks that are parallel anyway, such as J2EE.

One of my concerns has been customers who are not happy with the performance they are getting, yet have hardware resources sitting around idle. So one challenge is to find ways of making use of all available hardware to make a transaction go as fast as possible, rather than have the hardware sitting idle.

I’ve seen applications where a transaction may take 20 seconds, which isn’t fast enough for a particular purpose, but if it is single-threaded it’s not going to be able to take advantage of more than one core.

The challenge for some of these applications is to get developers to use concepts like thread pools, to make it easy for them to take advantage of additional hardware and get their applications to scale automatically as more hardware becomes available.
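A hypothetical sketch of that thread-pool idea in Java: sizing the pool from `Runtime.availableProcessors()` means the same code uses more cores automatically on a bigger box. The `AutoScalingPool` name and the stand-in workload are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: a pool sized to the hardware, so independent pieces of
// work spread across however many cores the machine happens to have.
public class AutoScalingPool {
    public static List<String> process(List<String> requests) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String r : requests) {
                tasks.add(() -> r.toUpperCase()); // stand-in for real per-request work
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // invokeAll preserves submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```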

Is the challenge you describe something frameworks will increasingly help developers with, rather than developers trying to do it all themselves?

I can see that there is a role for infrastructure frameworks and OSes to assist, but I do feel that there is still something inherent that will be required for years to come on the part of an application developer to break their problem down into component pieces.

The mechanics of parallelising those individual pieces can then be made much simpler by having the framework deal with them.

How much of the knowledge built up in dealing with multi-processor environments can be transferred to a multi-core environment?

I think any time you understand the general principles of contention and sharing of resources, it stands you in good stead. You’d be on the lookout for serial bottlenecks, and some of the tools help you drill down to identify where a bottleneck is.

If you were running an application and running experiments in which you gradually increased the number of threads, and yet your throughput wasn’t going up to match, that would suggest you’ve got a serial bottleneck or a saturated hardware resource.

So if you want to improve throughput you need to identify the bottleneck or resource and apply the standard heuristics: use less of it, add more of that resource, or eliminate the bottleneck in the first place.
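One standard way to make that kind of prediction quantitative, not mentioned in the interview but widely used, is Amdahl's law: if a fraction s of the work is serial, n threads can speed it up by at most 1 / (s + (1 - s) / n).

```java
// Illustrative sketch: Amdahl's law as a quick sanity check on scaling
// experiments. A serial fraction of just 10% caps speedup below 10x no
// matter how many threads you add, which is why throughput can stop
// tracking the thread count long before the hardware is exhausted.
public class Amdahl {
    public static double maxSpeedup(double serialFraction, int threads) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
    }
}
```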

Is there any light on the horizon for solving serial bottlenecks or at least improving the situation?

Transactional memory is one of the shining lights people hold a lot of hope for. With transactional memory you do not use traditional locks; you rely on the underlying hardware to deal with the situation where two threads might be trying to manipulate a common resource concurrently. The idea is that all of one thread’s changes are made atomically before the other thread’s take effect.
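The prototype API Smith mentions is not shown here, but the optimistic "commit atomically or retry" idea behind transactional memory can be imitated in today's Java with a compare-and-set loop; `OptimisticAccount` is a hypothetical example:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: instead of taking a lock, read a snapshot, compute the
// new state, and commit it atomically with compareAndSet. If another thread
// committed first, the CAS fails and we retry against the fresh value, the
// same optimistic scheme that transactional memory generalises.
public class OptimisticAccount {
    private final AtomicLong balance = new AtomicLong();

    public void deposit(long amount) {
        long current;
        long updated;
        do {
            current = balance.get();     // read a snapshot
            updated = current + amount;  // compute the new state
        } while (!balance.compareAndSet(current, updated)); // commit, or retry
    }

    public long balance() {
        return balance.get();
    }
}
```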

There are limits to what the current state of the art can deal with, but nevertheless there are many common situations where that would suffice.

When we see this technology appear in silicon then there shouldn’t be a performance penalty for using it.

It’s still an ongoing research topic; people are investigating what you can do with transactional memory.