The Cell chip is the processor inside the PlayStation 3 and was one of the first proper multi-core chips to reach the market.
As the Cell has seven usable cores and some exotic memory features, it can offer more parallelism than other chips in the marketplace but it comes at the cost of ease of programming. We spoke to Michael Paolini, IBM Master Inventor, CTO Workload Optimized Systems, Strategic Growth Business Development, IBM Systems & Technology Group in an email interview to discuss the challenges faced by this difficult yet highly parallel architecture.
What techniques can developers use to take better advantage of parallel environments?
Generally speaking this is a combination of things. For the last few decades, we’ve worked hard to serialise many things that are actually parallel in nature, and now we must begin to reverse that. So the first step is recognising parallelism not just at the program or process level, but all the way down to the instruction level. This is accomplished best in my experience by concentrating on partitioning the data, and then partitioning the tasks that get applied to the data during its life cycle.
There are many areas of computer science where this is routine: data and task parallel programming has been with us almost since the beginning, and it has been in constant use anywhere performance was critical (graphics, video, streams, vector math, etc). However, this is a very different style of programming from the assumption of a single thread of control employed by many programmers. So there is actually a bifurcation of programming methodology here: those who learned the serial discipline for control purposes, and those who dealt with enormous data and used a divide and conquer strategy. Both have their places and uses, by the way, but the new reality of power and transistors, which is pushing parallelism and a rethinking of computer science, is going to force the control plane programmers out of their comfort zone and into parallel thinking if they want to keep growing in speed of execution.
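The "partition the data, then partition the tasks" advice above can be sketched in a few lines. This is a minimal illustration, not Cell code: the names `partition`, `scale_chunk` and `parallel_scale` are hypothetical, and Python threads are used only to show the structure (in CPython they will not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def scale_chunk(chunk, factor):
    """The task applied to one partition; it touches no shared state."""
    return [x * factor for x in chunk]


def parallel_scale(data, factor, workers=4):
    """Partition the data first, then apply the same task to every chunk."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(lambda c: scale_chunk(c, factor), chunks))
    return [x for chunk in results for x in chunk]
```

Because each chunk is independent, the number of workers can change without touching `scale_chunk`.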
Are there any differences between developing for a new multi-core environment compared to the traditional multiple CPU environment? What are the consequences of this?
Yes, new multi-cores such as Cell/BE have actually been architected from the ground up to be multi-core and many-core. For example, Cell/BE has an internal communications ring that provides more than 300GBps of bandwidth to the cores at the current clock speed. It also has mechanisms for reserving that bandwidth for cores, and communications mechanisms in place. In contrast, most of the traditional multiple core CPUs lack these kinds of mechanisms, and some of them even require communications to hit their front side bus to communicate between cores (which ties up a precious resource).
Additionally, newer architectures designed from the ground up to be multi-core and many-core take a further step forward in partitioning what runs where. For example, traditional processors have the OS running across all their cores. This introduces “jitter” in the programs, since the OS always wins the battle for cycles or resources against a user program. That jitter can cause inconsistent performance results from run to run, and make things very difficult to debug. By contrast, Cell/BE only runs the OS on the Power Processing Unit (PPU), and only user code runs on the Synergistic Processing Units (SPUs). That is to say, the additional cores are dedicated to running only user-mode applications, not the OS. This reduces complexity, helps reduce jitter, and gives much more deterministic results per run. In addition to great performance at a lower power footprint, many people find it much easier to debug as well. For both traditional and new, the same rules of thumb apply, by the way: reduce synchronisation and dependencies, reduce inter-core communications, partition the data. We find that all the things a programmer does to make new cores like Cell/BE go faster also speed up traditional processors, just not as much, since most of them were retrofitted for multi-core.
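One way to read the "reduce synchronisation" rule of thumb is to let every worker build a private partial result and merge the partials once at the end, instead of locking a shared structure on every update. The sketch below assumes that reading; `count_chunk` and `parallel_histogram` are hypothetical names, and Python threads stand in for cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Build a histogram that is private to this worker, so no lock is needed."""
    return Counter(chunk)


def parallel_histogram(data, workers=4):
    """Count items per chunk, then merge the partial counts in one place."""
    step = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the only point where results meet
    return total
```

The workers never communicate with each other; the single merge loop is the only dependency between them.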
Can developers conceptually think about many parallel instruction executions? Should developers need to?
Developers can and do think about many parallel instruction executions today. For proof, all one [needs to] do is look at any application where graphics or video are involved. Partitioning up the image for processing is routine and easy. Partitioning is done at an x and y level, or even at a z level where z is time. Similar examples can be found in databases, stream processing, clusters, search, almost anywhere you look.
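As a rough illustration of partitioning in x, y and z, the sketch below cuts each frame of a tiny greyscale "video" into rectangular tiles and processes every (frame, tile) pair independently. The functions `tiles`, `invert_tile` and `invert_video` are hypothetical, and pixel inversion is just a stand-in for real image work.

```python
from concurrent.futures import ThreadPoolExecutor


def tiles(width, height, tile_w, tile_h):
    """Yield (x0, y0, x1, y1) rectangles that cover the image exactly once."""
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            yield (x0, y0, min(x0 + tile_w, width), min(y0 + tile_h, height))


def invert_tile(frame, rect):
    """Process one tile; it reads only its own pixels."""
    x0, y0, x1, y1 = rect
    return [[255 - frame[y][x] for x in range(x0, x1)] for y in range(y0, y1)]


def invert_video(frames, tile_w, tile_h):
    """frames[z][y][x] holds 0..255 values; z is the time axis."""
    height, width = len(frames[0]), len(frames[0][0])
    work = [(z, rect) for z in range(len(frames))
            for rect in tiles(width, height, tile_w, tile_h)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda item: invert_tile(frames[item[0]], item[1]), work))
    out = [[[0] * width for _ in range(height)] for _ in frames]
    for (z, (x0, y0, x1, y1)), pixels in zip(work, results):
        for row, y in zip(pixels, range(y0, y1)):
            out[z][y][x0:x1] = row  # each tile writes a disjoint region
    return out
```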
Even non-developers routinely think in parallel instruction executions if we look at programs like spreadsheets. Conceptually to the user, those data columns are being acted on in parallel even if, under the covers currently, it is executing serially because of traditional processors. I think anyone who’s ever had the experience of going to get a coffee while they wait on a spreadsheet update would tell you they’d like the traditional processors to catch up to the concept.
Cell has traditionally been seen as a platform that is difficult to program for, why do you think this is?
Cell/BE emerged on the scene prior to multi-core/many-core really going mainstream. People didn’t realise that we had reached a point of diminishing returns on single-thread core performance, and that the whole world of processing was going to have to change thanks to the laws of physics. As a result of Cell/BE being first on the scene in a big way, people equated the difficulty of programming explicitly in parallel with Cell/BE. However, the Cell/BE is not harder to program than any other multi-core architecture, at least, so we are told. And I believe that because it was architected from the ground up for many-core rather than being retrofitted, Cell/BE mechanisms like the built-in communications, bandwidth reservation, user code only SPUs, and the like make it far easier to program and get significant performance than any other multi-core out there. That is to say, Cell/BE programmers hit higher percentages of peak operations for longer periods of time than pretty much anything out there. While I’ve never seen a good definition of what “programmable” is, I do think that sustained use of more of the hardware more often is a pretty good measure.
Perhaps also worth saying is that there are two other aspects that come into programming Cell. First is single instruction multiple data (SIMD). SIMD is in all modern processors in one form or another. Typically it is a subset of registers, and a subset of instructions, and to use it effectively you must partition and align the data in memory. When you do, the results are fast and efficient, so much so that Intel is now on version 4.1 of SSE (Streaming SIMD Extensions).
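The "partition and align the data" requirement can be pictured with a toy model. The sketch below is not real SIMD: it only mimics the shape of the work, assuming a hypothetical vector width of four lanes (`LANES`), padding the input to a whole number of vectors and then handling four values per step. `vector_add` stands in for a single SIMD instruction.

```python
LANES = 4  # assumed vector width, e.g. four 32-bit values in a 128-bit register


def pad_to_lanes(values, fill=0.0):
    """Pad a list so its length is a whole number of vectors."""
    remainder = len(values) % LANES
    if remainder:
        values = values + [fill] * (LANES - remainder)
    return values


def vector_add(a, b):
    """Stand-in for one SIMD add: every lane is handled in a single step."""
    return [x + y for x, y in zip(a, b)]


def simd_style_add(a, b):
    """Add two equal-length lists one vector (LANES values) at a time."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    a, b = pad_to_lanes(list(a)), pad_to_lanes(list(b))
    out = []
    for i in range(0, len(a), LANES):
        out.extend(vector_add(a[i:i + LANES], b[i:i + LANES]))
    return out[:n]  # drop the padding lanes again
```

On real hardware the padding and alignment are what allow each group of four to be loaded and processed by one instruction.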
With Cell/BE, pretty much all the registers and instructions are enabled for SIMD, not just a subset or separate set. This, I believe, makes SIMD easier to program on the Cell/BE than on any other architecture out there.
The second aspect of programming Cell is data locality. All developers deal with this whether they realise it or not, and usually at the most difficult and expensive part of the development cycle: performance tuning. They fight to pin things in cache memory, to make them contiguous, etc. And they are typically fighting automatic prediction mechanisms built into the hardware, many of which are very good by the way, but none of which is better than a programmer who knows his own code. This has led to larger and larger cache sizes, and the use of more and more transistors and power to duplicate what you have already paid for in main memory.
Cell/BE takes a different approach: we ask the programmer to tell us what they need, when they need it, and when they are done with it. They do this by handing lists to Cell/BE’s DMA unit. The lists support scatter/gather, which further aids SIMD by the way, and using them can be as simple as having the compiler do this for the programmer, or as manual as coding a put or get instruction. This also has the side effect of making the average Cell/BE program stall less than 1 per cent of the time, and removes some of the performance jitter that automatic prediction mechanisms introduce. It also makes Cell/BE a fraction of the size of other processors because it doesn’t have to keep growing the cache to make that performance tuning easier.
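The explicit "tell us what you need and when" model can be imitated in ordinary code. The sketch below is a simulation only and uses none of the Cell SDK: `dma_get`, `dma_put` and `dma_gather` are hypothetical helpers that copy between a "main memory" list and small local buffers, and `LOCAL_STORE_ITEMS` is an assumed buffer size. The loop uses double buffering (a technique not named in the interview) to show how the next block can be requested before the current one is computed.

```python
LOCAL_STORE_ITEMS = 4  # assumed capacity of one local buffer


def dma_get(main_memory, offset, size):
    """Simulated 'get': copy a block from main memory into a local buffer."""
    return main_memory[offset:offset + size]


def dma_put(main_memory, offset, local):
    """Simulated 'put': copy a local buffer back to main memory."""
    main_memory[offset:offset + len(local)] = local


def dma_gather(main_memory, dma_list):
    """Simulated list transfer: gather scattered (offset, size) blocks
    into one contiguous local buffer."""
    local = []
    for offset, size in dma_list:
        local.extend(main_memory[offset:offset + size])
    return local


def process_stream(main_memory, kernel):
    """Apply `kernel` to every item in place, one local buffer at a time."""
    n = len(main_memory)
    if n == 0:
        return
    pending = dma_get(main_memory, 0, LOCAL_STORE_ITEMS)
    for offset in range(0, n, LOCAL_STORE_ITEMS):
        current = pending
        following = offset + LOCAL_STORE_ITEMS
        if following < n:
            # Request the next block now; on real hardware this transfer
            # would overlap with the computation below.
            pending = dma_get(main_memory, following, LOCAL_STORE_ITEMS)
        dma_put(main_memory, offset, [kernel(x) for x in current])
```

In this model the program, not a cache predictor, decides which data is local at each moment, which is the property the answer above credits for the low stall rate.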
So Cell/BE is different and people find differences scary. Further, many programmers take existing code — code written without such strong discipline — and attempt to port it to Cell rather than rewriting it. That frequently leads to poor results.
Summary: Cell is not hard to program and not hard to port from. However, code written for platforms that do not require strong programmer discipline is difficult to port to Cell.
If a developer wishes to target Cell, what are some of the architectural quirks that they should be aware of?
Please see above.
To what extent do you think development tools (compilers, frameworks etc) assist the uptake of a new multi-core processor?
Some, but automatic never yields the best results. This is really a question of how much performance a developer is willing to give up for automatic. This is no longer just a developer discussion; it is now a datacentre discussion as well, given the power, floor space, and cooling issues. Recovering 10-20 per cent can mean a big deal to a datacentre.
Do you think it is possible to see a superior architecture ignored by the industry because it is too hard to program for, despite any performance benefits?
Anything is possible, but we think Cell is the right part at the right time, and it will be aided by the fact that it is as easy and cheap to get hold of as buying a PS3.
If compilers are able to effectively convert a legacy application into an efficient user of multiple cores, should developers need to take the number of cores into consideration?
Developers should assume a variable number of cores going forward — any time you assume and target a specific moment in time, you are making the migration to the next generation or platform harder and more expensive than it needs to be.
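A small sketch of what "assume a variable number of cores" might look like in practice: the worker count is queried at run time rather than hard-coded, and the chunk size is derived from it. `worker_count` and `parallel_sum` are hypothetical names; `os.cpu_count()` is the standard-library call that reports the number of logical CPUs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(requested=None):
    """Use what the machine reports, never a number baked into the code."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


def parallel_sum(data, workers=None):
    """Sum a list using however many workers are available."""
    workers = worker_count(workers)
    chunk = max(1, -(-len(data) // workers))  # ceiling division
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, pieces))
```

The same code then runs unchanged on a machine with two cores or with sixty-four; only the chunking adapts.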