The processor is at the heart of all computing. It takes any data that can be made digital and transforms it with maths and logic, as many times as we like. As Alan Turing pointed out in the 1930s, such a machine can automatically do anything we can describe in mathematics — and in the digital world, all problems are mathematical.

After Turing’s theoretical insight, it’s just been a matter of the appropriate engineering to make the process affordable, efficient and practical enough to see real-world uses. The history of IT is thus driven, at heart, by that engineering — and the future of IT will be written by how we can continue to improve the physics and design of manipulating data.

This is the first of two articles on the basic driving forces that will shape the near future of processors. It covers architecture — the way fundamental building blocks are combined to create processors; the second article will cover the physics behind them.

Architectures come in two sorts: von Neumann and the rest. The first has powered essentially all computing to date; the rest, which includes quantum computing, neural networks and an assortment of other schemes, are vying to provide the new ideas that will power computing well into the 21st century.

von Neumann

The von Neumann architecture is named after one of the founding fathers of digital computing, John von Neumann, who set down its principles in 1945 following work he did at the Los Alamos laboratory during the invention of the atomic bomb. It is at the heart of practically all computers that have been or are being built, including the one you’re reading this on.

The von Neumann architecture is thus extremely familiar. The core components are an arithmetic and logic unit, which does the actual work on data; storage, which contains data and instructions; input/output (IO), which connects the system to the outside world; and a control unit, which synchronises everything.

In the classic von Neumann architecture, instructions and data share the same memory bus, meaning the computer can’t access both at the same time. An evolution of this, the Harvard Architecture, has separate buses for instructions and data, removing that bottleneck. Most modern processors use a hybrid architecture, the Modified Harvard, which has a single external memory bus accessing mixed instructions and data, but a split internal cache that separates out the two and gives them independent, parallel paths to the rest of the processor. In all cases, the processor is general purpose — that is, it can run any computational task — and if the system runs faster, all programs run faster.
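
To make the shared-bus point concrete, here is a toy sketch in Python (purely illustrative, with a made-up three-instruction machine rather than any real instruction set) in which instruction fetches and data accesses compete for the same memory path; a Harvard design would give instruction fetches a path of their own.

```python
# A toy von Neumann machine (made up for illustration, not a real instruction set).
# Instructions and data live in the same memory, so every instruction fetch and
# every data access is a separate trip over the same shared bus.

memory = [
    ("LOAD", 8),    # address 0: copy the value at address 8 into the accumulator
    ("ADD", 9),     # address 1: add the value at address 9
    ("STORE", 10),  # address 2: write the accumulator back to address 10
    ("HALT", None), # address 3: stop
    None, None, None, None,
    5,              # address 8: data
    7,              # address 9: data
    0,              # address 10: result lands here
]

def run(memory):
    acc, pc, bus_trips = 0, 0, 0
    while True:
        op, addr = memory[pc]       # instruction fetch: one trip over the shared bus
        bus_trips += 1
        pc += 1
        if op == "HALT":
            break
        if op == "LOAD":
            acc = memory[addr]      # data read: another trip over the same bus
        elif op == "ADD":
            acc += memory[addr]
        elif op == "STORE":
            memory[addr] = acc      # data write: the bus yet again
        bus_trips += 1
    return acc, bus_trips

print(run(memory))   # (12, 7): a three-step sum costs seven bus transactions
```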

The age of clock speed

For a long time, the fundamental performance characteristic of this processor design was the clock speed. The control unit marshals the flow of data and instructions by setting up internal pathways between registers, memory and logic, waiting for them to settle, and then transferring data. The clock sets the speed at which this happens — too fast for the internal circuits and the processor malfunctions, to the chagrin of overclockers — and thus how quickly programs run.

Between the early 80s and the early 2000s, desktop processor clock speeds increased by around a thousandfold. As transistors on chips shrank according to Moore’s Law, which says that every eighteen months or so you can double the number on the same area of silicon, they could go faster. IO and memory bus speeds didn’t keep pace (the physics of moving data off-chip is far harsher than on-chip), so designers evolved ever more complex and capacious cache mechanisms and architectural tweaks to move as much information as possible in one go onto the chip, then swiftly process it there independently of external memory. Blocks of instructions were chopped up and fed into pipelines that could run at full speed on multiple instructions at once, while on-chip analysis of program pathways and data optimised use of the chip’s various sub-units and external memory access.
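
As a rough sanity check on that figure (my arithmetic, not anything claimed in the original):

```latex
\[ 2^{10} = 1024 \approx 1000
   \quad\Rightarrow\quad
   \tfrac{20\ \text{years}}{10\ \text{doublings}} \approx 2\ \text{years per doubling} \]
```

That is roughly one doubling every couple of years, in the same ballpark as the eighteen-month-to-two-year cadence usually quoted for Moore’s Law.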

This produced two decades of continuous general performance increase, driving huge amounts of cash into ever more subtle architectural enhancements, which marketing departments could stitch back into a tale of serene progress that was believable, if not always strictly true.

But since the early 2000s, mainstream clock speeds have barely doubled, primarily because of power problems. While Moore’s Law is widely seen as the basic description of performance scaling (the speed increase of smaller transistors is further boosted by the increased number you can pack in), another factor, Dennard Scaling, says that the power used by transistors is determined by the area of silicon they occupy. In other words, power use stays the same for a given physical chip size no matter how many transistors Moore’s Law gives you. For a number of reasons — to be covered in the second article in the series — this is no longer true: Dennard Scaling failed some time in the mid-2000s, with the result that faster chips became too hot to handle.
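
The textbook way to state Dennard’s observation (standard background, not spelled out in the article) is through the dynamic power of switching transistors:

```latex
% Dynamic switching power of a transistor:
\[ P \approx C\,V^{2}\,f \]
% Classic Dennard scaling: shrink feature sizes by a factor k, so that
\[ C \to C/k, \qquad V \to V/k, \qquad f \to kf
   \quad\Rightarrow\quad P \to P/k^{2} \]
```

The reduction in power per transistor exactly cancels the extra transistors that now fit into the same area, which is why power density used to stay constant; the second article will look at why the voltage part of that scaling stopped.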

The multi-core age

As a result, the industry has turned to multicore, using Moore’s Law to produce multiple more-or-less independent processor units on a single silicon die. Each processor runs at much the same clock speed and offers only incremental improvements in performance over its previous generation — there are simply more of them. The headline figures for overall performance continue to improve, but they are less and less indicative of speed increases in real-world desktop tasks. IO and memory limitations become even more severe, as there are lots of cores to be fed through the same bus. There is no general-purpose way to make computing tasks exploit multiple cores, and parallelisation introduces many new problems that complicate programming.
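
One standard way to quantify why extra cores are not a general-purpose speed-up is Amdahl’s law (textbook material rather than anything from the article): if a fraction p of a task can be parallelised across N cores, the best possible speed-up is

```latex
\[ S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}} \]
```

With p = 0.9, even an infinite number of cores tops out at a tenfold speed-up; the ten per cent of the program that stays serial ends up dominating.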

Multiple efforts are underway to extend the performance of von Neumann architectures further. IO and memory bandwidths can be increased by moving from electricity to light, with photonic interconnects promising much higher speeds over greater distances at lower power; 3D stacking, which brings memory and computation much closer together, is another promising approach. GPU architectures, creating seas of specialist processing units tightly bound to memory, have much potential for those computing tasks that can be split up into multiple independent jobs. The changing nature of computing — away from the desktop into centralised data centres running identical tasks for thousands or millions of users — gives the multicore model extra potency. Also, sophisticated on-chip power management can allow processors to better track the needs of those many computing tasks that are bursty, needing only intermittent or partial high performance.

One key aspect is heterogeneity: a move away from the von Neumann ideal of the central processor performing all tasks equally well. Instead, different tasks get specialist treatment. This process began in the 80s, when maths coprocessors started to be integrated onto CPUs, and has since expanded into areas such as streamed media manipulation, high-performance networking and cryptography. Most recently, multicore chips have seen overall control devolved to low-power marshalling units controlling matrices of specialist, higher-power cores.

A mobile world

Many of these factors have driven, and benefited from, the move to mobile, where raw clock speed is far removed from real-world usefulness. It’s quite remarkable how successfully the industry has moved on from the golden age of desktop PCs running a seemingly impregnable combination of a single instruction set and core software. If that were still the primary mode of computing, the end of clock scaling would have been catastrophic some time ago. The move to cloud and mobile, fed by the growth of ubiquitous high-bandwidth wired and wireless networking, has shifted the focus towards media consumption, efficient manipulation of very large data sets and extreme efficiency, and away from doing most of the work on the device in front of you.

It’s a measure of the basic magic in the von Neumann architecture that it remains capable of significant evolution through adaptation, which is expected to continue until the end of Moore’s Law (around a decade away, by some estimates), a full eighty years after John von Neumann described it.

Integrated processor and memory

One of the inescapable bottlenecks in von Neumann architecture is the movement of data from memory to CPU and back to memory again, which takes power and time. An obvious solution is to make memory part of the processor, hence on-chip caches. An even more powerful idea is to integrate computation directly with each bit of memory, so that as soon as something is stored it can be immediately worked on in place. One way of thinking about this is making every byte of memory its own core, so a 64-megabyte chip is automatically a 64-megacore processor.

This is not practical with the von Neumann architecture, which decouples memory from CPU so that the processor can read and write any byte as instruction or data. One half-way house is Venray’s TOMI Celeste architecture, which couples specialised von Neumann cores very tightly to on-chip fast DRAM. The chips also include communication controllers, with the resulting modules interconnected through a custom 3D network topology. The system is optimised at every level for classic big data tasks such as MapReduce and graph analytics, where Venray claims up to 90 percent power savings over commodity hardware.

A very different — and categorically non-von Neumann — approach is Automata Processing from Micron. This equips each byte of memory with its own programmable pattern-recognition hardware, so that if you put the number 42 into that byte, a signal is immediately available that says ‘There’s a 42 here’. These signals are then chained together, so that if you want to search for a 42 followed by a 230, the second match can be reported on the very next cycle. Automata Processing (AP) is further speeded up by parallelising searches: there are 256 signals available for each byte, one for every possible value. If, after you’ve found your 42, you know you want to look for either a 230, a 7 or a 55, those three comparisons can be made in one cycle, and a hit on any of them will set up the corresponding signal for the next cycle. Any search completes in as many cycles as there are consecutive symbols to match.
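
The behaviour is easier to see in software. Here is a rough Python analogue of the idea (a sketch of the principle only, not Micron’s actual programming model or tool chain), using the 42-followed-by-230, 7 or 55 example from the text:

```python
# A software analogue of Automata Processing: each element recognises a set of byte
# values, and when it fires it enables its successors for the next input symbol.
# Every enabled element checks its whole set in the same "cycle", so matching takes
# one cycle per input byte, however many patterns are armed at once.

# Elements for the example in the text: a 42, then any of 230, 7 or 55.
elements = {
    "start": {"accepts": {42}, "next": ["second"], "report": False},
    "second": {"accepts": {230, 7, 55}, "next": [], "report": True},
}

def run(elements, data, start=("start",)):
    enabled = set(start)            # start elements watch every input position
    reports = []
    for pos, byte in enumerate(data):
        fired = {name for name in enabled if byte in elements[name]["accepts"]}
        for name in fired:
            if elements[name]["report"]:
                reports.append((pos, name))
        # next cycle: start elements stay armed, plus successors of anything that fired
        enabled = set(start)
        for name in fired:
            enabled.update(elements[name]["next"])
    return reports

print(run(elements, bytes([1, 42, 7, 42, 99, 42, 230])))
# [(2, 'second'), (6, 'second')]: two hits, found in a single pass over the data
```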

This can be extraordinarily efficient at pattern matching; you can program the chip to directly react to chosen rules without the need for any manipulation of the data it contains. Micron talks of thousand-fold increases in speed over existing architectures — for pattern matching.

Whether this will be enough to justify AP’s existence in general is another matter. As well as identifying tasks that will benefit from this approach (network routing and malware scanning are desperate for efficiency improvements), there are huge questions of how to integrate it with the existing computing infrastructure and how to give developers the tools to be productive in such an unfamiliar way of thinking. TOMI Celeste is in far better understood territory in both cases.

Neural networks

Although Turing considered that his Universal Machine could emulate the functions of the human brain, the von Neumann architecture is an impractical way to do so, for reasons of speed and efficiency. The brain is an extraordinarily complex computational mesh in which networking, memory and logic combine digital and analogue, and are vastly intertwined. Neural networking has long been seen as a way to emulate this, by building analogues of the way the brain’s signalling and component-level decision-making work.

Neural networks take inputs, analyse them by passing them through an internal structure, and produce outputs when the inputs correspond to a recognisable state. They can adjust their internal structure by internal or external feedback, learning novel patterns by comparison to previously understood states or through external signals received after an output is produced.
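
As a concrete, if drastically simplified, illustration, here is a single artificial neuron in Python that learns a pattern from external feedback (real networks stack huge numbers of these in layers; none of the names below come from any particular library):

```python
# A minimal sketch of the idea in software: one artificial neuron that takes inputs,
# produces an output, and adjusts its internal weights from feedback.

def neuron(weights, bias, inputs):
    # weighted sum of the inputs, squashed to a 0/1 decision
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, learning_rate=0.1, epochs=50):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - neuron(weights, bias, inputs)   # external feedback
            # nudge each weight in the direction that reduces the error
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Learn a simple pattern: output 1 only when both inputs are 1 (logical AND).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(samples)
print([neuron(weights, bias, x) for x, _ in samples])   # [0, 0, 0, 1]
```

After training, the learned pattern lives in the weights; feedback has adjusted the internal structure exactly as described above.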

So far, so good — except that the human brain has approximately 100 billion neurons (the basic unit of neural decision making), which can each have tens of thousands of inputs from, and hundreds of connections to, other neurons via synapses. There may be as many as 1000 trillion synaptic connections, which are themselves dynamic and capable of changing physical structure. The whole lot takes around twenty watts to run — compared to an estimated ten megawatts or so for similarly functioning state-of-the-art standard computing.

However, with such an astonishing target to aim for, there is enormous interest in simulating biological neural networks. Among the many past and present projects, one making some of the boldest claims comes from IBM and its experimental TrueNorth chip: a 5.4-billion transistor, 4,096-core design that simulates a million neurons and 256 million interconnections, consuming under a hundred milliwatts while recognising moving elements from visual scenes in real time.

TrueNorth emulates many of the key features of biological neural networks — signal delay, hard-wired interconnectivity, neuron reaction to inputs, synapse behaviour after a signal is received — by routing signals through an internal packet network and configuring component behaviour through integrated computation and memory units. The basic building block is a core with 256 inputs to and 256 outputs from its neurons, connected by a 256-by-256 array of synapses. Neurons within one core can directly address neurons in others, and signals are spikes of activity rather than continuous connections. Once a signal reaches a core, it can fan out internally to multiple neurons — this emulates the way brain neurons send their single output along a very long connection called an axon, which splits up only at its end. It’s also a very power-efficient arrangement.
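
A heavily cut-down software sketch shows the shape of such a core (assumptions: a 4-by-4 synapse array instead of TrueNorth’s 256-by-256, and a basic leaky integrate-and-fire neuron, which is a common simplification rather than IBM’s exact model):

```python
import random

# A much-simplified, software-only sketch of a TrueNorth-style core. Neurons
# integrate incoming spikes, leak a little each tick, and emit a spike of their
# own when they cross a threshold.

N = 4
random.seed(1)
synapses = [[random.randint(0, 1) for _ in range(N)] for _ in range(N)]  # wired or not
potential = [0.0] * N          # each neuron's accumulated charge
THRESHOLD, LEAK = 2.0, 0.5

def tick(input_spikes):
    """One clock event: deliver this tick's incoming spikes, then see who fires."""
    fired = []
    for n in range(N):
        # add up the spikes arriving on axons that are actually wired to neuron n
        potential[n] += sum(1 for axon in range(N)
                            if input_spikes[axon] and synapses[axon][n])
        potential[n] = max(0.0, potential[n] - LEAK)   # leak back towards rest
        if potential[n] >= THRESHOLD:
            fired.append(n)                            # spike...
            potential[n] = 0.0                         # ...and reset
    return fired

# Spike axons 0 and 1 for a few ticks and watch which neurons respond.
for t in range(5):
    print(t, tick([1, 1, 0, 0]))
```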

The whole chip has a 64-by-64 array of cores supported by mesh networking and is clocked at 1kHz (a mere thousand times a second), but each clock event sparks off an entire wave of activity that flushes through the system under its own timing. IBM claims this system consumes 176,000 times less energy than a simulation on general-purpose microprocessors, and some 700 times less energy than the best current custom designs running identical tasks. Although IBM admits it isn’t possible to compare TrueNorth’s architectural efficiency with standard processors running standard tasks, it can’t resist comparing its 400 billion synapse operations per watt with the 4.5 billion flops/watt of current large systems.

Even at this level, neural computing is a long way from brain emulation — and, like all non-von Neumann architectures, will be immensely difficult, if not forever inappropriate, to use in general-purpose tasks. The raw figures for the human brain do not begin to cover the huge variations at all levels and structural mysteries it still contains. On the other hand, TrueNorth is a major step forward for neural networking — not least because the chip is built using standard Samsung 28nm technology, and, IBM says, is arbitrarily scalable to multi-chip systems.

There are many other efforts going on in parallel with this, in understanding and emulating many of the brain’s larger bus and module structures. Neural networking remains the single most exciting non-von Neumann area of research, with the potential for solid progress in architectural discoveries well past the point where Moore’s Law may cease to be the industry driver.

Quantum computing

Few areas of cutting-edge computation attract as much interest and controversy as quantum computing. It is the furthest from standard computing, relying on the manipulation of components that can hold multiple different states at the same time, with the right answer appearing as those states collapse under the remarkably counter-intuitive laws of quantum physics. That answer is often only probably right, and further runs of the computer — or other checks — must be done to make sure the truth has been reached.

The basic component of quantum computing is the qubit, which corresponds to the bit in standard computing. However, while a string of bits can only contain a single number at any one time, a string of qubits can contain any or all of the possible numbers that string can represent — a phenomenon called superposition. Sixteen bits can represent any number from zero to 65,535 — strictly speaking, they hold just one of 65,536 states — whereas 16 qubits can hold all of those states simultaneously. In classical computing, it takes a long series of operations to find out or manipulate information about a number — is it prime, what are its factors? — and that series can grow exponentially longer the larger the number is. By contrast, a quantum algorithm acts on the qubits to extract that information in far fewer steps, passing them through a configuration of quantum gates that manipulate the probabilistic states they represent, winnowing those states down step by step until one remains at a much larger probability than the others.
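
A classical simulation makes the bookkeeping clear, and also shows why simulating qubits on ordinary hardware gets out of hand: the sketch below (illustrative only) has to store all 65,536 amplitudes explicitly, which real quantum hardware does not.

```python
import numpy as np

# A classical bookkeeping exercise for 16 qubits.

n_qubits = 16
dim = 2 ** n_qubits                     # 65,536 possible basis states

# A classical 16-bit register: exactly one of those states at a time.
classical = np.zeros(dim)
classical[42] = 1.0                     # "the register holds 42", and nothing else

# A 16-qubit register in equal superposition: an amplitude for every state at once.
state = np.full(dim, 1.0 / np.sqrt(dim), dtype=complex)

# A quantum algorithm would now apply gates that boost the amplitude of the wanted
# answer. Measurement collapses the register to a single state, chosen with
# probability equal to the squared amplitude:
probabilities = np.abs(state) ** 2
print(np.random.choice(dim, p=probabilities))   # one of 65,536 outcomes, at random
```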

The mathematics behind this process, although inaccessible to mortals, is very well understood — quantum theory, as applied to the sorts of physics found in quantum computers, has been exhaustively verified. The practical problems, however, are also insane. Superposition is very difficult to usefully create and control, collapsing to a single state at the slightest excuse. Any attempt to observe or debug a quantum computer is almost certain to destroy the state it’s looking at.

Qubits can interfere with each other, making it very hard to handle usefully large numbers, and some potentially useful quantum algorithms demand astronomically large numbers of quantum gates to create interesting results. There are a phenomenally large number of ways errors can creep in, and a truly minuscule number of ways of finding and correcting them. This is particularly irksome, as the main theoretical advantage of quantum computing — the so-called quantum speed-up over conventional approaches — is only apparent in systems too complex to build reliably.

To date, just one company, D-Wave, has claimed a functional quantum computer that’s available for purchase, although perhaps the major result demonstrated so far is that, among their many quirks, such things may be very difficult to benchmark.

However, this is a very intriguing field for many reasons, including the fact that quantum computing can theoretically cope with some tasks that conventional computing will never be able to practically manage. The exercise of overcoming the incredible variety of challenges attracts a lot of very smart people, not least because it represents a rich field of frontier technology with enormous potential. Fundamental physics becomes a lot more interesting when it has immediate applications.

Finally, there is intriguing evidence that many of the challenges in quantum computing have already been solved — by plants. Photosynthesis converts sunlight to energy and distributes it with a speed and efficiency that is completely inexplicable by classical physics, and moreover does it in the rather less than pristine environment of your average cow pasture. This hints at truly fundamental insights into the core processes that will one day power quantum computing. For now, look at grass with a new respect.