Thirty-five years ago, Intel introduced the 8086 processor. It’s a point of some importance that the latest Intel design, the Haswell architecture, will run code that’s older than some of the younger engineers responsible for the new processor. Because, for all the traditional razzmatazz that accompanies each new processor generation, the overriding theme remains as it has been for decades now: incremental changes that squeeze some performance increases without threatening Intel’s one massive advantage — compatibility.

Thus, although Haswell is a ‘tock’ design (a new architecture on an existing fabrication process) rather than the ‘tick’ of an old architecture on a new process, it inherits a multitude of architectural features from its Ivy Bridge predecessor alongside the 22nm process and tri-gate transistors. With the same 14-stage pipeline, dual-channel DDR3 support, and 64KB of L1 and 256KB of L2 cache per core, it’s perhaps best to call Haswell a ‘toick’.

4th Generation lineup

The 4th Generation Core (Haswell) processor lineup looks familiar, too. There are four mobile classes: H (quad-core with integrated Iris Pro graphics); M (quad- and dual-core with discrete graphics); U (SoC with Iris graphics option); and Y (ultra-low-power SoC). Those last two are aimed at ultrabooks and tablets. Desktop Haswells include K for overclockers, ‘performance and mainstream’ quad- and dual-core processors, and low-power S and T series. With a range of power envelopes between 10W and 140W, Intel isn’t throwing in the towel in any x86 market any time soon.

Inside, Haswell has doubled some key performance points. For example, it can load and store twice as many bytes per memory cycle in the L1 cache as the previous architecture — 64 and 32, compared to 32 and 16. It also has 64 bytes/cycle bandwidth between L1 and L2 — again, double that of Ivy Bridge.

Graphics performance, so long the runt in Intel’s design litter, gets a considerable revamp. There are three new graphics options, HD Graphics 5000, Iris 5100 and Iris Pro 5200, all with 40 cores capable of running 16 operations simultaneously and clock speeds of up to 1.3GHz; the Iris Pro also has 128MB of on-package (but not on-chip) eDRAM, fabricated in its own low-leakage 22nm process. Codenamed Crystalwell, this is effectively an L4 cache that can be shared between GPU and CPU use, an architectural decision that leaves Intel with a lot of flexibility in configurations for future products.

Power management

Power management is high on Intel’s hitlist, and Haswell has four voltage planes that give fine-grained control over the processor’s power consumption, tracked by usage. The major innovation here is that the regulators controlling those planes are now integrated onto the chip instead of sitting off-chip on the motherboard. This means they can work a lot faster and more efficiently; it also simplifies board design and avoids a lot of electrical noise issues.

Haswell also has greater awareness of how long various devices and interfaces take to wake up, and can arrange to rouse them from sleep states in an optimal order to save power without losing performance. With the most highly integrated system-on-chip versions of Haswell, where most interfaces are on-chip and under direct control, this could result in factors-of-ten improvements in battery life under various regimes, but this remains to be confirmed. New versions of popular operating systems will also have much greater knowledge of how Haswell handles transitions between sleep states, and it’s this combination that’s likely to see the most notable increases in battery life rather than any great architectural change. Intel says that, with the right conditions, a Haswell system can be in a sleep state yet wake up regularly enough to gather data from the network or Wi-Fi to survive a week between charges.

New instructions

There are four areas with new instructions. AVX2 extends Intel’s SIMD support to 256-bit-wide vectors, providing very wide vector arithmetic for mathematical operations on large blocks of data. Mainstream uses for this are primarily encryption and graphics work: Intel has already released an AVX2 version of the industry-standard Network Security Services library, which it claims doubles throughput for RSA1024 and DSA1024, alongside many other improvements.
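
To give a flavour of what that looks like from C, here’s a minimal sketch using the AVX2 intrinsics that ship with mainstream compilers (compile with -mavx2; the function and variable names are our own illustration, not Intel code). Each _mm256_add_epi32 adds eight 32-bit integers in a single instruction:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   /* AVX2 intrinsics; compile with -mavx2 */

/* Add two integer arrays eight elements at a time. A minimal sketch:
   the function and variable names are ours, not Intel's. */
void add_arrays(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; i++)        /* scalar tail for any leftover elements */
        out[i] = a[i] + b[i];
}
```

In practice a vectorising compiler will often generate this sort of code automatically, but the intrinsics make the mapping to the new instructions explicit.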

Another extension is FMA, a set of Fused Multiply-and-Add instructions. It’s common in digital signal processing, high-performance computing and real-world data analysis to follow a multiply with an addition; FMA does both in a single step, speeding throughput and, because the result is rounded only once, increasing accuracy.
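
The classic case is a dot product. A hedged C sketch using the _mm256_fmadd_ps intrinsic (compile with -mfma; the wrapper function is our own illustration):

```c
#include <stddef.h>
#include <immintrin.h>   /* FMA intrinsics; compile with -mfma */

/* Dot product with fused multiply-add: each _mm256_fmadd_ps computes
   a*b + acc for eight floats in one instruction, rounding only once.
   A sketch; the names are our own. */
float dot(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    float lanes[8], sum = 0.0f;
    _mm256_storeu_ps(lanes, acc);       /* horizontal sum of the lanes */
    for (int j = 0; j < 8; j++)
        sum += lanes[j];
    for (; i < n; i++)                  /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}
```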

A new set of Bit Manipulation Instructions (BMI) adds the ability to do various tests, logic operations and manipulations on individual bits within registers or memory locations. That’s useful for cryptography, variable-format bitstreams, large-number arithmetic and hashes. They won’t change your life otherwise.
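
Two small C sketches suggest the flavour; the intrinsics are Intel’s, while the wrapper functions and example mask are ours (compile with -mbmi -mbmi2 on a 64-bit target):

```c
#include <stdint.h>
#include <immintrin.h>   /* BMI1/BMI2 intrinsics; compile with -mbmi -mbmi2 */

/* PEXT (BMI2): gather the bits selected by a mask into the low end of
   the result -- one instruction instead of a shift-and-test loop. The
   mask here is an arbitrary example. */
uint64_t gather_bits(uint64_t word)
{
    return _pext_u64(word, 0x00000000F0F0F0F0ULL);
}

/* TZCNT (BMI1): index of the lowest set bit; unlike the older BSF
   instruction, it is well defined when x is zero (it returns 32). */
unsigned lowest_set_bit(uint32_t x)
{
    return _tzcnt_u32(x);
}
```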

One of the major innovations is Transactional Synchronization eXtensions (TSX), which directly address one of the major design problems for software that runs on parallel systems such as multicore processors: what happens when two simultaneous processes attempt to access the same memory? If one is reading while the other is writing, you can’t tell whether the reading process will get the old data, the new data, or a mixture of the two. If both are writing at the same time, it’s anyone’s guess what the result will be.

The traditional solution to this is for each process to try to lock the memory and mark it as temporarily private. If the lock succeeds, the process can carry on, do its business and then unlock the memory. If the lock fails because another process has got there first, the supplicant process has to wait, retrying the lock until the other process finishes and unlocks.
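
In C11 terms, the pattern looks something like this minimal sketch (the lock variable and wrapper function are our own illustration); the busy-wait is precisely the tight retry loop discussed below:

```c
#include <stdatomic.h>   /* C11 atomics */

static atomic_flag lock = ATOMIC_FLAG_INIT;

void run_protected(void (*work)(void))
{
    /* Try to take the lock; if another thread got there first, spin
       in a tight retry loop until it unlocks. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                                  /* busy-wait */
    work();                                /* do its business */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```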

There are numerous problems with this. Retrying for a lock effectively stalls the process in a very tight loop, defeating a lot of power management and other performance measures that assume a healthy mix of instructions being run. If both processes are only going to read from the memory region, there’s no need to lock it at all, but that’s often impossible for the programmer to know when writing the software. And it’s much easier to put a single lock on a big data structure before doing lots of work on it (coarse-grained locking) than to carefully lock and unlock smaller areas (fine-grained locking). But this leaves other processes locked out for long periods, which hurts performance.
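
Sketched with pthreads (the bucketed structure and names are our own illustration), the trade-off looks like this:

```c
#include <pthread.h>

#define BUCKETS 64

struct table {
    pthread_mutex_t big_lock;              /* coarse: one lock for everything */
    pthread_mutex_t bucket_lock[BUCKETS];  /* fine: one lock per bucket */
    int bucket[BUCKETS];
};

void table_init(struct table *t)
{
    pthread_mutex_init(&t->big_lock, NULL);
    for (int i = 0; i < BUCKETS; i++) {
        pthread_mutex_init(&t->bucket_lock[i], NULL);
        t->bucket[i] = 0;
    }
}

/* Coarse-grained: simple to write, but every other thread waits,
   even those touching unrelated buckets. */
void bump_coarse(struct table *t, int i)
{
    pthread_mutex_lock(&t->big_lock);
    t->bucket[i % BUCKETS]++;
    pthread_mutex_unlock(&t->big_lock);
}

/* Fine-grained: threads only collide on the same bucket, but the
   locking discipline is far easier to get wrong. */
void bump_fine(struct table *t, int i)
{
    pthread_mutex_lock(&t->bucket_lock[i % BUCKETS]);
    t->bucket[i % BUCKETS]++;
    pthread_mutex_unlock(&t->bucket_lock[i % BUCKETS]);
}
```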

Transactional schemes, which have been around in database designs for at least thirty years, protect the complete transaction rather than expecting code to directly manage each memory access. That requires considerable complexity, which has not been economic to implement in x86 hardware until now.

The key realisation behind TSX is that the processor already has hardware capable of arbitrating between conflicting memory accesses: that’s what the cache mechanism does all the time. As a result, the cache system is pressed into service to provide two new ways to protect transactions in Haswell.

First is Restricted Transactional Memory (RTM), which has three new instructions: XBEGIN, XEND and XABORT. The programmer uses XBEGIN and XEND to mark a chunk of code that must be allowed to complete without contention. XBEGIN tells the processor to buffer up all the memory changes until it encounters XEND, at which point it attempts to commit all the changes in one go. If that fails because of other process activity, the processor rolls the affected process back to the state it was in before the transaction started and transfers control to an abort sequence, which will probably try again with traditional locks. XABORT lets the process bail out itself if it detects that it’s going to fail anyway.
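
Compilers expose RTM through intrinsics, so a hedged C sketch of the pattern might look like this (compile with -mrtm; the counter and fallback mutex are our own illustration, not part of the TSX API):

```c
#include <immintrin.h>   /* RTM intrinsics; compile with -mrtm */
#include <pthread.h>

static long counter;
static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

void increment(void)
{
    unsigned status = _xbegin();         /* XBEGIN: start buffering changes */
    if (status == _XBEGIN_STARTED) {
        counter++;                       /* held back until the commit */
        _xend();                         /* XEND: commit everything at once */
    } else {
        /* The transaction aborted: the processor rolled everything back
           and resumed here, so retry under a traditional lock. (Real
           code would also read the lock inside the transaction so the
           two paths can't interleave; _xabort() would bail out early.) */
        pthread_mutex_lock(&fallback);
        counter++;
        pthread_mutex_unlock(&fallback);
    }
}
```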

The trouble with RTM is that it’s Haswell-specific and won’t run on any previous architecture. The other mechanism, Hardware Lock Elision (HLE), is rather clever: it works in conjunction with standard locks in such a way that pre-Haswell processors will just ignore it and use their older locking mechanism. Haswell, however, will effectively ignore the lock instructions and let processes behave as if they’ve got a successful lock, even if another process is in conflict. If everything goes well because neither process writes new data to the protected memory, then all processes complete happily. If there is contention with writes to protected memory, the processor rolls back the affected process to the point where it set the lock and re-runs it, this time taking the lock for real.

HLE has no new instructions; rather, it has two new instruction prefixes, XACQUIRE and XRELEASE, which modify existing instructions that can take a lock. Intel says it will be very simple for programmers to add these prefixes, which pre-Haswell processors ignore, to existing code, with no other changes required.
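
GCC, for example, exposes the prefixes through its __atomic builtins. A minimal sketch (compile with -mhle; the lock variable and function names are ours):

```c
#include <immintrin.h>   /* for _mm_pause(); compile with -mhle */

/* The __ATOMIC_HLE_* hints make GCC emit the XACQUIRE/XRELEASE
   prefixes; older processors ignore them and take the lock for real.
   The lock variable is our own illustration. */
static int lock;   /* 0 = free, 1 = held */

void hle_acquire(void)
{
    /* XACQUIRE: the exchange is elided on Haswell; on conflict the
       processor rolls back, re-runs it and takes the lock for real. */
    while (__atomic_exchange_n(&lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();   /* brief pause before retrying */
}

void hle_release(void)
{
    /* XRELEASE: marks the end of the elided critical section. */
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

On a pre-Haswell processor the same binary simply spins on the lock as usual, which is what makes the scheme backwards-compatible.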

Intel says that TSX will improve scalability in many database applications, seeing greater benefits from larger numbers of cores, while also reducing power-hungry memory accesses. Limitations imposed by the cache architecture constrain the memory areas that can be protected, but both conceptually and practically TSX seems like a useful addition to the parallel programmers’ armoury.

Incremental improvements

Haswell’s incremental improvements follow a familiar cadence, with strong moves towards better system-on-chip products offering useful improvements in battery life, cost and form factors; other enhancements signal Intel’s intention to fight fiercely in high-performance and graphics-intensive markets. There’s no single big improvement, but for an architecture in late middle age to be capable of such feistiness on multiple fronts is commendable.

Or, in Intel’s case, absolutely essential.