The recent Microprocessor Forum has provided new information on the upcoming AMD Hammer processor. Expected by the end of 2002, AMD’s first foray into the server market offers a unique combination of backward compatibility and forward-looking 64-bit support. While the current Athlon processor is capable of competing with the Pentium 4, AMD is targeting Intel’s 64-bit server processors from the IA64 family: Itanium (code-named Merced) and its successors.

AMD has proposed an extension to the x86 instruction set called x86-64, which is basically an expansion of the 32-bit x86 instruction set. In this environment, the x86-64 processors are compatible with current 32-bit applications and operating systems, while the 64-bit Intel IA64 platform uses a completely different format. AMD claims the x86-64 change will be like going from the 16-bit x86 instructions of the 80286 processors to the 32-bit x86 80386 and 80486 processors.

The Hammer processor will use a different form factor and layout from current Athlons, so it is not an upgrade chip. It will be compatible with existing hardware standards and BIOS systems, although some modifications will be needed to support the full feature set. Hammer will include a veritable bonanza of multimedia extensions. In addition to AMD’s own 3DNow! and Enhanced 3DNow!, the Hammers will be able to use Intel’s SSE and SSE2 instructions. This is a new feature, as current Athlons are not able to support SSE2, and older Athlons lacked SSE support altogether. The multimedia extensions speed up the advanced mathematical operations used in games, video editing, 3D modeling, and CAD applications.

The technology
The Hammer processor consists of the CPU core, L1 and L2 caches, and an onboard North Bridge. The North Bridge is based on AMD’s new bus technology, HyperTransport, which was formerly called Lightning Data Transport. The North Bridge is supposed to have a total bandwidth of 19.2 GBps across three HyperTransport interfaces. Of that 19.2 GBps, the North Bridge provides 6.25 GBps for IO devices (i.e., the PCI, AGP, and USB buses) and 2.7 GBps for the memory controller, which leaves 10.25 GBps for interprocessor communications.
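As a quick sanity check on those figures, the sketch below adds up the three quoted allocations and divides the total across the three HyperTransport interfaces. The names `TOTAL_GBPS`, `BUDGET_GBPS`, `budget_total`, and `per_link_bandwidth` are made up for illustration, and the even split across links is an assumption, not something AMD has stated.

```python
from decimal import Decimal

# Figures quoted in the article, in GBps. Decimal avoids float rounding noise.
TOTAL_GBPS = Decimal("19.2")
HT_LINKS = 3
BUDGET_GBPS = {
    "io_devices": Decimal("6.25"),
    "memory_controller": Decimal("2.7"),
    "interprocessor": Decimal("10.25"),
}


def budget_total(parts):
    """Sum the individual bandwidth allocations."""
    return sum(parts.values(), Decimal("0"))


def per_link_bandwidth(total, links):
    """Bandwidth per interface, ASSUMING the total is split evenly."""
    return total / links


if __name__ == "__main__":
    print("Sum of allocations:", budget_total(BUDGET_GBPS), "GBps")
    print("Per link (even split):", per_link_bandwidth(TOTAL_GBPS, HT_LINKS), "GBps")
```

The allocations do add up to the quoted total, and an even split works out to 6.4 GBps per interface.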

Need a refresher?

See my motherboard chipset article for an explanation of North and South Bridges as well as a lesson on HyperTransport.

The memory interface supports PC2100, PC2400, and PC2700 DDR memory with up to eight registered DIMMs. Using current memory density, this gives a Hammer CPU up to 16 GB of RAM at 2.7 GBps, which is a bargain when compared with 1.6 GBps for single-channel Rambus or 3.2 GBps for dual-channel Rambus, even at today’s cheap memory prices.
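The bandwidth figures above follow from a simple formula: transfers per second times bytes per transfer. The sketch below reproduces them under assumptions that are not in the article itself: a 64-bit DDR module bus, nominal DDR data rates of 266, 300, and 333 million transfers per second, and a 16-bit Rambus channel at 800 million transfers per second. The function name `peak_bandwidth_gbps` is hypothetical.

```python
def peak_bandwidth_gbps(mega_transfers_per_sec, bus_width_bits):
    """Theoretical peak bandwidth in GBps (10**9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9


# ASSUMED nominal data rates (million transfers/s) and bus widths (bits).
DDR_BUS_BITS = 64
DDR_RATES = {"PC2100": 266, "PC2400": 300, "PC2700": 333}
RDRAM_BUS_BITS = 16
RDRAM_RATE = 800

if __name__ == "__main__":
    for name, rate in DDR_RATES.items():
        print(name, round(peak_bandwidth_gbps(rate, DDR_BUS_BITS), 1), "GBps")
    single = peak_bandwidth_gbps(RDRAM_RATE, RDRAM_BUS_BITS)
    print("Rambus single channel:", round(single, 1), "GBps")
    print("Rambus dual channel:", round(2 * single, 1), "GBps")
    # Eight DIMMs reaching 16 GB implies 2 GB per DIMM.
    print("GB per DIMM:", 16 / 8)
```

These are theoretical peaks; sustained throughput in a real system will be lower.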

The system has support for 40-bit physical and 48-bit virtual memory addressing. This translates to a system with capabilities of handling 1 terabyte of RAM and 256 terabytes of virtual memory. This is important due to the use of coherent memory, where the RAM attached to each processor is available to the other processors in the array.
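The address-space figures come straight from powers of two. As a minimal sketch (the helper name `addressable_tib` is made up for illustration), each additional address bit doubles the number of bytes that can be addressed:

```python
TIB = 1 << 40  # one terabyte in the binary sense: 2**40 bytes


def addressable_tib(address_bits):
    """Number of terabytes (2**40 bytes each) reachable with the given address width."""
    return (1 << address_bits) // TIB


if __name__ == "__main__":
    print("40-bit physical:", addressable_tib(40), "TB")
    print("48-bit virtual:", addressable_tib(48), "TB")
```

A 40-bit physical address reaches exactly 1 terabyte, and the 8 extra bits of virtual addressing multiply that by 256.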

Sharing memory between CPUs in a four-way system is a nine-stage process, unless you have a dual-CPU system; then, it’s six stages. Latency will be much higher than for local memory, but AMD says it will seem as though there is a DRAM page conflict on normal memory, meaning the access needed a different DRAM page (row) from the one currently open, so the open page had to be closed and the new one opened first. Four-way systems should expect average latency across the system after a page miss to be approximately 140ns, with 160ns on eight-way systems. Latency shrinks as the CPU speed increases. The interprocessor HyperTransport connection will also become faster as clock speeds rise, since it is an integrated component.

The level 1 (L1) and level 2 (L2) caches have not been specified to date, as the processor has yet to go into final silicon production. Some sources indicate that the L1 cache will be 64 KB, although whether this is the combined data and instruction cache or the size of each is not clear. The current Athlon has a 64-KB/64-KB L1 cache, which is likely to be the case on the Hammer.

In between the L1 and L2 caches is a Translation Lookaside Buffer (TLB). This structure caches recently used virtual-to-physical address translations, so the processor does not have to consult the page tables in memory on every access. Since this occurs independently of the normal schedulers and data management scheme, it increases the effective performance of the caches. Currently, Duron processors come equipped with an enhanced TLB and perform about as well as an older Athlon that has twice the cache but lacks the enhancement. Only in cases where the data exceeds the size of the Duron’s smaller cache can the older Athlons outperform the “economy” chip. Placed on a high-end chip, it creates an extra level of efficiency that bypasses some of the latency problems I described.

L2 cache is expected to be 1 MB, four times that of the Athlon or P4, although eight-way systems may sport up to 2 MB of L2 cache to increase the performance across the distributed memory. Larger L2 caches will reduce the number of cache misses and avoid the additional memory latencies, but at the cost of additional die area and transistors.

Once instructions get past the caches, they reach three instruction decoders. These decoders take the rather bulky x86 instructions and break them into more efficient microoperations. Microoperations became common on x86 processors years ago and, combined with a good scheduler, can add significant performance.

In the Hammer’s case, these three decoders drop their instructions into an additional four schedulers: three feeding the integer arithmetic logic units (ALUs) and one far more complicated scheduler feeding the floating point units (addition, multiplication, and miscellaneous functions that handle the multimedia extensions). Thus, for every cycle, the processor could perform four operations: three integer and one floating point. If the scheduler is capable of simultaneously feeding all the floating-point units, you could potentially have six operations: three integer and three floating point.

Symmetric multiprocessing
Hammer is intended for multiprocessor systems. The combination of integrated memory controller and bus interface turns it into a “glueless” processor. Other processors need “glue” logic to handle the relationship between processors, the memory controllers, and the system buses to ensure all components get a turn. Giving each CPU the interface it needs eliminates these conflicts.

The maximum number of processors will depend on the complexity of the HyperTransport interface of each processor. SMP arrays expecting more complex interaction between processors will need extra interprocessor bandwidth. In theory, all Hammer systems could be equipped with sufficient bandwidth to handle eight-way systems. In reality, only the SledgeHammer variants will have support for more than two processors, with ClawHammer processors being single- and dual-processor workstation chips. This allows the ClawHammers to be cheaper by cutting down the HyperTransport interfaces and the sizes of the L2 caches.

Data center role
Hammer processors will provide a type of platform unseen since the days of Windows NT on Digital/Compaq Alpha processors. Despite Intel’s best efforts with the Xeon series of Pentium processors, the x86 processor has never had a serious presence in the “big iron” world of corporate data centers. Various flavors of UNIX, many running on 64-bit processors, are the heart of most major corporations.

Years ago, when you could buy an Alpha running NT, you had a number of upgrade paths. Straight from the manufacturer you had Tru64 UNIX at a significant cost. Also, there were the relatively inexpensive BSD and Linux/UNIX variants. For the longest time, Alpha users expected to have Windows 2000 available as an upgrade path, but there were conflicts between Compaq and Microsoft that eventually made that a fleeting hope. Regardless, as you went from a small company running Windows NT services on an x86 Intel desktop to a midsize company with Windows NT on a low-end Alpha, you could migrate to a cluster of Alphas with UNIX without having to buy completely new hardware.

Hammer is intended to be an even smoother upgrade path. Where NT was the only version of Windows that Alpha could run, Hammer will be able to run everything from DOS to Windows 2000/XP without the least bit of modification. While Windows 9x/Me and 2000 probably won’t get much more out of a Hammer than an Athlon, Microsoft beta testers have found notes in some header files indicating that Windows XP is already aware of the Hammer processor.

How well that awareness will translate into performance will have to be seen. It could require a wait for Microsoft’s sequel to XP to actually take advantage of Hammer’s 64-bit potential. However, whenever that Microsoft x86-64 OS does come out, you will still be able to run all your existing software on it, upgrading only those programs that need the extra performance provided by operating in 64-bit mode. This contrasts starkly with Intel’s IA64, which will require all new applications or the use of inefficient emulators.

Fortunately, we won’t have to wait for an OS to run on the Hammer, as Linux will be supported at launch time. AMD has spent much effort courting the open source community, as it will provide the testing grounds for many organizations’ first experiments with Hammer. A variant of the GNU C Compiler (gcc) is already available with x86-64 support.

The importance of compilers cannot be stressed enough. As proven by the Pentium 4, a processor-aware compiler makes as much difference as increased clock speed, if not more. Tests show the Pentium 4 becomes about 25 percent faster on high-end applications when they are built using a compiler that supports the P4. Of course, the P4 has been out for over a year, and not all languages have nonbeta P4 compilers. If AMD can’t take advantage of code optimization by ensuring compilers are readily available at release time, the Hammer processor will be a short-lived product.

Software with full x86-64 recognition is expected to be no more than 10 percent larger than the same program with only 32-bit x86 support. The code growth is mainly due to additional instruction prefixes and headers incorporated into the compilation. AMD reports that Hammer-aware compilers will actually reduce the size of many aspects of an application due to the extra capacity of the processor without performing optimization within the code.
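To make the “additional instruction prefixes” concrete, here is a rough sketch of how a register-to-register MOV is encoded with and without the x86-64 REX prefix byte. It draws on the published x86-64 instruction encoding rather than anything in AMD’s presentation, and the helper names (`modrm`, `rex`, `encode_mov`) are hypothetical. Only opcode 0x89 (MOV r/m, r) in register-direct form is covered.

```python
def modrm(reg, rm):
    """Register-direct ModRM byte: mod=11, then the low 3 bits of reg and rm."""
    return 0xC0 | ((reg & 7) << 3) | (rm & 7)


def rex(w, reg, rm):
    """REX prefix byte 0100WRXB; the X bit is unused in register-direct form."""
    return 0x40 | (w << 3) | ((reg >> 3) << 2) | (rm >> 3)


def encode_mov(dst, src, bits):
    """Encode MOV dst, src for register numbers 0-15 and 32- or 64-bit operands."""
    body = bytes([0x89, modrm(src, dst)])
    # A prefix is needed for 64-bit operands or for the new registers r8-r15.
    if bits == 64 or dst > 7 or src > 7:
        return bytes([rex(1 if bits == 64 else 0, src, dst)]) + body
    return body


if __name__ == "__main__":
    EAX, EBX, R8 = 0, 3, 8  # register numbers
    print(encode_mov(EAX, EBX, 32).hex())  # mov eax, ebx
    print(encode_mov(EAX, EBX, 64).hex())  # mov rax, rbx
    print(encode_mov(R8, EBX, 64).hex())   # mov r8, rbx
```

The 64-bit forms cost one extra byte each. Instructions that stay with 32-bit operands and the original eight registers need no prefix, which is consistent with the modest overall growth AMD expects.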

The Hammer processor will prove an interesting challenge for Intel and give new options to CTOs looking to upgrade their systems. UNIX on x86 has brought the stability of the Fortune 500 data center to the Fortune 5,000,000. Bridging the 32-bit-to-64-bit gap will let the masses gain the capability to handle large amounts of data that today’s digital world needs, and scalable SMP will provide a way to supply clustered servers. If AMD can deliver, it will be an interesting world. It’s a big “if,” but given the unexpected success of the Athlon, it’s not out of the realm of possibility.