Processors duke it out for PC supremacy

The Tyson vs. Lewis fight has been canceled, but that doesn't mean there isn't a good fight going on. Intel and AMD are battling it out with their heavyweight processors, the P4A and XP, respectively. James McPherson gives a blow-by-blow account.

In the battle for heavyweight processor of the world, quite a fight is brewing. In the Intel corner, wearing a blue and orange logo, weighing in at 55 million transistors, the latest in a long series of champion processors, the Pentium 4A series (P4). And in the AMD Athlon corner, wearing a green and black logo, weighing in at 37.5 million transistors, our challenger, the XP.

All right, enough with the metaphors. I set out to find an overall winner in the hotly contested market of heavyweight processors by comparing the top two processors in such areas as cache and memory. Read my examination of the specs offered by both of the chips' manufacturers to understand the inner workings of these processors and to see who comes out on top.

Pentium 4A architecture
The new version of the P4 has received the unfortunate designation "A". This makes things a bit confusing since the older P4 is just a P4. To simplify things, I'll refer to the new P4A by the code name Intel uses for it: Northwood.

Transistors are the little electronic components that are able to do yes/no operations. Link them together just right, and you can do all kinds of things, like logic, math, or storing data. The 55 million transistors in the Intel Northwood are about 13 million more than the first-generation P4 and 27 million more than the Pentium III (PIII) uses. This should tell you that the Northwood is more complicated to create, and there are many things that could go wrong. The new Northwood uses a much smaller process that creates 0.13-micron circuits, as compared to the 0.18-micron process used in older P4s and the Athlon. The Northwood also uses the new Socket 478, making it incompatible with older Socket 423 motherboards and making older systems obsolete.

Caching out
Cache memory is faster than system RAM and is intended to keep the processor busy. Level 1 (L1) cache, located inside the chip, holds the data the processor is working on (data cache) and/or the list of operations the processor is supposed to do (instruction cache). Intel integrated the instruction cache into the Northwood's pipeline so the CPU gets instructions in an orderly package. As a result, the Northwood’s instruction cache is fast and hard to differentiate from the data cache. Intel claims it stores about 12,000 micro-operations, the very tiny processor operations so named to differentiate them from the more complex programming operations.

The Northwood has an abnormally small (by today's standards) data cache at 8 KB. This is a quarter of the PIII's L1 cache and an eighth of the Athlon’s L1 data cache; however, the Northwood's overall cache is incredibly fast. As long as the main memory is running as fast as the Northwood needs it to, all should be fine. If that cache runs dry, though, the entire processor sits idle. For how long? Well, pipeline stages indicate how many steps an operation must go through on its way through the CPU. If the L1 cache isn't filled, the processor can sit idle for about as long as it takes to refill its 20-stage pipeline. Naturally, Intel assembled a predictor to keep everything running smoothly. That’s where some of those 55 million transistors went.

The Northwood's Level 2 (L2) cache is pretty big at 512 KB. It also typifies much of the Northwood design in that it grabs great big chunks of data. The L2 doesn’t take dainty little 32-byte bites but instead grabs ravenous 64-byte mouthfuls. This is an excellent design if your operations are neatly arranged or if you're dealing with large data sets. However, the design isn't as good when dealing with dynamic and imperfectly optimized applications. (An old programmer once told me that optimizing is more difficult than bug fixing since a bug is just broken; unoptimized code is just slow.)
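To see why big mouthfuls help orderly data and hurt scattered data, here is a minimal Python sketch. It is a toy, not a model of the Northwood's actual cache: the function name, the two access patterns, and the assumption that every distinct chunk is fetched exactly once are all illustrative. The chunk sizes of 32 and 64 simply echo the bite sizes quoted above.

```python
def traffic(addresses, chunk_size):
    """Return (chunks fetched, units transferred) for a list of addresses.

    Toy assumption: every distinct chunk that is touched gets fetched once.
    """
    chunks = {addr // chunk_size for addr in addresses}
    return len(chunks), len(chunks) * chunk_size

# Hypothetical access patterns, 1,024 accesses each.
orderly = list(range(1024))                    # neighbors, one after another
scattered = list(range(0, 1024 * 4096, 4096))  # far apart, one per chunk

for name, pattern in [("orderly", orderly), ("scattered", scattered)]:
    for chunk_size in (32, 64):
        fetched, moved = traffic(pattern, chunk_size)
        print(f"{name:9} chunk={chunk_size}: {fetched} fetches, "
              f"{moved} units moved for {len(pattern)} units used")
```

On the orderly pattern, doubling the chunk size halves the number of fetches while moving the same amount of data. On the scattered pattern, the fetch count doesn't budge and the amount of data moved doubles, most of it never used.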

Memory makers
On the memory front, the Northwood features a front side bus, the interface between the CPU and the rest of the system. The Northwood has a 3.2-GBps connection. This brings us to an external component, RAM. The Northwood can use expensive Rambus DRAM (RDRAM), which can run up to 1.6 GBps per channel, with a maximum of two channels. Double 1.6 GBps and you get 3.2 GBps. Great! The memory bus is full! DDR, on the other hand, is double data rate SDRAM that is available in bandwidths up to 2.1 GBps per channel. In dual-channel mode, this would provide 4.2 GBps, more than the bus can carry; in single-channel mode, it makes a very inexpensive memory solution with decent bandwidth. Using standard SDRAM at a measly 1 GBps leaves the Northwood gasping for air.
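The arithmetic above is easy to check. The following Python sketch uses only the peak figures quoted in this section; the function and constant names are my own, and real-world throughput will fall short of these peaks.

```python
FSB_GBPS = 3.2  # Northwood front side bus, as quoted above

# Peak bandwidth per channel in GBps, as quoted above.
PER_CHANNEL_GBPS = {"RDRAM": 1.6, "DDR": 2.1, "SDRAM": 1.0}

def memory_fit(kind, channels):
    """Return (memory peak in GBps, fraction of the bus it can fill)."""
    peak = PER_CHANNEL_GBPS[kind] * channels
    return peak, min(peak / FSB_GBPS, 1.0)

for kind, channels in [("RDRAM", 2), ("DDR", 2), ("DDR", 1), ("SDRAM", 1)]:
    peak, fill = memory_fit(kind, channels)
    print(f"{kind} x{channels}: {peak:.1f} GBps peak, fills {fill:.0%} of the bus")
```

Dual-channel RDRAM and dual-channel DDR both saturate the bus on paper, single-channel DDR fills about two-thirds of it, and plain SDRAM fills less than a third.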

The rest of the CPU is fairly complicated. The Northwood's pair of double-speed arithmetic logic units (ALUs) can rip through those micro-operations at twice the core clock speed. It also has a more complicated ALU to handle additional complex code at the processor’s core speed. Again, on clean, organized applications, those fast ALUs can do a lot of work. But toss a dirty, unoptimized application in, and it’s a different ball game.

While the ALUs handle integer and logic operations, a pair of floating-point processing units (FPUs) gets the complex math done. Intel claims that extra FPUs wouldn’t make a difference with this design. Some people were skeptical. However, one explanation is that there simply isn't enough capacity in the Northwood to keep more processing units busy in the long run. A double-clocked ALU will get very data hungry. At 1.4-GHz clock speed, that ALU is running at 2.8 GHz, but ramp up to a 1.7-GHz processor, and you’ve sped that ALU up to a whopping 3.4 GHz.
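For the record, the double-clocking math is nothing more than a multiplication. This small Python sketch reproduces the two figures above; the function name and the default multiplier of 2 are illustrative.

```python
def fast_alu_ghz(core_ghz, multiplier=2):
    """Effective clock of a double-clocked ALU, in GHz."""
    return core_ghz * multiplier

for core in (1.4, 1.7):
    print(f"{core} GHz core -> {fast_alu_ghz(core):.1f} GHz fast ALUs")
```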

Compiler delay
The Northwood experienced an overall architecture change. As I’ve pointed out, the Northwood likes large masses of organized data, so it needs a genius compiler to get amazing results. This little detail has been borne out with new compilers from Intel. These compilers managed to double the Northwood’s performance in some functions, which is good since a 1.7-GHz Northwood really doesn’t outrun a 1.47-GHz Athlon XP 1700+. However, it's bad, since those compilers have not yet been widely adopted. Even when they are adopted, it will be about six months before you see the majority of applications based on them. By then, the Northwood should be ramped up to about 3 GHz with plenty of double-clocked ALUs thirsting for optimized code.

Athlon XP architecture
So what do the Athlon XP’s specs tell us? Well, it has a good number of transistors done on a 0.18-micron process. However, at 37.5 million transistors, the Athlon XP does not have nearly as many as the Northwood. A gap of 17.5 million transistors is a hefty deficit between the Northwood and Athlon, but if you don’t have five finicky processing units and a huge number of processor stages, you’ve got plenty. But are they enough?

Caching in
When it comes to cache memory, the 37.5 million transistors are enough. The Athlon has a lot of cache memory—eight times as much L1 data cache as the Northwood, which gives the Athlon an immense advantage over the Northwood design. Intel’s philosophy can be boiled down to “do it right or don’t do it today.” AMD's is more like, “Be as flexible as you can without dropping the ball.”

The L1 and L2 caches are exclusive caches; they don’t hold copies of the same data. This maximizes the 256 KB of L2 cache memory available to the processor while minimizing the reliance on the slower front side bus. After all, none of us really want to sit through that 11-stage pipeline if we don’t have to. It may be half as long as the Northwood’s pipeline, but the Athlon doesn’t have that spiffy predictor to rely on.
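A quick way to see what exclusivity buys is to count how much distinct data the two levels can hold between them. The Python sketch below uses the 256-KB L2 quoted above and a 64-KB L1 data cache (eight times the Northwood's 8 KB). The inclusive case is a hypothetical contrast in which L2 mirrors everything in L1; it is not a description of any particular chip, and the names are illustrative.

```python
L1_DATA_KB = 8 * 8  # eight times the Northwood's 8-KB data cache
L2_KB = 256         # Athlon XP L2 cache, as quoted above

def unique_capacity_kb(l1_kb, l2_kb, exclusive):
    """Upper bound on distinct data held across L1 and L2, in KB.

    Exclusive: nothing is duplicated, so the sizes add up.
    Inclusive (hypothetical): L2 mirrors all of L1, so L1 adds nothing new.
    """
    return l1_kb + l2_kb if exclusive else l2_kb

for exclusive in (True, False):
    label = "exclusive" if exclusive else "inclusive"
    print(f"{label}: {unique_capacity_kb(L1_DATA_KB, L2_KB, exclusive)} KB")
```

Under these assumptions, the exclusive arrangement holds 320 KB of distinct data against 256 KB for the inclusive one, a 25 percent gain without adding a single transistor of cache.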

Quick memory
The Athlon has a relatively quick front side bus with 2.1 GBps of bandwidth. While not as fast as the 3.2 GBps the Northwood can handle, it does have the advantage of being very responsive. As I said before, Intel’s design moves large chunks of data but has problems when that data isn’t in order. The DDR memory responds to requests far quicker than Rambus and performs better on scattered data or when recovering data missing from the cache.

To take advantage of the available resources, the Athlon XP has a cache look-ahead unit. Similar to systems on the Northwood, this unit allows the Athlon to guess what data will be needed and get it during “quiet” periods. This leaves the remaining bandwidth available for other things, which increases the effective bandwidth.

The cache look-ahead feature is what enables the XP processors to be faster than the previous Athlons and enables the use of the “model number” designations that are slightly higher than the actual clock speeds. An Athlon XP 1700+ is actually a 1.47-GHz processor, but it runs like a 1.7-GHz Athlon sans the cache look-ahead capability.
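To put a number on that gap, this Python sketch compares a model number with the actual clock speed in MHz, using the XP 1700+ figures above. The function name is illustrative, and it measures only the size of the gap, not AMD's method for assigning model numbers.

```python
def rating_uplift(model_number, clock_mhz):
    """Fraction by which the model number exceeds the real clock in MHz."""
    return model_number / clock_mhz - 1

uplift = rating_uplift(1700, 1470)
print(f"Athlon XP 1700+ at 1.47 GHz: model number is {uplift:.1%} above the clock")
```

For the XP 1700+, the model number sits roughly 16 percent above the real clock speed.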

The processor core itself consists of three integer execution units (IEUs) that handle the same kinds of tasks as the Northwood ALUs. These are akin to the “big slow” ALUs, since they can handle anything tossed at them, but only once per clock cycle. In this respect, the Athlon is far superior to the Northwood at the same clock speeds at running code based on complex operations (i.e., today’s software).

And the winner is…
My vote for the winner is the Athlon XP, by a technical knockout. The Athlon's current performance is extraordinary. Intel is in a seriously unpleasant position in the market, as it must promote an expensive-to-manufacture processor that won't perform well with unoptimized applications. Intel is also transitioning sockets, making planned obsolescence a reality. Further complicating things, it has an established processor that it has to almost make fun of to get people to stop wanting it. Not to mention it has to fend off a hungry company, AMD, that has significantly cheaper components.

The Northwood’s second-class status should last until Intel’s new compilers are completed and have time to be used. Until then, the Athlon should be able to keep the lead as long as the 0.18-micron XP doesn't run out of speed before AMD can release its 0.13-micron chip.

IT pros should weigh their innate desire for speed vs. the needs of their clients. The Northwood systems are widely available from many OEMs, making them an easy choice, but the recent socket change means caution must be exercised to avoid acquiring already obsolete hardware.

Because AMD doesn’t have the deals Intel enjoys, it hasn’t been adopted as much by the larger system builders. However, HP and Compaq do have Athlon XP offerings. What AMD does offer is both single and dual processors that are typically interchangeable even with their “value” Durons, making it an ideal choice for a corporate-wide platform, especially where multiprocessor workstations are in use.
