
SolutionBase: How pipelining and multiple cores help speed CPUs

Caching is one mechanism used for increasing a chip's performance without increasing its clock speed. Here are some other things that Intel and AMD are doing to boost CPU performance.

This article is also available as a TechRepublic download.

For many years, Intel and AMD were constantly trying to one-up each other by offering the processor with the fastest clock speed. As time went on, though, it became more difficult to produce chips designed to run at higher clock speeds because of problems with heat dissipation. Consequently, Intel and AMD began looking for ways to make their chips run faster without relying solely on clock speed. As I explained in my article "What makes a fast CPU fast," caching is one mechanism used for increasing a chip's performance without increasing its clock speed. In this article, I will continue the discussion by talking about some other things that Intel and AMD are doing to boost CPU performance.


Another key factor in the way that a CPU performs is its pipeline. A pipeline is basically a queue into which CPU instructions are fed. As instructions are sent to the CPU, they are placed into the pipeline in sequential order. The instructions within the pipeline are executed in the order that they are received.

The main idea behind pipelining is that it allows a CPU to perform more efficiently. As you may know, a CPU contains several different parts. An instruction can use only one of the CPU's parts at a time. This means that, without pipelining, all the other parts of the CPU remain idle as an instruction is executed. As you can imagine, this is a terrible waste of CPU resources. Pipelining allows instructions to be executed in such a way that most of the CPU's components are in use simultaneously.

In order to understand how pipelining really works, you need to know a little bit about the instructions that are sent to a CPU. There are five basic parts to every instruction:

  1. Fetch
  2. Decode
  3. Execute
  4. Memory Access
  5. Write

I could probably write an entire book on these five parts and their implications. For the purposes of this article, though, I want to keep things simple.

The first part of an instruction is called fetching. Fetching is the process of reading a part of a program from memory. Next comes the decode sequence. Decoding involves examining the code that was retrieved during the fetch to see what kind of action must be performed, and what data is necessary to complete the action. Next comes the execute portion of the instruction. This is where all mathematical operations are performed by the CPU's arithmetic logic unit. The memory access portion of the instruction is not always performed. It is only necessary if the CPU needs to read additional data from memory or write data to memory.

Fetch, Decode, Execute, and Memory Access make up four of the five basic components of a general instruction. The fifth component is Write. I will come back to this one in a little while.

So what does all this have to do with pipelining? As I explained earlier, the various parts of an instruction that I've talked about use different components within the CPU. Pipelining makes CPU access more efficient by ensuring that most of the CPU's components are being used simultaneously.

Pretend for a moment that four instructions have been placed into a CPU's pipeline. The CPU begins working on those instructions by performing the fetch portion of the first instruction. Once the fetch is complete, the CPU can move on to the decode phase of the first instruction.

Keep in mind though that the portion of the CPU that handles the fetch function is now freed up. Therefore, the CPU can begin working on the fetch portion of the second instruction at the same time that it is working on the decode portion of the first instruction. When the CPU is ready to perform the execute portion of the first instruction, the fetch portion of the second instruction is done. Therefore, the CPU can begin working on the decode portion of the second instruction and the fetch portion of the third instruction. If you're a little confused by what's going on, then take a look at Table A.
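The overlap described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every stage takes exactly one clock cycle, nothing ever stalls, and the function names `pipeline_schedule` and `format_schedule` are ones I made up for this example rather than anything taken from a real tool.

```python
# Idealized five-stage pipeline: one cycle per stage, no stalls (assumption).
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {cycle: {stage: instruction_number}} for an ideal pipeline.

    Instruction i (1-based) occupies stage s (0-based) during cycle i + s.
    """
    schedule = {}
    for instr in range(1, num_instructions + 1):
        for s, stage in enumerate(stages):
            cycle = instr + s
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule

def format_schedule(schedule, stages=STAGES):
    """Render the schedule as a plain-text chart, one row per clock cycle."""
    lines = ["Cycle  " + "  ".join(f"{s:<7}" for s in stages)]
    for cycle in sorted(schedule):
        cells = []
        for stage in stages:
            instr = schedule[cycle].get(stage)
            label = f"I{instr}" if instr else "-"
            cells.append(f"{label:<7}")
        lines.append(f"{cycle:<7}" + "  ".join(cells))
    return "\n".join(lines)

schedule = pipeline_schedule(4)
print(format_schedule(schedule))

# Four instructions finish in 4 + 5 - 1 = 8 cycles when pipelined,
# versus 4 * 5 = 20 cycles if each had to finish before the next began.
pipelined_cycles = max(schedule)
unpipelined_cycles = 4 * len(STAGES)
```

In the printed chart, each row is a clock cycle and each column is a part of the CPU. Reading across a row shows which instruction each part is working on at that moment.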

As you can see in the diagram above, the second, third, and fourth instructions have begun before the first instruction has completed. For this reason, the length of the pipeline can have a tremendous effect on the CPU's performance. Obviously, if the pipeline is too short then CPU resources are wasted. Let me show you what happens if the pipeline is too long though.

Remember that last component of a general instruction that I haven't talked about yet? The Write component's job is to write the results of the instruction to a register within the CPU. The Write function is not always used, but it is used when mathematical operations or comparisons are performed. To see why this is important, consider the following two mathematical equations:



Granted, these instructions are simple arithmetic and not machine language code, but they will do fine illustrating my point. If you look at the second line of code, you will see that it cannot be solved until a value for X can be established. The problem is that if these were CPU instructions, the second instruction would have already begun before the first equation's results could be written to the CPU's register. This results in a CPU stall.

A stall is a condition in which instructions can't be processed until an earlier instruction completes. Newer processors contain special forwarding hardware designed to minimize the impact of dependency-based equations such as the one that I just showed you. This hardware greatly reduces the need for stalls.
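Here is a rough sketch of how a dependency turns into a stall, and how forwarding helps. It is a toy model, not a description of any real CPU. I am assuming a five-stage pipeline with one cycle per stage, that without forwarding a dependent instruction cannot execute until the earlier instruction's Write has finished, and that with forwarding the result is usable as soon as the earlier Execute has finished. The two instructions stand in for a hypothetical pair such as `X = 1 + 2` followed by `Y = X + 3`, and the function names are my own.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write"]
EXEC = STAGES.index("Execute")   # position of the Execute stage (0-based)
WRITE = STAGES.index("Write")    # position of the Write stage (0-based)

def execute_cycles(depends_on, forwarding=False):
    """Return the clock cycle in which each instruction executes.

    depends_on[i] is the index of the earlier instruction whose result
    instruction i needs, or None if it has no dependency.
    """
    cycles = []
    for i, dep in enumerate(depends_on):
        # With no stalls, instruction i would execute in cycle i + EXEC + 1.
        earliest = i + EXEC + 1
        # Instructions stay in order: never execute before the previous one.
        if cycles:
            earliest = max(earliest, cycles[-1] + 1)
        if dep is not None:
            if forwarding:
                # Result is handed over right after the earlier Execute.
                ready = cycles[dep] + 1
            else:
                # Result is only available after the earlier Write.
                ready = cycles[dep] + (WRITE - EXEC) + 1
            earliest = max(earliest, ready)
        cycles.append(earliest)
    return cycles

def stall_cycles(depends_on, forwarding=False):
    """Cycles lost by the last instruction compared with an ideal pipeline."""
    ideal_last = (len(depends_on) - 1) + EXEC + 1
    return execute_cycles(depends_on, forwarding)[-1] - ideal_last

program = [None, 0]   # second instruction needs the first one's result
stalls_plain = stall_cycles(program)
stalls_forwarded = stall_cycles(program, forwarding=True)
print("stall cycles without forwarding:", stalls_plain)
print("stall cycles with forwarding:", stalls_forwarded)
```

Under these assumptions the dependent instruction loses two cycles without forwarding and none with it. Real hardware has more cases than this (a value loaded from memory, for instance, is not ready right after Execute), so treat the numbers as illustrative only.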

Another problem with pipelining is branching. To see why branching is an issue, consider the lines of pseudocode below:

Set Y=2+X

If Y=6 then do an action else do some other action

As you can see in the pseudocode above, we are telling the CPU to perform an action if Y is equal to six and to perform a different action if Y is equal to anything else. The problem with this type of instruction is that it is conditional. There are two possible instructions that could follow this conditional instruction. When a pipeline is in use, this presents a problem because multiple instructions are being processed simultaneously in a staged manner, and the CPU must start on the next instruction before it knows which one that should be. Because branches are very common in programs, a CPU that uses a pipeline must predict the outcome of the branch instruction. A subsequent instruction would then be placed into the pipeline based on the predicted outcome of the branch.

What if the prediction is wrong though? If the branch instruction's outcome is predicted incorrectly, then subsequent instructions within the pipeline are also incorrect. When this occurs, the pipeline must be flushed. This means that all remaining instructions must be removed from the pipeline, and new instructions must be fed into the pipeline based on the outcome of the branch instruction.

The impact of a pipeline flush greatly depends on the length of the pipeline. If the pipeline is short, then it will only contain a few instructions, so it's no big deal to remove those. If the pipeline is long though, then many instructions will have to be removed, resulting in a significant CPU stall.
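To put rough numbers on this, here is a back-of-the-envelope model. It assumes one instruction completes per cycle once the pipeline is full, and that each mispredicted branch throws away `depth - 1` cycles of work (in other words, that the branch is resolved at the very end of the pipeline). Both assumptions are mine, as is the function name `total_cycles`. Real CPUs resolve branches earlier and differ in many other ways.

```python
def total_cycles(num_instructions, depth, mispredictions):
    """Estimate the cycles needed to run a program on a pipeline.

    depth          -- number of pipeline stages
    mispredictions -- how many branches were predicted incorrectly
    """
    fill = depth - 1            # cycles before the first instruction completes
    flush_penalty = depth - 1   # assumed cost of refilling after a flush
    return fill + num_instructions + mispredictions * flush_penalty

# Same program, same number of wrong predictions, two pipeline lengths.
short_pipeline = total_cycles(1000, depth=5, mispredictions=20)
long_pipeline = total_cycles(1000, depth=20, mispredictions=20)
print("5-stage pipeline: ", short_pipeline, "cycles")
print("20-stage pipeline:", long_pipeline, "cycles")
```

The model counts cycles, not seconds. A longer pipeline usually allows a higher clock speed, so each of its cycles may be shorter; the point here is only that the cost of every wrong prediction grows with the length of the pipeline.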

Although there are issues with using pipelines, a pipeline can make a CPU more efficient so long as the pipeline is of an appropriate length. If the pipeline is too short, then CPU resources may be wasted. If the pipeline is too long, then a pipeline flush can cause a major delay.

Dual-core CPUs

These days, dual-core CPUs are all the rage. As you probably already know, a dual-core processor is a processor that has two processor cores embedded into a single piece of silicon. The operating system treats these two processor cores in the same way that it would treat two physical processors. For example, the computer that I'm writing this article on has a dual-core processor. If you look at Figure A, you will see that Windows thinks that it is running on a machine with two processors.

Figure A

The Windows Task Manager shows two separate processors on a dual-core machine.

In researching this article, I ran into several Web sites claiming that computers with dual-core processors run twice as fast as comparable machines with single-core processors. This simply isn't true, for several reasons.

Even a computer with two physical processors does not run twice as fast as a comparable machine with one processor. The reason lies in the way that processors work. An application is made up of one or more processes. These processes are in turn made up of one or more threads. A thread is an individual unit of execution. A thread can't be split to run across multiple processors. Therefore, in order for a machine to see any benefit at all from having multiple processors (or a dual-core processor), it must be running multiple threads.
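The point about threads can be made concrete with a small scheduling sketch. It assumes that each thread needs a fixed amount of CPU time (in arbitrary units), that a thread runs on exactly one core from start to finish, and that handing out threads costs nothing. It ignores scheduling overhead and bus contention entirely, so it shows a best case. The function name `finish_time` is hypothetical.

```python
def finish_time(thread_costs, num_cores):
    """Time until all threads are done when threads cannot be split.

    Each thread is given, longest first, to whichever core currently has
    the least work. The answer is the load on the busiest core.
    """
    loads = [0] * num_cores
    for cost in sorted(thread_costs, reverse=True):
        least_busy = loads.index(min(loads))
        loads[least_busy] += cost
    return max(loads)

# One single-threaded application: a second core does not help at all.
one_thread_one_core = finish_time([10], num_cores=1)
one_thread_two_cores = finish_time([10], num_cores=2)

# Two threads of equal size: the best case, half the time.
two_threads_one_core = finish_time([5, 5], num_cores=1)
two_threads_two_cores = finish_time([5, 5], num_cores=2)

print("1 thread :", one_thread_one_core, "->", one_thread_two_cores)
print("2 threads:", two_threads_one_core, "->", two_threads_two_cores)
```

Even in this overhead-free model, the second core only helps when there is a second thread to give it, and the speedup only reaches 2x when the work divides evenly.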

Even if an application is made up of multiple threads, that application will not perform at double the speed if it is run on a computer with two processors. The reason is that a portion of each processor's capacity is lost to system overhead incurred in the task of determining which threads should run on which processor. Typically, multithreaded applications will see about a 50% performance gain when run on a computer equipped with two processors.

As with computers with multiple processors, computers with dual-core processors must be running multiple threads in order to see any of the performance benefits associated with having multiple cores. If a multithreaded application is running on a dual-core processor, the application will run faster than it would if it were running on a single core processor. However, the application will not run as quickly as it would on a comparable system with two physical processors.

Like a machine with two physical processors, a dual-core processor is equipped with two pipelines, and two sets of caches. Unlike a machine with two physical processors though, both cores share the same interface to the system bus. On a machine with two physical processors, each processor can access the system bus independently. Because of this, a dual-core system will perform similarly to a comparable machine with two physical processors so long as it does not have to access the system bus.

So what's the big deal about the system bus? In the first part of this article series, I explained that cache memory is a lot faster than a system's RAM. The reason for this has to do with the system bus. Cache memory is integrated into the CPU. This means that the cache memory runs at about the same speed as the CPU does. If the CPU's clock speed is 3.2 GHz for example, then the bus to the CPU cache will also run at 3.2 GHz. By comparison, there are many different speeds of system buses, but a system bus speed of 400 MHz is fairly common. In this example, the CPU cache bus is about eight times faster than the system bus.

As you can see, the system bus presents a bit of a bottleneck. That's why CPU caching makes such a big difference in a system's performance. The system bus bottleneck is amplified on dual-core systems though. Because the two processor cores share a connection to the system bus, the two cores can't both communicate with the system bus simultaneously. Instead, each core has to wait for the other core to finish using the system bus before it can communicate across the bus. This is why dual-core processors are not as fast as separate physical processors.

Although multithreaded applications are relatively rare compared to single-threaded ones, a dual-core processor is still desirable even if you don't run any. A dual-core processor's capabilities can also be realized by those who multitask. For example, right now I am burning a DVD in the background while I am writing this article. Even though Microsoft Word and my DVD burning software are both single-threaded, multiple threads are running because I'm running multiple applications, and they can therefore take advantage of multiple CPU cores.

More than just megahertz

As you can see, there are a lot of factors that affect a CPU's performance. Although clock speed used to be the predominant factor, it does not mean as much as it used to. For example, AMD processors run at a slower clock speed than comparable Intel processors, but perform comparably because they have a shorter pipeline. When selecting a processor, you should look at not only the clock speed, but also the size of the level 1 and level 2 caches and the number of cores.
