1 March, 2000

CPU and Memory Speed

A computer executes instructions. Each instruction tells the computer to add, subtract, multiply, divide, or compare two numbers to see which is larger. The rest is housekeeping.

Even an application that appears to do no arithmetic is still using numbers. Consider the automatic error correction in a word processor. If you type in " teh " the computer appears to recognize the common misspelling and changes it to " the ". What does this have to do with numbers? Well, every character that you type, including the space bar, transmits a code to the computer. The code is ASCII, and in that code a blank is 32, "a" is 97 and "z" is 122. So the computer sees " teh " as the sequence 32 116 101 104 32. The word processor has been programmed to check for this sequence, and when it sees it, it exchanges the 101 and 104. The CPU chip doesn't know about spelling, but it is very fast and accurate at handling numbers.
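The exchange described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real word processor is written; the function name fix_teh is invented for the example.

```python
# Hypothetical helper: correct " teh " by exchanging two character codes.
def fix_teh(text):
    codes = [ord(ch) for ch in text]        # characters -> ASCII codes
    pattern = [32, 116, 101, 104, 32]       # " teh " as the computer sees it
    for i in range(len(codes) - len(pattern) + 1):
        if codes[i:i + 5] == pattern:
            # exchange the 101 ("e") and the 104 ("h")
            codes[i + 2], codes[i + 3] = codes[i + 3], codes[i + 2]
    return "".join(chr(c) for c in codes)

print([ord(ch) for ch in " teh "])   # [32, 116, 101, 104, 32]
print(fix_teh("I saw teh cat"))      # I saw the cat
```

Nothing in the function knows anything about spelling. It only compares and moves numbers.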

If everything is numbers, you might think that the speed of the computer is largely determined by how fast it can add. People expect this because adding large numbers takes us a long time. Ask someone how much is 2+2 and they will respond immediately 4. Ask how much is 154373 + 382549 and they will stop for a minute and take out a pencil. A computer adds numbers with electronic circuits that work just as fast for large or small numbers. Arithmetic is what computers do best, and they do it almost instantly.

A CPU is measured by how many instructions it can process in a second, not by how long it takes to process any single instruction. Consider a fast food counter. They have a bunch of lines, several people working the counter, and lots of people in back cooking the food. They measure themselves by how many customers they serve in any period of time. When you come to the front of the line, the item you want may be temporarily unavailable and you have to step aside. It might take you unusually long to get your burger, but lots of other people are being served during the period. To you, the service is slow. To the business, they are moving lots of people through.

In the same way, a CPU is designed to fetch programming, fetch data, and execute instructions. Sometimes a particular instruction needs data that is not immediately available. All modern processors can push the instruction aside and have it wait while subsequent instructions are serviced. Speed is measured by the overall throughput of the chip.

Clock Speed: Tell Me When it Hertz

Computer performance is a traffic problem, moving data and instructions from memory and around inside the chip. Most people think of "traffic" in terms of cars and highways. However, there is a more relevant traffic analogy that everyone experienced before they learned to drive.

Students have been sitting in class for a long time. Finally the bell rings throughout the school signaling the end of the current period. Everyone gets up and moves through the hall to their next classroom. After a few minutes the bell rings again to signal the start of the next period. The bell has to ring everywhere in the school at the same time to coordinate movement. If each teacher decides independently when a period is over, then students will frequently arrive at the next room and find it filled with the previous class.

The various parts of a computer hold instructions and data. Periodically they send this data along wires to the next processing station. To coordinate this activity, the computer provides a clock pulse. The clock is a regular pattern of alternating high and low voltages on a wire. The clock speed is measured in Megahertz. One Megahertz (1 MHz) is a signal that alternates between high and low values one million times a second. A 66 MHz PC has a clock which "ticks" and "tocks" 66 million times each second. Each tick-tock sequence is called a cycle. The clock pulse tells some circuits when to start sending data on the wires, while it tells other circuits when the data from the previous pulse should have already arrived.

The earliest PC had one clock, and its signal applied to the CPU, memory, and all the I/O devices. A modern PC has many different clock signals for different areas of the machine.

  • The CPU and memory receive a clock pulse of 66, 100, or 133 MHz from the mainboard. Intel doesn't currently like the 133 MHz speed for memory and has a few motherboards that give a 133 MHz clock to the CPU and a separate 100 MHz clock to the memory.
  • Inside the CPU a faster clock rate is generated synchronized to the external clock. For example, if the CPU runs an internal clock pulse that is 5 times as fast as a 100 MHz main clock, then the CPU rate is 500 MHz.
  • The PCI I/O bus runs at 33 MHz. Typically this is accomplished by dividing the 66 MHz clock by two, or the 100 MHz clock by three.
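The relationships among these clocks are simple multiplication and division. The following Python sketch works through the numbers from the list above; the function names are invented for the illustration, and a real mainboard sets these ratios in hardware or in the setup panels.

```python
# Illustrative only: derive the internal CPU clock and the PCI clock
# from the mainboard clock.
def cpu_clock_mhz(bus_mhz, multiplier):
    return bus_mhz * multiplier

def pci_clock_mhz(bus_mhz):
    # a 66 MHz board divides by 2, a 100 MHz board divides by 3
    divider = {66: 2, 100: 3}[bus_mhz]
    return bus_mhz / divider

print(cpu_clock_mhz(100, 5))          # 500
print(pci_clock_mhz(66))              # 33.0
print(round(pci_clock_mhz(100), 1))   # 33.3
```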

If you buy a mainboard manufactured by Intel, or a system from a well known corporate supplier like IBM or Dell, then the clock rate will be fixed at the official standard value for that type of CPU. However, the clock rate can typically be set to other values, and most third party equipment suppliers provide boards where the clock rate can be configured from the power-up setup panels. When the customer increases the clock rate beyond the standard values, this is called "overclocking".

Intel designs a particular CPU chip to run at a range of speeds. When they finish with the design and engineer the fabrication process, they have spent a couple of billion dollars. To be safe, they probably design in a slop factor just in case something is slightly wrong. Then they start to manufacture chips. A lot of the chip price goes to recover the original design and setup cost.

Intel also sets a price range for the various speeds based on what they expect the customers will be willing to pay. However, if they have done their engineering right, each chip is probably a little better than it needs to be. Sometimes Intel has a shortage of the lower priced chips and a surplus of the faster chips, and the easiest way to solve the problem is to remark the better chips to the lower price. Because of this, home users are often successful in overclocking the lowest speed chips in any given family.

The Celeron processor is supposed to run at a 66 MHz clock rate, but it can often run successfully at a 75 or 83 MHz rate. The problem is that this also speeds up the PCI bus beyond its official 33 MHz speed and some of the adapter cards won't work correctly. You also have to put faster memory in the machine so it can keep up with the overclocking. 

You can spend a lot of time configuring, testing, and reconfiguring a system. If you have any real work to get done, overclocking usually isn't worth the effort. People do it as a hobby. If it makes you feel like an outlaw, take a walk on the wild side.

The only unambiguous measure of advanced technology is the width of circuits in the CPU chip. Smaller circuits draw less power, generate less heat, and can run at faster clock speeds. The first Pentium II processors were built on .35 micron circuit sizes and were powered by 2.8 volts. That was quickly replaced by .25 micron circuits running first at 2.2 volts and then 2.0 volts. The current state of the art is .18 micron circuits powered at 1.6 volts. The changes in size and power explain why it is not possible to simply replace an old 233 MHz Pentium II processor with a new 733 MHz Pentium III. Plug a 1.6 volt chip into a 2.8 volt socket and it will burn out.

Nanoseconds

All the ads and specifications quote clock speed in Megahertz. However, the more important number is the length of time between clock ticks (the cycle time). Such periods are usually measured in nanoseconds (billionths of a second) abbreviated "nsec."

Electricity travels through a copper wire just a bit slower than the speed of light. Normally, we can just regard the speed of light as "very fast." It becomes important when the distances are very long (astronomy) or when the times are very short (computers). A nanosecond is the amount of time that it takes light (or an electric signal) to travel about one foot.

PC clock speeds appear at first to be a strange collection of numbers. However, the corresponding cycle times display a much more regular pattern:

        Clock     Cycle
        66 MHz    15 nsec
        100 MHz   10 nsec
        500 MHz    2 nsec

A processor with a 500 MHz clock must perform operations in less time than it takes for electricity to travel 2 feet. The chip is very small, but it has millions of circuits. All must be manufactured to a very high level of precision.

However, it is much simpler to apply quality control to a chip the size of a fingernail than to the entire mainboard. This by itself shows the problems of a higher speed main clock, and the benefit of capping the I/O bus design at 33 MHz (30 light-feet of signal distance).
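The conversion from clock speed to cycle time is one division: 1000 divided by the clock rate in MHz gives the cycle time in nanoseconds. A small Python sketch, using the rule of thumb above that a signal covers about one foot per nanosecond (the function name is made up for the example):

```python
# Cycle time in nanoseconds for a clock rate given in MHz.
def cycle_time_nsec(clock_mhz):
    return 1000.0 / clock_mhz

for mhz in (33, 66, 100, 500):
    nsec = cycle_time_nsec(mhz)
    # roughly one foot of signal travel per nanosecond
    print(f"{mhz} MHz -> {nsec:.1f} nsec, about {nsec:.0f} feet")
```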

Instructions per Cycle: Get in Gear

To add up a column of numbers with a pocket calculator, you simply type each number in and press the "+" key (or the "=" key at the end). Most users probably think that a PC spreadsheet program does the same thing. However, the human brain has actually been doing the hard part of the operation, moving down one row in the column, focusing on the number, and recognizing it. Each PC instruction carries with it a number of additional operations that would not be obvious to the casual user.

First, the computer must locate the next instruction in memory and move it to the CPU. This instruction is coded as a number. The computer must decode the number to determine the operation (say ADD), and the size of the data (say 16-bits). Additional information is then moved and decoded to determine the location in memory (the row and column of the spreadsheet). Finally, the number is added to the running total. Although a human might take some time to add two eight digit numbers together, the addition is the simplest part of the operation for a computer chip. Decoding the instruction and locating the data take the most time.

Each generation of Intel CPU chip has performed this operation in fewer clock cycles than the previous generation.

  • A 386 CPU required a minimum of 6 clock ticks to add two numbers.
  • A 486 CPU can generally add two numbers in two clock ticks.
  • A Pentium CPU can add two numbers in a single clock tick.
  • A Pentium II, III, Celeron, or Xeon can add two numbers in a single clock tick. If it discovers that the next instruction needs data that hasn't arrived from slow memory, it can rearrange things to execute subsequent instructions until the data arrives. 

To make a car go faster, one steps on the accelerator. Extra gas makes the engine rotate faster. When RPM gets high enough, it is better to shift to a higher gear. The PC system clock (measured in MHz) is like the engine speed (measured in RPM). The CPU model selects the gear. The original 8086 processor was like first gear, and the 486 is like fourth gear. So it is a mistake to compare clock speed across changes in the architecture.
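To see why the "gear" matters as much as the clock, take the ticks-per-add figures listed above and run every chip at the same clock. The 33 MHz figure in this Python sketch is just an example value chosen for the comparison, and the function name is invented.

```python
# Clock ticks needed for one add, as listed above.
TICKS_PER_ADD = {"386": 6, "486": 2, "Pentium": 1}

def adds_per_second(cpu, clock_mhz):
    return clock_mhz * 1_000_000 / TICKS_PER_ADD[cpu]

# Same clock, very different results.
for cpu in TICKS_PER_ADD:
    print(cpu, f"{adds_per_second(cpu, 33):,.0f} adds per second")
```

At the same engine speed, the Pentium covers six times the distance of the 386.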

The High School Analogy

The first generation of PC CPU chips was like a one room schoolhouse. A class of students could enter and be seated. The first period would be English. When the bell rings, they switch books and take a period of Math. Then History, a Language, and finally Science. When all the subjects are done, they get up, leave the school, and another class can enter, sit down, and take their classes.

If you want the school to educate students more efficiently, you could try to shorten the periods (speed up the clock). However, you can also speed up things by building more classrooms. That is what happened with the 286, 386, and 486 generations of chips. In a school designed like a 486, there is one classroom for each subject. When the bell rings, the students in the English room move to Math, the Math students move to History, and so on. The students in the last class, Science, leave the school. A new class enters and sits down in the English classroom to begin their sequence of subjects.
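The gain from adding classrooms can be counted in bell periods. The Python sketch below assumes the five subjects named in the analogy and ignores any stalls; the function names are invented for the illustration.

```python
SUBJECTS = 5  # English, Math, History, a Language, Science

def periods_one_room(classes):
    # each class finishes every subject before the next class may enter
    return classes * SUBJECTS

def periods_one_room_per_subject(classes):
    # the first class needs SUBJECTS periods; after that, one class
    # leaves the school at every bell
    if classes == 0:
        return 0
    return SUBJECTS + (classes - 1)

print(periods_one_room(100))              # 500
print(periods_one_room_per_subject(100))  # 104
```

No single class gets through school any faster, but the school as a whole turns out almost five times as many classes in the same number of periods.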

Each new generation of chips typically triples the number of circuits of the previous generation. So the fifth generation chip, the Pentium, added a complete second set of classrooms. Now two classes would take each subject at the same time.

Dependent Instructions

If the Pentium High School has two English (first period) classrooms, then every time the bell rings two new classes of students can begin studying. In a real CPU chip, this means that with every tick of the internal clock two new instructions can begin execution. However, occasionally one computer instruction depends on the results of the previous operation. For example, to find the average of two numbers you first add them together and then divide the sum by two. If the ADD and DIVIDE instructions are both waiting to execute, they cannot both begin execution at the same time. The ADD has to start first, because only after the sum has been calculated can the DIVIDE be executed.

The Pentium chip checks each pair of instructions as they are about to execute. If the two instructions are independent (that is, if the second instruction does not use the results of the first instruction) then both can start running at the same time. If the second depends on the previous instruction it is held up, and in that clock tick only the first instruction begins execution (in the analogy, one of the English classrooms is empty). The next time the clock ticks again, the second instruction (and maybe the one after it) will begin execution and the previous instruction will advance to the next phase of processing.
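The pairing check can be sketched in Python. This is a simplified model of the rule as described above, not the actual Pentium pairing logic, which has additional restrictions. Each instruction is written as (destination, source, source), and all names are invented for the example.

```python
def issue_schedule(program):
    """Group instructions by the clock tick in which they begin."""
    schedule = []
    i = 0
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program):
            dest = program[i][0]
            nxt = program[i + 1]
            # pair only if the second does not use the first's result
            if dest not in nxt[1:]:
                group.append(nxt)
        schedule.append(group)
        i += len(group)
    return schedule

program = [
    ("sum", "a", "b"),      # ADD:    sum = a + b
    ("avg", "sum", "two"),  # DIVIDE: avg = sum / two (needs sum)
    ("x", "c", "d"),        # unrelated instruction
]
for tick, group in enumerate(issue_schedule(program), start=1):
    print(tick, [ins[0] for ins in group])
```

In the first tick only the ADD begins. In the second tick the DIVIDE begins together with the instruction after it.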

Memory Access Delay

An ADD instruction adds two numbers together. In order to execute this instruction, the CPU first has to get the instruction itself from memory and then fetch one or both numbers. To speed processing, every modern CPU chip has two types of internal memory called cache. Cache memory holds the most recently used sections of programs and data.

The best type of internal memory is the Level 1 (L1) cache. This memory is part of the CPU core along with the units that decode instructions and perform arithmetic. If the instruction and data are in L1 cache then the CPU can execute at full speed. The modern Intel processors have 32K of L1 internal cache. Competing processors from AMD and Via have more L1 cache.

When the instruction or data is not found in the L1 cache, modern processors have a larger amount of Level 2 cache either integrated into the CPU chip or mounted with the CPU chip inside the processor cartridge. The Pentium II had 512K of L2 cache memory operating at half the speed of the CPU. The Celeron and newer Pentium III chips have 128K and 256K respectively of L2 cache in the CPU chip operating at full processor speed.

If the processor needs an instruction or data residing in the L2 cache, then it must wait 2 to 4 cycles for the L2 cache to supply that data. Often the CPU can find other instructions pending execution that are not blocked by the missing data that can be executed while waiting. 

The main memory, however, is Dynamic Random Access Memory (DRAM). DRAM is much slower than the CPU or any of its caches. A 500 MHz CPU clock has a cycle time of 2 nanoseconds, but DRAM takes a minimum of 60 nanoseconds to respond with new data. No matter how clever a Pentium III chip may be about juggling the order of execution, it will eventually have to wait for the data and miss the opportunity to execute a substantial number of the 60 instructions it could have processed during the period.
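The figure of 60 instructions follows from the numbers above if the chip can start two instructions in each cycle. That rate of two per cycle is an assumption in this Python sketch, taken from the description of the Pentium's two sets of "classrooms"; the function name is invented.

```python
def lost_instruction_slots(clock_mhz, dram_latency_nsec, per_cycle=2):
    cycle_nsec = 1000.0 / clock_mhz               # 2 nsec at 500 MHz
    cycles_waiting = dram_latency_nsec / cycle_nsec
    return cycles_waiting * per_cycle             # assumed 2 per cycle

print(lost_instruction_slots(500, 60))  # 60.0
```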

RISC Architecture

The first Intel "CPU on a chip" was the 4004 processor. It was more like a pocket calculator than a real computer. It handled ordinary base 10 digits encoded as four bits. Later chips added the ability to handle 8 bit, 16 bit, and 32 bit numbers. So on a modern Intel CPU chip there is no single Add instruction. Instead, there are separate Add operations for digits, bytes, and every other size of number. The resulting set of possible instructions is a mess. This is typical of a "Complex Instruction Set" computer chip.

In your Sunday paper, right next to the CompUSA insert there is probably something from Sears. Look at the last few pages of the ad, where they show the tools. There will almost certainly be a picture of the traditional "190 Piece Socket Wrench Set." If you purchased this item, you would always have the right tool for any job. In reality, it is almost impossible to keep all the pieces organized, and you will spend minutes searching through all the attachments to find one of the right size.

Go to a tire store. They lift your car off the floor, remove the hubcaps, and then pick up a gun shaped device connected to a hose. "Zuuurp" and each bolt comes off the wheel. You could do the same thing with the 190 Piece Socket Wrench Set, but every garage knows that automotive wheel bolts come in only one size. So they don't have to spend time searching for the right size tool, and they can optimize the one size that they really need.

When computer designers realized the same thing, it was called Reduced Instruction Set Computers or RISC. Make all the instructions the same size. Use only one size of data. Simplify the instructions and therefore the operation decode. Then use all the room on the chip to optimize what is left, rather than filling the chip with support for instructions that are seldom executed.

Two or three years ago it was possible to argue that the future belonged to RISC computers. Given the technology available at the time, they were smaller, faster, cheaper, and easier to build than conventional computer chips. In a joint project, IBM, Apple, and Motorola developed the PowerPC chip and Apple proceeded to convert its entire Macintosh line to use it. DEC developed its family of Alpha CPUs, and Sun has its SPARC family.

Then all the other vendors sat back and waited for the Intel architecture to hit the dead end that they all predicted was inevitable. However, there is a funny thing about silicon. Technology doubles the power of chips every 18 months, and there are economies of scale when you are selling millions of chips every month.

The advantage of a Reduced Instruction Set turned out to be important in the period when chips had 2-3 million transistors (during the period of the late 486 chips and the early Pentium chips). When the PowerPC was first announced, it was billed as having "the power of a Pentium at the price of a 486." Due to software problems, IBM delayed any wide distribution of PowerPC systems, and the window of opportunity was lost. By the time that any systems other than Apple used the PowerPC, Pentium chips were selling for less than the 486 chips used to cost, and the Pentium Pro combined the best of RISC and conventional chip design.

RISC chips are still widely used in Unix systems, and they can run Windows NT. It is likely that some type of RISC multiprocessor will remain the most powerful choice for a dedicated file and database server when raw power is important and price is less important. However, at this time it appears that Intel is unstoppable and that RISC systems will never capture a significant share of the desktop or laptop market.

Superscalar, Pipeline, and Multimedia

Although a tire store may be fast at changing tires, when you really need speed look at how they do things in Indianapolis. A race car pulls into the pit for service. They jack it off the ground, and then four teams of mechanics go to work on all four wheels simultaneously. The car is back in the race in a matter of seconds. In ordinary life, such service would be prohibitively expensive. But in the world of microelectronics, transistors are cheap.

A pipeline is the sequence of processing stations that decode instructions, fetch data, perform the operation, and save the results. Inside the CPU, instructions are processed at a sequence of stations that resemble an assembly line. Memory is pipelined when the CPU can request a sequence of addresses one per cycle and then after a delay of typically four or five cycles the memory responds with the data in the order in which it was requested. 

A computer is superscalar when it can execute more than one instruction per clock cycle. Since the Pentium chip, Intel processors have been able to execute two instructions per cycle overall.

A CPU is typically a general purpose device. It can perform any type of calculation equally well. There are, however, certain special calculations which occur frequently enough to justify special support.

If you buy an audio CD in a record store, the music is encoded onto the disk as digital data. Sound is a regular vibrating difference in air pressure. The data on the CD represents a regular sample of air pressure measured by one or more microphones. When it is played, a stereo system reproduces those variations in air pressure by vibrating the cones contained in the speakers.

If you listen carefully, you can pick out one instrument in an orchestra or the voice of one person from the background noise of a cocktail party. In technical jargon, this is called "signal processing". Before computers, signal processing was done with filters that blocked certain sounds or frequencies. Today, audio and video signals can be processed by a special computation.

There is a family of applications. Used one way it can take the pops and scratches out of an old recording. In another case, it can be used to colorize old black and white movies. It is also used to compress the video signals in direct satellite broadcast.

It is possible to buy specialized computer chips that do only digital signal processing. For $15 you can get a chip that does hundreds of millions of instructions per second. Such chips are built into modems and some sound cards.

The problem isn't the cost of the chip, but having the room to mount it inside a PC. Intel has begun to add special support for the signal processing computation with a set of multimedia instructions inside their CPUs. Although such computations could be done using normal instructions, the extra hardware support produces a particularly high level of superscalar performance as many calculations can be performed at the same time.

Memory Architectures

The newspaper ad offers a computer system with a 500 MHz processor and 64 Megabytes of RAM. In other words, we are interested in how fast a CPU is, but we measure the amount of memory, not the speed.

There are lots of different CPU speeds, but DRAM is a commodity item sold in bulk and all the vendors produce essentially the same technology. Since the motherboards are all configured to support industry standard parts, there is no advantage if a vendor produces memory that is, say, 5% faster than the standard.

Modern Synchronous DRAM has two performance numbers. The first is latency, the delay between the time that a particular data item is requested and the time when the memory can reliably transmit the data back to the CPU. Modern DRAM typically has a 50 nanosecond latency. It is important to remember that latency is measured at the memory chip. Before the chip can begin, the signal has to exit the CPU, be processed by the memory controller, and run down the wires on the mainboard. Exact numbers on this additional mainboard overhead delay are not easily available, but they are substantial.

The second performance number is throughput, the rate at which SDRAM can return additional data from the same general area of memory. This is represented by the memory clock rate: current SDRAM parts are PC66, PC100, and PC133, for a 66, 100, and 133 MHz memory bus clock rate. The most popular current speed is PC100 memory, which at a rate of 100 MHz can return data every 10 nanoseconds.

So in a system with a 100 MHz "frontside" bus, to transfer a chunk of 32 bytes of data down the 8 byte (64 bit) bus, the CPU will send out the address and wait a minimum of 5 memory bus clock cycles (50 nsec latency) plus maybe an extra cycle or two for the mainboard overhead delay. Then it will receive 8 bytes of data every memory bus clock cycle for the next four cycles. That represents a minimum of nine 100 MHz clock cycles to complete the entire 32 byte transfer.
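The count of nine cycles can be reproduced with a short Python sketch. It leaves out the mainboard overhead, since exact numbers for that are not available; the function and parameter names are invented for the example.

```python
import math

def burst_cycles(bus_mhz=100, latency_nsec=50, line_bytes=32, bus_bytes=8):
    cycle_nsec = 1000.0 / bus_mhz
    wait = math.ceil(latency_nsec / cycle_nsec)   # cycles with no data
    transfer = line_bytes // bus_bytes            # cycles moving data
    return wait + transfer

print(burst_cycles())  # 9
```

More than half of the transfer time is spent waiting for the first 8 bytes.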

Wider Bus

For the last 10 years, the only way to increase memory speed was to widen the memory bus. Since conventional memory parts hold and transfer 8 bytes of data in a memory bus clock cycle, adding another independent row of standard memory chips creates a second memory path that can transfer another 8 bytes. This brute force solution now doubles the memory throughput to 16 bytes every 10 nanoseconds.

DDR

A more sophisticated and less expensive solution is available with something called Double Data Rate or DDR memory. DDR requires a small change to current memory design. Each memory part (each DIMM) is populated with chips that will deliver twice as many bits (16 bytes of data) in response to each request. However, some additional logic on the DIMM presents only the first 8 bytes of data and holds back the second 8 bytes of data for a half a clock tick.

Every computer clock is represented by a signal where the voltage goes to a high value for half the clock time (the "tick") and then goes to a low value for the other half of the clock time (the "tock"). Data is normally transferred only at the moment when the voltage rises from its low to its high value (at the start of the "tick"). However, with a bit more logic, you can also transfer data when the voltage drops from the high value to the low value. This allows a 100 MHz clock to transfer data twice per clock cycle, matching the speed that traditionally required a 200 MHz clock. DDR SDRAM transfers the second 8 bytes when the clock signal drops, producing the same performance boost as adding a second memory bus but without the extra wires or sockets.

Rambus (RDRAM)

The Intel strategy backed a new, powerful, but very expensive memory technology invented by Rambus. Conventional memory up to and including SDRAM is pretty dumb. It sits there till the CPU presents an address, then after the latency period it responds with the data. There is no processing power or control logic on the memory chip.

Rambus DRAM (RDRAM) puts processing power on the memory chip and then connects the memory to the system with a more sophisticated bus. Instead of the conventional 64 bit (8 byte) path, RDRAM transfers only two bytes of data every clock tick. However, it runs the bus using a 600, 700, or 800 MHz clock. So after a typical 10 nsec period in which the SDRAM 100 MHz clock ticks once and transfers 8 bytes, the 800 MHz RDRAM clock ticks 8 times, and transfers 2 bytes per tick, for a total of 16 bytes or twice what SDRAM delivered.

RDRAM is based on the same basic memory technology used in SDRAM. The chip technology is limited to delivering a new unit of data every 10 nsec (corresponding to 100 MHz SDRAM). So how can RDRAM deliver data at 800 MHz? The trick here is that RDRAM doesn't manage all of the memory chips according to a single clock. Instead, chips are grouped into logical "devices". Each "device" can deliver two bytes of data. The RDRAM bus staggers the processing on the chips so that each device starts 1/8 of a clock tick after the previous device and is therefore ready to deliver data 1/8 of a clock tick later. Although no single chip is delivering data more than once every 10 nsec., the entire array is delivering data eight times as fast.

It should be noted that RDRAM also transfers data on both the rise and fall of the clock signal. Therefore, although we talk about "800 MHz" as the memory speed, the actual memory bus clock is running at 400 MHz with a data transfer at every "tick" and again at every "tock". 
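The throughput figures for the memory types discussed so far can be put side by side by counting the bytes delivered in one 10 nsec period. This Python sketch uses only the bus widths and clock rates given above and ignores latency; the function name is made up for the comparison.

```python
def bytes_per_10_nsec(clock_mhz, bytes_per_transfer, transfers_per_cycle=1):
    cycles = clock_mhz / 100     # a 100 MHz clock ticks once per 10 nsec
    return cycles * transfers_per_cycle * bytes_per_transfer

print(bytes_per_10_nsec(100, 8))      # PC100 SDRAM: 8.0
print(bytes_per_10_nsec(100, 8, 2))   # DDR on a 100 MHz clock: 16.0
print(bytes_per_10_nsec(400, 2, 2))   # "800 MHz" RDRAM: 16.0
```

On paper, DDR and the fastest RDRAM deliver the same 16 bytes in the period. They get there by very different routes.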

The SDRAM bus is idle during the latency period. RDRAM is fully pipelined, overlapping the latency of the next memory request with the transfer of the previous request. It claims to keep the bus 95% busy, generating what is claimed to be 300% more data throughput than SDRAM.

The bad news is that RDRAM today costs about 300% more than the same amount of SDRAM. This is not a good idea for the casual home or business desktop user. RDRAM is likely to remain, at least for some time, a special feature of high end engineering and media workstations and database or Web application servers.

The Burst

It would be inefficient to track data byte by byte. To optimize the cache, the CPU references data in 32 byte units each starting at an address that is evenly divided by 32. When an instruction or data is found to be in one of these blocks of memory, the entire 32 bytes are read by the CPU and are stored in the L1 and L2 cache. 
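Finding the 32 byte block that contains a given address takes one masking operation: clear the low five bits of the address. A minimal Python sketch follows; the function names are invented, and the address 1000 is just an example.

```python
LINE = 32  # bytes per block, as described above

def line_start(address):
    return address & ~(LINE - 1)   # clear the low 5 bits

def line_range(address):
    start = line_start(address)
    return start, start + LINE - 1

print(line_range(1000))  # (992, 1023)
```

A reference to any one byte from 992 through 1023 brings all 32 bytes into the cache.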

Since the memory bus is eight bytes wide, it takes 4 memory bus clock cycles to complete the transfer. This is called a burst. Once Intel established this cache architecture, the memory vendors could optimize memory designs for the processing of the burst.

The CPU begins by presenting the address of some data to the memory controller. It must then wait for a period equal to the latency of the memory (typically more than 50 nsec) before reading any response. The latency is inherent in the way DRAM is constructed and there are no tricks that can improve it.

Back in the days of the early 486 chips, the CPU had to wait the full latency period for every memory transfer. Then a sequence of improvements (fast paged, EDO, and finally SDRAM) were able to reduce the delay for the second, third, and fourth memory value in the burst sequence. So to get 32 bytes of data from modern SDRAM, the CPU may have to present the address, wait 50 nsec (5 ticks of a 100 MHz bus) without getting any data, and then get 8 bytes of data in each of the next four 10 nsec periods.

Cache in your Chips

Just how much cache does a CPU need? Not surprisingly, the answer depends on what you use it for. Remember that the 486 was designed with only 8K of cache memory, yet that was enough so that more than 90% of potential memory references were resolved by data located in the cache. With cache, a little goes a long way.
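A rough way to see why a little cache goes a long way is to average the cost of a memory reference over hits and misses. The cycle counts in this Python sketch (1 cycle for a cache hit, 10 cycles for a trip to main memory) are illustrative assumptions, not measurements of any particular chip, and the function name is invented.

```python
def average_cycles(hit_rate, hit_cycles=1, miss_cycles=10):
    # weighted average of the two possible costs (assumed values)
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

print(f"{average_cycles(0.0):.1f}")   # no cache: 10.0
print(f"{average_cycles(0.9):.1f}")   # 90% hits: 1.9
```

Under these assumptions, a 90% hit rate cuts the average cost of a memory reference by about a factor of five.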

All Intel processors have 32K of L1 cache. The L2 cache is:

        Processor                   L2 Size    L2 Speed compared to CPU
        Celeron                     128K       Full
        Pentium III (Coppermine)    256K       Full
        Pentium II, other III       512K       Half
        Xeon                        up to 2M   Full

If you run a word processor, spreadsheet, or Web browser then even the Celeron has enough cache. Additional cache is needed for unusual computationally intensive operations, like Photoshop or other forms of media compression. Intel designed Xeon processors with large cache memory to support multiprocessor servers running database or other memory intensive operations.

All other things equal, more cache is better than less. Clearly no desktop user is going to blow $3800 to get a Xeon processor that, for ordinary applications, will be almost indistinguishable from a $150 Celeron. Just because a Pentium III system is within reach, is it really worth the money?

Vendors have an incentive to sell more expensive units. Some customers will opt for the more expensive machine because they think they're worth it. However, most casual users will get along quite nicely with a Celeron. The Pentium III is engineered best for a workstation or server with two CPUs. If it ever makes sense, the Xeon is designed for corporate servers with 4 or 8 processors.

Athlon

AMD also makes CPU chips with compatible instructions that compete with Intel. Their high-end chip is called Athlon, and it competes directly with the best Intel processors. Generally, Athlon chips have more L1 and L2 cache than Intel chips and they can execute more instructions per clock tick. So at any given clock speed (say 800 MHz), an Athlon chip will be measurably faster than an Intel Pentium III and probably cost a little less money.

The problem with Athlon has been that mainboard logic has been unable to match the features of the CPU. For example, the Athlon processor memory bus operates like DDR memory. It transfers data on both the rise and fall of the clock. Therefore, while the Athlon memory clock runs at 100 MHz, it has the data throughput of a 200 MHz system. To feed this CPU at full speed, a motherboard vendor would have had to put in a second parallel memory bus (which is complex and expensive). So the Athlon has been running with underperforming memory configurations. Once DDR SDRAM memory becomes widely available, it will exactly match the Athlon design and this AMD chip may significantly outperform Pentium III systems.

Willamette

Sometime before the end of 2000, Intel will come out with its next generation 32-bit CPU chip. We are not talking about some minor update, like the transition from Pentium II to Pentium III. This is a whole new design. In addition to faster clock speeds, larger cache, and more internal parallel processing, the Willamette chip will have the equivalent of a 400 MHz memory bus clock. That's three times the memory throughput of the best current Intel CPU chips (with a 133 MHz memory bus) and twice the speed of current Athlon chips.

The Willamette chip will have a memory bandwidth that finally matches the best current mainboard. Intel's 840 mainboard chipset supports two parallel RDRAM memory buses with an aggregate memory throughput of 3.2 gigabytes per second, and that is exactly the throughput of the Willamette chip.
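The 3.2 gigabytes per second figure can be checked with a small calculation. The sketch below is only illustrative: the helper name is hypothetical, the width, clock, and multiplier values are the ones this article lists for each bus, and a gigabyte is taken to be 1000 megabytes.

```python
def peak_bandwidth_gb_per_s(width_bytes, clock_mhz, multiplier=1, channels=1):
    """Peak transfer rate in GB/sec (1 GB = 1000 MB), ignoring latency."""
    return width_bytes * clock_mhz * multiplier * channels / 1000


# Figures taken from the article: bus width in bytes, clock in MHz,
# transfers per clock, and number of parallel buses.
willamette = peak_bandwidth_gb_per_s(8, 100, multiplier=4)
dual_rdram_800 = peak_bandwidth_gb_per_s(2, 400, multiplier=2, channels=2)
pentium_iii_133 = peak_bandwidth_gb_per_s(8, 133)
athlon = peak_bandwidth_gb_per_s(8, 100, multiplier=2)

print(f"Willamette bus:        {willamette} GB/sec")
print(f"Two RDRAM 800 buses:   {dual_rdram_800} GB/sec")
print(f"Pentium III (133 MHz): {pentium_iii_133} GB/sec")
print(f"Athlon:                {athlon} GB/sec")
```

Both the Willamette bus and the pair of RDRAM 800 buses come out at 3.2 GB/sec, which is roughly three times the 133 MHz Pentium III figure and twice the Athlon figure. These are peak rates only and say nothing about latency.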

Summary

Memory Chart

Type          Width (bytes)   Clock (MHz)   Multiplier   Effective Clock (MHz)   Bytes/10 nsec   Performance
PC100 SDRAM   8               100                        100                     8               100%
DDR SDRAM     8               100           x2           200                     16              133%
RDRAM 600     2               300           x2           600                     12
RDRAM 800     2               400           x2           800                     16              300%

Huh! Why is it that DDR SDRAM appears in all but the last column to have the same performance as RDRAM, but then it only gets a 33% boost while RDRAM is 300% better? The answer is in the latency period. With ordinary and DDR SDRAM, the memory sits idle from the cycle when the CPU requests data to the end of the latency period when the data transfer can begin. Consider a simple chart where the latency is four cycles and the data transfer (for 100 MHz SDRAM) is also four cycles. Doubling the speed of data transfer drops the second part from four cycles to two, but it leaves the latency unchanged. Therefore, the burst drops from 8 cycles (4+4) to six cycles (4+2). That improves the throughput of the system to 8/6 or 133%.
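The 4+4 versus 4+2 reasoning above can be written as a tiny model. This is a sketch under the same simplifying assumption as the text (a fixed four-cycle latency followed by the transfer); the function name is made up for the example.

```python
def relative_performance(latency_cycles, transfer_cycles, multiplier):
    """Speedup from transferring data `multiplier` times faster
    while the latency stays the same."""
    baseline = latency_cycles + transfer_cycles             # e.g. 4 + 4 = 8
    faster = latency_cycles + transfer_cycles / multiplier  # e.g. 4 + 2 = 6
    return baseline / faster


ddr = relative_performance(latency_cycles=4, transfer_cycles=4, multiplier=2)
print(f"DDR SDRAM vs PC100 SDRAM: {ddr:.0%}")  # 133%
```

The same model shows why raising the transfer rate alone has a ceiling: with a four-cycle latency, even an arbitrarily fast transfer could never do better than 8/4, or 200%.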

However, RDRAM overlaps latency with the transfer of previous memory requests. It can use as much as 95% of the total theoretical bus bandwidth. That's 380% of PC100 SDRAM, but RDRAM advocates round that down to 300% so as not to overstate the case.
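One way to arrive at the 380% figure is sketched below. It is an interpretation, not a calculation the article spells out: it assumes the simple four-cycle latency plus four-cycle transfer model for PC100 SDRAM, and it applies the 95% figure to the 16 bytes per 10 nsec peak listed for RDRAM 800 in the Memory Chart.

```python
PC100_PEAK = 8                # bytes per 10 nsec, from the Memory Chart
RDRAM800_PEAK = 16            # bytes per 10 nsec, from the Memory Chart
LATENCY_CYCLES = 4            # simple model used in the text (assumption)
TRANSFER_CYCLES = 4           # simple model used in the text (assumption)
RDRAM_BUS_UTILIZATION = 0.95  # "as much as 95%" from the text

# PC100 SDRAM only moves data during the transfer part of each burst.
sdram_utilization = TRANSFER_CYCLES / (LATENCY_CYCLES + TRANSFER_CYCLES)
sdram_effective = PC100_PEAK * sdram_utilization          # 4 bytes per 10 nsec
rdram_effective = RDRAM800_PEAK * RDRAM_BUS_UTILIZATION   # 15.2 bytes per 10 nsec

ratio = rdram_effective / sdram_effective
print(f"RDRAM 800 vs PC100 SDRAM: {ratio:.0%}")  # 380%
```

In this model PC100 SDRAM delivers only half of its peak rate, so RDRAM 800 at 95% of a peak that is twice as high ends up 3.8 times faster.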

Processor Memory Bus

Type          Width (bytes)   Clock (MHz)   Multiplier   Effective Clock (MHz)
Pentium III   8               100                        100
Pentium III   8               133                        133
Athlon        8               100           x2           200
Willamette    8               100           x4           400

Note that the Athlon is available now (although mainboards are not of equal quality) but the Willamette will not be out until the end of the year.


Copyright 1996 PCLT -- Introduction to PC Hardware -- H. Gilbert