Computing power is increasing so rapidly, any book on the subject will be obsolete before it is published. It's an author's nightmare! The original IBM PC was introduced in 1981, based around the 8088 microprocessor with a 4.77 MHz clock and an 8-bit data bus. This was followed by a new generation of personal computers being introduced every 3-4 years: 8088 → 80286 → 80386 → 80486 → 80586 (Pentium). Each of these new systems boosted the computing speed by a factor of about five over the previous technology. By 1996, the clock speed had increased to 200 MHz, and the data bus to 32 bits. With other improvements, this resulted in an increase in computing power of nearly one thousand in only 15 years! You should expect another factor of one thousand in the next 15 years.
The only way to obtain up-to-date information in this rapidly changing field is directly from the manufacturers: advertisements, specification sheets, price lists, etc. Forget books for performance data; look in magazines and your daily newspaper. Expect that raw computational speed will more than double every two years. Learning about the current state of computer power is simply not enough; you need to understand and track how it is evolving.
Keeping this in mind, we can jump into an overview of how execution speed is limited by computer hardware. Since computers are composed of many subsystems, the time required to execute a particular task will depend on two primary factors: (1) the speed of the individual subsystems, and (2) the time it takes to transfer data between these blocks. Figure 4-5 shows a simplified diagram of the most important speed limiting components in a typical personal computer. The Central Processing Unit (CPU) is the heart of the system. As previously described, it consists of a dozen or so registers, each capable of holding 32 bits (in present generation personal computers). Also included in the CPU is the digital electronics needed for rudimentary operations, such as moving bits around and fixed point arithmetic.
More involved mathematics is handled by transferring the data to a special hardware circuit called a math coprocessor (also called an arithmetic logic unit, or ALU). The math coprocessor may be contained in the same chip as the CPU, or it may be a separate electronic device. For example, the addition of two floating point numbers would require the CPU to transfer 8 bytes (4 for each number) to the math coprocessor, and several bytes that describe what to do with the data. After a short computational time, the math coprocessor would pass four bytes back to the CPU, containing the floating point number that is the sum. The most inexpensive computer systems don't have a math coprocessor, or provide it only as an option. For example, the 80486DX microprocessor has an internal math coprocessor, while the 80486SX does not. These lower performance systems replace hardware with software. Each of the mathematical functions is broken into elementary binary operations that can be handled directly within the CPU. While this provides the same result, the execution time is much slower, say, a factor of 10 to 20.
Most personal computer software can be used with or without a math coprocessor. This is accomplished by having the compiler generate machine code to handle both cases, all stored in the final executable program. If a math coprocessor is present on the particular computer being used, one section of the code will be run. If a math coprocessor is not present, the other section of the code will be used. The compiler can also be directed to generate code for only one of these situations. For example, you will occasionally find a program that requires that a math coprocessor be present, and will crash if run on a computer that does not have one. Applications such as word processing usually do not benefit from a math coprocessor. This is because they involve moving data around in memory, not the calculation of mathematical expressions. Likewise, calculations involving fixed point variables (integers) are unaffected by the presence of a math coprocessor, since they are handled within the CPU. On the other hand, the execution speed of DSP and other computational programs using floating point calculations can be an order of magnitude different with and without a math coprocessor.
The CPU and main memory are contained in separate chips in most computer systems. For obvious reasons, you would like the main memory to be very large and very fast. Unfortunately, this makes the memory very expensive. The transfer of data between the main memory and the CPU is a very common bottleneck for speed. The CPU asks the main memory for the binary information at a particular memory address, and then must wait to receive the information. A common technique to get around this problem is to use a memory cache. This is a small amount of very fast memory used as a buffer between the CPU and the main memory. A few hundred kilobytes is typical. When the CPU requests the main memory to provide the binary data at a particular address, high speed digital electronics copies a section of the main memory around this address into the memory cache. The next time that the CPU requests memory information, it is very likely that it will already be contained in the memory cache, making the retrieval very rapid. This is based on the fact that programs tend to access memory locations that are nearby neighbors of previously accessed data. In typical personal computer applications, the addition of a memory cache can improve the overall speed by several times. The memory cache may be in the same chip as the CPU, or it may be an external electronic device.
The rate at which data can be transferred between subsystems depends on the number of parallel data lines provided, and the maximum rate that digital signals can be passed along each line. Digital data can generally be transferred at a much higher rate within a single chip as compared to transferring data between chips. Likewise, data paths that must pass through electrical connectors to other printed circuit boards (i.e., a bus structure) will be slower still. This is a strong motivation for stuffing as much electronics as possible inside the CPU.
A particularly nasty problem for computer speed is backward compatibility. When a computer company introduces a new product, say a data acquisition card or a software program, they want to sell it into the largest possible market. This means that it must be compatible with most of the computers currently in use, which could span several generations of technology. This frequently limits the performance of the hardware or software to that of a much older system. For example, suppose you buy an I/O card that plugs into the bus of your 200 MHz Pentium personal computer, providing you with eight digital lines that can transmit and receive data one byte at a time. You then write an assembly program to rapidly transfer data between your computer and some external device, such as a scientific experiment or another computer. Much to your surprise, the maximum data transfer rate is only about 100,000 bytes per second, more than one thousand times slower than the microprocessor clock rate! The villain is the ISA bus, a technology that is backward compatible with the computers of the early 1980s.
Table 4-6 provides execution times for several generations of computers. Obviously, you should treat these as very rough approximations. If you want to understand your system, take measurements on your system. It's quite easy; write a loop that executes a million of some operation, and use your watch to time how long it takes. The first three systems, the 80286, 80486, and Pentium, are the standard desktop personal computers of 1986, 1993 and 1996, respectively. The fourth is a 1994 microprocessor designed especially for DSP tasks, the Texas Instruments TMS320C40.

The Pentium is faster than the 80286 system for four reasons: (1) the greater clock speed, (2) more lines in the data bus, (3) the addition of a memory cache, and (4) a more efficient internal design, requiring fewer clock cycles per instruction.
If the Pentium was a Cadillac, the TMS320C40 would be a Ferrari: less comfort, but blinding speed. This chip is representative of several microprocessors specifically designed to decrease the execution time of DSP algorithms. Others in this category are the Intel i860, AT&T DSP3210, Motorola DSP96002, and the Analog Devices ADSP-2171. These often go by the name DSP microprocessor, or RISC (Reduced Instruction Set Computer). This last name reflects that the increased speed results from fewer assembly level instructions being made available to the programmer. In comparison, more traditional microprocessors, such as the Pentium, are called CISC (Complex Instruction Set Computer).
DSP microprocessors are used in two ways: as slave modules under the control of a more conventional computer, or as an embedded processor in a dedicated application, such as a cellular telephone. Some models only handle fixed point numbers, while others can work with floating point. The internal architecture used to obtain the increased speed includes: (1) lots of very fast cache memory contained within the chip, (2) separate buses for the program and data, allowing the two to be accessed simultaneously (called a Harvard Architecture), (3) fast hardware for math calculations contained directly in the microprocessor, and (4) a pipeline design.
A pipeline architecture breaks the hardware required for a certain task into several successive stages. For example, the addition of two numbers may be done in three pipeline stages. The first stage of the pipeline does nothing but fetch the numbers to be added from memory. The only task of the second stage is to add the two numbers together. The third stage does nothing but store the result in memory. If each stage can complete its task in a single clock cycle, the entire procedure will take three clock cycles to execute. The key feature of the pipeline structure is that another task can be started before the previous task is completed. In this example, we could begin the addition of another two numbers as soon as the first stage is idle, at the end of the first clock cycle. For a large number of operations, the speed of the system will be quoted as one addition per clock cycle, even though the addition of any two numbers requires three clock cycles to complete. Pipelines are great for speed, but they can be difficult to program. The algorithm must allow a new calculation to begin, even though the results of previous calculations are unavailable (because they are still in the pipeline).