Computing power is increasing so rapidly, any book on the subject will be obsolete before it is published. It's an author's nightmare! The original IBM PC was introduced in 1981, based around the 8088 microprocessor with a 4.77 MHz clock and an 8-bit data bus. This was followed by a new generation of personal computers being introduced every 3-4 years: 8088 → 80286 → 80386 → 80486 → 80586 (Pentium). Each of these new systems boosted the computing speed by a factor of about five over the previous technology. By 1996, the clock speed had increased to 200 MHz, and the data bus to 32 bits. With other improvements, this resulted in an increase in computing power of nearly one thousand in only 15 years! You should expect another factor of one thousand in the next 15 years.
The only way to obtain up-to-date information in this rapidly changing field is directly from the manufacturers: advertisements, specification sheets, price lists, etc. Forget books for performance data; look in magazines and your daily newspaper. Expect that raw computational speed will more than double every two years. Learning about the current state of computer power is simply not enough; you need to understand and track how it is evolving.
Keeping this in mind, we can jump into an overview of how execution speed is limited by computer hardware. Since computers are composed of many subsystems, the time required to execute a particular task will depend on two primary factors: (1) the speed of the individual subsystems, and (2) the time it takes to transfer data between these blocks. Figure 4-5 shows a simplified diagram of the most important speed limiting components in a typical personal computer. The Central Processing Unit (CPU) is the heart of the system. As previously described, it consists of a dozen or so registers, each capable of holding 32 bits (in present generation personal computers). Also included in the CPU is the digital electronics needed for rudimentary operations, such as moving bits around and fixed point arithmetic.
More involved mathematics is handled by transferring the data to a special hardware circuit called a math coprocessor (also called an arithmetic logic unit, or ALU). The math coprocessor may be contained in the same chip as the CPU, or it may be a separate electronic device. For example, the addition of two floating point numbers would require the CPU to transfer 8 bytes (4 for each number) to the math coprocessor, and several bytes that describe what to do with the data. After a short computational time, the math coprocessor would pass four bytes back to the CPU, containing the floating point number that is the sum. The most inexpensive computer systems don't have a math coprocessor, or provide it only as an option. For example, the 80486DX microprocessor has an internal math coprocessor, while the 80486SX does not. These lower performance systems replace hardware with software. Each of the mathematical functions is broken into elementary binary operations that can be handled directly within the CPU. While this provides the same result, the execution time is much slower, say, a factor of 10 to 20.
Most personal computer software can be used with or without a math coprocessor. This is accomplished by having the compiler generate machine code to handle both cases, all stored in the final executable program. If a math coprocessor is present on the particular computer being used, one section of the code will be run. If a math coprocessor is not present, the other section of the code will be used. The compiler can also be directed to generate code for only one of these situations. For example, you will occasionally find a program that requires that a math coprocessor be present, and will crash if run on a computer that does not have one. Applications such as word processing usually do not benefit from a math coprocessor. This is because they involve moving data around in memory, not the calculation of mathematical expressions. Likewise, calculations involving fixed point variables (integers) are unaffected by the presence of a math coprocessor, since they are handled within the CPU. On the other hand, the execution speed of DSP and other computational programs using floating point calculations can be an order of magnitude different with and without a math coprocessor.
The CPU and main memory are contained in separate chips in most computer systems. For obvious reasons, you would like the main memory to be very large and very fast. Unfortunately, this makes the memory very expensive. The transfer of data between the main memory and the CPU is a very common bottleneck for speed. The CPU asks the main memory for the binary information at a particular memory address, and then must wait to receive the information. A common technique to get around this problem is to use a memory cache. This is a small amount of very fast memory used as a buffer between the CPU and the main memory. A few hundred kilobytes is typical. When the CPU requests the main memory to provide the binary data at a particular address, high speed digital electronics copies a section of the main memory around this address into the memory cache. The next time that the CPU requests memory information, it is very likely that it will already be contained in the memory cache, making the retrieval very rapid. This is based on the fact that programs tend to access memory locations that are nearby neighbors of previously accessed data. In typical personal computer applications, the addition of a memory cache can improve the overall speed by several times. The memory cache may be in the same chip as the CPU, or it may be an external electronic device.
The rate at which data can be transferred between subsystems depends on the number of parallel data lines provided, and the maximum rate that digital signals can be passed along each line. Digital data can generally be transferred at a much higher rate within a single chip as compared to transferring data between chips. Likewise, data paths that must pass through electrical connectors to other printed circuit boards (i.e., a bus structure) will be slower still. This is a strong motivation for stuffing as much electronics as possible inside the CPU.
A particularly nasty problem for computer speed is backward compatibility. When a computer company introduces a new product, say a data acquisition card or a software program, they want to sell it into the largest possible market. This means that it must be compatible with most of the computers currently in use, which could span several generations of technology. This frequently limits the performance of the hardware or software to that of a much older system. For example, suppose you buy an I/O card that plugs into the bus of your 200 MHz Pentium personal computer, providing you with eight digital lines that can transmit and receive data one byte at a time. You then write an assembly program to rapidly transfer data between your computer and some external device, such as a scientific experiment or another computer. Much to your surprise, the maximum data transfer rate is only about 100,000 bytes per second, more than one thousand times slower than the microprocessor clock rate! The villain is the ISA bus, a technology that is backward compatible with the computers of the early 1980s.
Table 4-6 provides execution times for several generations of computers. Obviously, you should treat these as very rough approximations. If you want to understand your system, take measurements on your system. It's quite easy; write a loop that executes a million of some operation, and use your watch to time how long it takes. The first three systems, the 80286, 80486, and Pentium, are the standard desktop personal computers of 1986, 1993 and 1996, respectively. The fourth is a 1994 microprocessor designed especially for DSP tasks, the Texas Instruments TMS320C40.

The Pentium is faster than the 80286 system for four reasons: (1) the greater clock speed, (2) more lines in the data bus, (3) the addition of a memory cache, and (4) a more efficient internal design, requiring fewer clock cycles per instruction.
If the Pentium was a Cadillac, the TMS320C40 would be a Ferrari: less comfort, but blinding speed. This chip is representative of several microprocessors specifically designed to decrease the execution time of DSP algorithms. Others in this category are the Intel i860, AT&T DSP3210, Motorola DSP96002, and the Analog Devices ADSP-2171. These often go by the name DSP microprocessor, or RISC (Reduced Instruction Set Computer). This last name reflects that the increased speed results from fewer assembly level instructions being made available to the programmer. In comparison, more traditional microprocessors, such as the Pentium, are called CISC (Complex Instruction Set Computer).
DSP microprocessors are used in two ways: as slave modules under the control of a more conventional computer, or as an embedded processor in a dedicated application, such as a cellular telephone. Some models only handle fixed point numbers, while others can work with floating point. The internal architecture used to obtain the increased speed includes: (1) lots of very fast cache memory contained within the chip, (2) separate buses for the program and data, allowing the two to be accessed simultaneously (called a Harvard Architecture), (3) fast hardware for math calculations contained directly in the microprocessor, and (4) a pipeline design.
A pipeline architecture breaks the hardware required for a certain task into several successive stages. For example, the addition of two numbers may be done in three pipeline stages. The first stage of the pipeline does nothing but fetch the numbers to be added from memory. The only task of the second stage is to add the two numbers together. The third stage does nothing but store the result in memory. If each stage can complete its task in a single clock cycle, the entire procedure will take three clock cycles to execute. The key feature of the pipeline structure is that another task can be started before the previous task is completed. In this example, we could begin the addition of another two numbers as soon as the first stage is idle, at the end of the first clock cycle. For a large number of operations, the speed of the system will be quoted as one addition per clock cycle, even though the addition of any two numbers requires three clock cycles to complete. Pipelines are great for speed, but they can be difficult to program. The algorithm must allow a new calculation to begin, even though the results of previous calculations are unavailable (because they are still in the pipeline).