- Ever-improving hardware makes software optimization unimportant
(1) If you don't optimize your code, but your competition does, then they end up with a faster solution, independent of hardware improvements. (2) Expectations of hardware performance improvements almost always exceed reality. PC disk performance has remained relatively stable for the past 5 years, and memory technology has been driven more by quantity than performance. (3) Hardware improvements often only target optimal software solutions.
- Using tables always beats recalculating
This is not always true. Remember that in recalculating, you have the potential of using parallelism and incremental calculation with the right formulations. Tables that are too large will not fit in your cache and hence may be very slow to access. If your table is multi-indexed, then there is a hidden multiply which can be costly if the row size is not a power of 2. On a fast Pentium, an uncached memory access can take more time than a divide.
- My C compiler can't be much worse than 10% off of optimal assembly on my code
This is something some people can only be convinced of by experience. The cogitations and bizarre operator fumbling that you can do in the C language convince many that they are there for the purposes of optimization. Modern C compilers usually unify C's complex operators, semantics and syntax into a much simpler format before proceeding to the optimization and code generation phase. ANSI C only addresses a subset of your computer's capabilities and is in and of itself too generic in specification to take advantage of all of your processor's nuances. Remember ANSI C does not know the difference between cached and uncached memory. Also many arithmetic redundancies allow for usage of processor features that C compilers to date have not yet mastered (e.g., there are clever tricks for host based texture mapping if the stride of the source texture is a power of two, or in particular 256 on an x86.)
Unfortunately, many research based and workstation programmers, as well as professors of higher education, who might even know better, have taken it upon themselves to impress upon newbie programmers to avoid assembly language programming at all costs; all in the name of maintainability and portability. A blatant example of this can be seen in the POV-Ray FAQ, which outright claims that there is no benefit to be had in attempting assembly language optimizations. (I wouldn't be surprised if you couldn't simply low level optimize POV-Ray, change the interface, and turn around and sell the darn thing!) The fact is, low-level optimization has its place and should only be passed by if there is a conflicting requirement (like portability), there is no need, or there are no resources to do it. For more, see High level vs. Low level below.
- Using C compilers makes it impossible to optimize code for performance
Most C compilers come with an "inline assembly" feature that allows you to roll your own opcodes. Most also come with linkers that allow you to link completely external assembly modules. Of course not all C compilers are created equal and the effects of mixing C and assembly will vary depending on the compiler implementation. (Example: WATCOM and DJGPP mix ASM in very smoothly, whereas VC++ and Borland do not.)
Modern C compilers will do a reasonable job if they are given assistance. I usually try to break my inner loops down into the most basic expressions possible, ones that are as analogous to low level assembly as possible, without resorting to inline assembly. Again your results will vary from compiler to compiler. (The WATCOM C/C++ compiler can be helped significantly with this sort of approach.)
- Compiled bitmaps are the fastest way to plot graphics
This method replaces each bitmap graphics source data word with a specific CPU instruction to store it straight to graphics memory. The problem with it is that it chews up large amounts of instruction cache space. This is to be compared against a data copying routine which needs to read the source data from memory (and typically caches it.) Both use lots of cache space, but the compiled bitmap method uses far more, since it must encode a CPU store command for each source data word.
Furthermore, CPU performance is usually more sensitive to instruction cache performance than it is to data cache performance. The reason is that data manipulations and resource contentions can be managed by write buffers and modern CPUs' ability to execute instructions out of order. With instruction data, if they are not in the cache, they must be prefetched, paying non-overlapping sequential penalties whenever the pre-fetch buffer runs out.
On older x86's this method worked well because the instruction prefetch penalties were paid on a per instruction basis regardless (there was no cache to put them into!) But starting with the 486, this was no longer a sensible solution since short loops paid no instruction prefetch penalties, which rendered the compiled bitmap technique completely useless.
- Using the register keyword in strategic places in C will improve performance substantially
This keyword is a complete placebo in most modern C compilers. Keep in mind that K&R and the ANSI committee did not design the C language to embody all of the performance characteristics of your CPU. The bulk of the burden of optimizing your C source is in the hands of your compiler's optimizer, which will typically have its own ideas about what variables should go where. If you are interested in the level of optimization available by hand-assigning variables to registers, you are better off going to hand rolled assembly, rather than relying on these kinds of language features.
(Addenda: The only real purpose of "register" is to assert to the compiler that an auto variable is never addressed and therefore can never alias with any pointer. While this might be able to assist the compiler's optimizer, a good optimizing compiler is more than capable of deducing this feature of a local by itself.)
- Globals are faster than locals
Most modern C compilers will alias local variables to your CPU's registers or SRAM. Furthermore, if all variables in a given scope are local, then an optimizing compiler can forgo maintaining the variable outside the scope, and therefore has more simplification optimization opportunities than with globals. So, in fact, you should find the opposite tends to be true more of the time.
- Using smaller data types is faster than larger ones
The original reason int was put into the C language was so that the fastest data type on each platform remained abstracted away from the programmer himself. On modern 32 and 64 bit platforms, small data types like chars and shorts actually incur extra overhead when converting to and from the default machine word sized data type.
On the other hand, one must be wary of cache usage. Using packed data (and in this vein, small structure fields) for large data objects may pay larger dividends in global cache coherence than local algorithmic optimization issues.
- Fixed point always beats floating point for performance
Most modern CPUs have a separate floating point unit that will execute in parallel to the main/integer unit. This means that you can simultaneously do floating point and integer calculations. While many processors can perform high throughput multiplies (the Pentium being an exception), general divides and modulos that are not a power of two are slow to execute (from Cray supercomputers right on down to 6502's; nobody has a really good algorithm to perform them in general.) Parallelism (via the usually undertaxed concurrent floating point units in many processors) and redundancy are often better bets than going to fixed point.
On the redundancy front, if you are dividing or calculating a modulo and you know the divisor is fixed, or one of only a few possible fixed values, there are ways to exploit fast integer (aka fixed point) methods.
On the Pentium, the biggest concern is moving data around to and from the FPU and the main integer unit. Optimizing FPU usage takes careful programming; no x86 compiler I have seen does a particularly good job of this. To exploit maximum optimization potential, you are likely going to have to go to assembly language. As a rule of thumb: if you need many simple results as fast as possible, use fixed point; if you need only a few complicated results, use floating point. See Pentium Optimizations by Agner Fog for more information.
With the introduction of AMD's 3DNOW! SIMD floating point technology, these older rules about floating point performance have been turned upside down. Approximate (14/15 bits) divides or reciprocal square roots can be computed in a single clock. Two multiplies and two adds can also be computed per clock, allowing better than 1 gigaflop of peak performance. We are now at a point in the industry where floating point performance is truly matching the integer performance. With such technologies the right answer is to use the data type format that most closely matches its intended meaning.
- Performance optimization is achieved primarily by counting the cycles of the assembly code
You usually get a lot more mileage out of optimizing your code at a high level (not meaning to imply that you need a HLL to do this) first. At the very least, changes to the high level source will tend to affect more target code at one time than what you will be able to do in assembly language with the same effort. In more extreme cases, such as exponential (usually highly recursive) algorithms, thorough hand optimization often buys you significantly less than good up front design.
The cycle counts given in processor instruction lists are usually misleading about the real cycle expenditure of your code. They usually ignore the wait states required to access memory or other devices (that usually have their own independent clocks.) They are also typically misleading with regards to hidden side effects of branch targets, pipelining and parallelism.
The Pentium can take up to 39 clocks to perform a floating point divide. However, the instruction can be issued in one clock, 39 clocks of integer calculations can then be done in parallel, and the result of the divide then retrieved in another clock. About 41 clocks in total, but only two of those clocks are actually spent issuing and retrieving the results for the divide itself.
In fact, all modern x86 processors have internal clock timers that can be used to assist in getting real timing results, and Intel recommends that programmers use them to get accurate timing results. (See the RDTSC instruction as documented in Intel's processor architectural manuals.)
- Assembly programming is only done in DOS, there's no need under Windows
The benefits of assembly over C are the same under Windows or Linux as they are under DOS. This delusion doesn't have anything close to logic backing it up, and therefore doesn't deserve much comment.
See Iczelion's Win32 Assembly Home Page if you don't believe me.
- Complete optimization is impossible; there is always room left to optimize, thus it is pointless to sustain too much effort in pursuit of it
This is not a technical belief -- it's a marketing one. It's one often heard from the folks that live in Redmond, WA. Even to the degree that it is true (in a very large software project, for example) it ignores the fact that optimal performance can be approached asymptotically with a finite, and usually acceptable, amount of effort. Using proper profiling and benchmarking one can iteratively grab the "low hanging fruit" which will get most of the available performance.
Absolute optimization is also not a completely unattainable goal. Understanding the nature of your task, and bounding it by its input/output performance and the best possible algorithm in the middle, in many cases is not an undoable task.
For example, reading a file from disk, sorting its contents and writing the result back out ought to be a very doable performance optimization exercise. (The input/output performance is known, and the algorithm in the middle is approachable by considering the nature of the input and going with a standard algorithm such as heap sort or radix sort.)
Of course, the degree of analysis that you can apply to your specific problem will vary greatly depending on its nature. My main objection to this misconception is just that it cannot be applied globally.