Paul Hsieh's programming resources
© Copyright 1996-2016 by Paul Hsieh








It should be pointed out that, over time, the idea of code optimization has evolved to include "execution profiling" (i.e., direct measurement of "hotspots" in the code from a test run) as its guiding strategy. To understand this and put it into perspective, let us first break down the performance of a program as follows:

    T(program) = t(task 1) + t(task 2) + ... + t(task n)

where

    t(task) = (number of times the task is executed) × (average time per execution)

What is typically seen is that the total is usually dominated by one or very few of the terms, because either those tasks are where there is the most work to do, or their algorithms are complicated enough to have a low rate of execution. This effect tends to be recursive in the sense of breaking tasks down into sub-tasks, which has led to the creation and adoption of line-by-line execution profilers.

What one should also notice is that if the strategy for improving the performance of code is to improve the efficiency of the code for one particular task, it is possible to reach a point of diminishing returns, because each term cannot go below 0 in its overall contribution to the time consumption. So an optimization exercise of this sort for any particular task need not proceed past the point where the total time taken for that task is significantly below the average. This usually means that effort is better spent in switching to another task to optimize.

This, of course, is fine and a credible way of proceeding; however, by itself it leaves out algorithm analysis, as well as the possibility of trying different task breakdowns. The classic example here is that tweaking bubble sort is not going to be as useful as switching to a better algorithm, such as quicksort or heapsort, if you are sorting a lot of data.

One also has to be careful about what actually is being measured. This is epitomized by the observation that the Unix kernel typically spends most of its time in the idle loop, yet that doesn't mean there is any useful advantage in optimizing it.






  • Simple mathematical analysis. Calculate the approximate running time of your algorithm (i.e., calculate its "O") taking all bottlenecks into account. Is it optimal? Can you prove it? Can you justify your algorithmic design with theoretically known results?

  • Understanding technological performance considerations. Even at high levels, you need to have a rough understanding of what goes fast, what doesn't, and why. Examples follow:

    (1) Data bandwidth performance for PC devices is roughly ordered (slowest to fastest): user input device, tape drives, network, CDROM, hard drive, memory mapped local bus device (graphics memory), uncached main memory, externally cached main memory, local/CPU cached memory, local variables (registers.)

    The usual approach to improving data bandwidth performance is to use a faster device to cache the data of slower devices, allowing pre-emption of slower device accesses when the access is redundant. For example, to know where you are pointing your mouse, your mouse interface is likely to just look up a pair of coordinates saved to memory, rather than poll the mouse directly every time. This general idea is probably what inspired Terje Mathisen (a well-known programming optimization guru) to say: "All programming is an exercise in caching."

    (2) Arithmetic operation performance is ordered roughly (slowest to fastest): transcendental functions, square root, modulo, divide, multiply, then add/subtract/multiply by a power of 2/divide by a power of 2/modulo by a power of 2.

    Nearly all programs have to do some rudimentary calculations of some kind. Simplifying your formulae to use faster functions is a very typical optimization opportunity. The integer versus floating point differences are usually very platform dependent, but in modern microprocessors the performance of each is usually quite similar. Knowing how data bandwidth and calculations perform relative to each other is also very important. For example, using tables to avoid certain recalculations is often a good idea, but you have to weigh data speed against recalculation carefully; large tables will typically be uncached and may perform even slower than a CPU divide.

    (3) Control flow is ordered roughly (slowest to fastest): indirect function calls, switch() statements, fixed function calls, if() statements, while() statements.

    But a larger additional concern with modern pipelined processors is the "predictability" of a control transfer. Modern processors essentially guess the direction of a control transfer, and back out if and when they realize that they are wrong (throwing away whatever incorrect work they have done.) Incorrect predictions are very costly, so one must be aware of arithmetically equivalent ways of performing conditional computation.

  • Localized tricks/massaging. Are your results computed by complex calculations at each stage, when an incremental approach could produce results faster, or vice versa? Example: What is the fastest way to compute the nth Fibonacci number? What is the fastest way to compute the first n Fibonacci numbers? In the former case, there is a simple formula involving an exponent calculation. In the latter case, the recursive definition allows for an iterative approach which only requires an addition and variable shuffling at each stage.

  • Parallelism. Although the typical programming model is a single threaded one, sometimes the width of the data stream is a lot smaller than the naturally supported data types. More concretely, if your algorithm operates on bytes, you may be able to operate on 2, 4 or 8 of them simultaneously by using word based instructions that are supported by your CPU. Alternatively, does your target system have multiple execution units? For example, do you have a graphics coprocessor which can draw in parallel to ordinary CPU computation, and do you have an API for it?

  • Taking a global view. At some stage, you should take a step back and look at the entire flow of your algorithms. Are you using a general architecture to solve a problem that could be dealt with more efficiently by a specific architecture? For that matter, does your procedural breakdown allow for easy analysis of performance in the first place (see profiling below)?




Suppose you wish to create an arcade style fast action game on a PC. You know that optimization will be an important consideration from what you've heard of how established game companies do this sort of thing, and from common knowledge on the net. If not thought through correctly, this can lead you into all sorts of strange and questionable decisions. The most typical mistake is believing that you must write your program in assembler to take maximal advantage of the CPU's performance.

Action video games are multimedia applications. They require graphics, sound, possibly a network connection, and user input. In a PC, the hardware devices that support these features are completely external to the CPU. In fact, the algorithmic issues of artificial intelligence, collision detection, keeping score, and keeping track of time are ordinarily a very low performance hit on a good CPU (like a Pentium.) Understanding your multimedia devices will be far more helpful than knowing how to optimize every cycle out of your CPU. This naturally leads to the recommendation of using a compiler for a high-level language like C, and using flexible device libraries.

Designing the interaction with the hardware devices will be much more important in terms of the quality of your game, in more ways than just speed. The wide variety of target hardware devices on your audience's PCs can make this a complicated problem. However, it is a good bet that using asynchronous APIs will be the most important consideration. For example, using potentially available coprocessors for parallel/accelerated graphics rendering is likely to be a better alternative to CPU based pixel filling. If your audio card support can use large buffered sound samples, you again will be better off than hand holding tiny wave buffers. User input should be handled by a state mechanism rather than polling until some desired state occurs (DOOM versus NetHack.) Relatively speaking, the network ordinarily runs at such a slow pace as to justify running a separate thread to manage it. Again, state based network management is better than polling based.



Staying on the subject of games, I was recently working on a special assembly language based technology optimization project for a 3D shoot 'em up action game. Now, the original authors of this game have a reputation for being highly skilled programmers, so when I was first presented with the source code I fully expected it to be difficult to find further performance optimizations over what they had already done.

But what I saw did not impress me. I found I had plenty of room to work on optimizing the original source code, before making special assembly language optimizations. I was able to apply many common tricks that the compiler could not find on its own: obvious sub-expression elimination, hoisting (factoring), tail recursion elimination, and impossible-condition dead code elimination.

Well, in any event, the program stood to gain from high-level improvements that proved to be a valuable aid to the low-level improvements added afterward via analyzing the compiler output. Had I done this in the reverse order, I would have duplicated massive amounts of inefficiencies that I would have spent much longer isolating, or worse, may have missed entirely.



I was recently working on some performance sensitive code based on the reference source from an enormous software vendor based in Redmond, WA. I was shocked to see that they were making fundamental errors in terms of performance optimization. I am talking about simple things like loop hoisting. It got me to thinking that even so-called "professional programmers" charged with the goal of optimizing code can miss really basic optimizations.

This is what I am talking about:

    /* Set our cap according to some variable condition */
    lpDRV->dpCAPS |= x;
    ...
    /* This is an important inner loop */
    for(i=0; i<n; i++)
    {
        if( (lpDRV->dpCAPS) & CLIPPING )
        {
            DoOneThing(i);
        }
        else if( (lpDRV->dpCAPS) & ALREADYCULLED )
        {
            DoSomethingElse(i);
        }
        else
        {
            DoYetAnotherThing(i);
        }
    }

Now, assuming that there are no side effects which modify the lpDRV->dpCAPS variable within the DoXXX() functions, doesn't the above code scream at you to do some basic control transformations? I mean, hoist the damn "if" statements to the outside of the loop for crying out loud!

    /* Set our cap according to some variable condition */
    lpDRV->dpCAPS |= x;
    ...
    /* This is an important inner loop */
    if( (lpDRV->dpCAPS) & CLIPPING )
    {
        for(i=0; i<n; i++)
        {
            DoOneThing(i);
        }
    }
    else if( (lpDRV->dpCAPS) & ALREADYCULLED )
    {
        for(i=0; i<n; i++)
        {
            DoSomethingElse(i);
        }
    }
    else
    {
        for(i=0; i<n; i++)
        {
            DoYetAnotherThing(i);
        }
    }

Now, let's do the math. How much are the if statements costing us? In the first loop it's n times the overhead for all the if-else logic. In the second loop it's 1 times the overhead for all the if-else logic. So in the cases where n<1 the company in Redmond wins, but in the cases where n>1 I win. This is not rocket science. I swear, if that enormous Redmond based company only understood really basic elementary kindergarten optimizations like this, the world would be a much better place.






  • Profiling. One way or another, you have to *know* where your performance bottlenecks are. Good compilers will come with comprehensive tools for performance measurement which make this job easier. But even with the best tools, care is often required in analyzing profiling information when algorithms become complex. For example, most UNIX kernels will spend the vast majority of their processor cycles in the "idle-loop". Clearly, no amount of optimization of the "idle-loop" will have any impact on performance.

  • Disassembling. When writing in high-level languages there can be huge complexity differences resulting from subtle source code changes. While experience can help you out a lot in this area, generally the only way to be sure is to actually produce an assembly listing of your object code. Typically one will see things like: multiplications for address generation when a shift could do just as well if only the array limits were a power of 2; complex "for" loops may generate an extra jump to the condition evaluator to conserve on code size, when a do .. while loop is just as effective but wouldn't generate the extra jump. One can also see the impact of declaring variables or functions as "static". (With the WATCOM C/C++ compiler I've noticed that function declarations will only "inline" if they are static, but that local variables use shorter instructions if they are not static.)

  • Using CPU/platform specific features (even in cross-platform situations.) How you deal with this is obviously CPU dependent; however, here are some things to watch out for: burst writes, memory access constraints (cache coherency), branch prediction, pipelining, counting down rather than up, side-effects, multiple execution units. With flexible optimizing compilers, many of these effects can be achieved by merely "massaging" your high level source code. The theory that you can inline assembly to make any routine faster is not always true of the better, modern optimizing compilers (C compilers anyway.)


Continuing with the second example from the high level examples above, I found myself faced with the problem of optimizing a bunch of floating point code. Now, given that the platform was an x86 based PC, one of the things I was struck by was an unusually high propensity of if-statements that looked something like this:

    float x[MAX_ENTRIES];
    ...
    if( x[i] ) {
        ....
    }

Well, unfortunately, the compiler doesn't know what I know: that all x86 processors (this was originally written before the Athlon or Pentium 4 processor -- certainly with the Athlon processor the technique used below will make no difference in performance) are very slow at converting floating point data types into integer data types, including the condition codes which are required for branching.

Fortunately, memory is quickly accessible from both floating point and integer. So, relying on x being in memory, what I needed to do was match the bit pattern of each x[i] against 0. There is a slight wrinkle here in that the IEEE specification says that both 0 and -0 can be distinctly represented, but for all computation purposes they must perform identically.

Well, I don't want to get into the details of the IEEE spec here, so I'll just show you the improvement:

    float x[MAX_ENTRIES];
    long temp;
    ...
    temp = *((long *)&(x[i]));
    if( temp + temp ) {
        ....
    }

This idea is non-portable; it's a 32 bit IEEE specific trick. But for the substantial performance increase it gives, it is impossible to ignore.

Here's one sent in to me from "cdr":

A while ago I wrote a simple routine to draw textured vertical walls DOOM-style. Because the Z coordinate stays constant along the Y direction, my routine drew by columns. I was distantly aware that this made my memory accesses less linear, and therefore slower, but I figured that it wasn't that important since my K6's 32K L1 data cache was big enough to fit half my texture in. When the frame rates that resulted were lower than what my original 33MHz 486 got under DOOM, I was rather startled and disappointed, but looking through the code I couldn't see anything that should improve speed by more than a factor of 3 or 4. Wondering if my compiler had gone berserk, or if I was just a bad programmer, I started optimizing. The second optimization I performed was to switch the X and Y coordinates when reading from my texture, which used the standard 2-dimensional array format. Suddenly the performance went up 10-fold! Analyzing the access patterns, I realized that previously *every* read from my texture had been a cache miss!

Memory in modern computers is most definitely NOT "random access".



Click here for some more x86 based examples.





  • Ever improving hardware makes software optimization unimportant

    (1) If you don't optimize your code, but your competition does, then they end up with a faster solution, independent of hardware improvements. (2) Expectations of hardware performance improvements almost always exceed reality. PC disk performance has remained relatively stable for the past 5 years, and memory technology has been driven more by quantity than performance. (3) Hardware improvements often only target optimal software solutions.

  • Using tables always beats recalculating

    This is not always true. Remember that recalculating has the potential of using parallelism and incremental calculation with the right formulations. Tables that are too large will not fit in your cache and hence may be very slow to access. If your table is multi-indexed, then there is a hidden multiply, which can be costly if the row size is not a power of 2. On a fast Pentium, an uncached memory access can take more time than a divide.

  • My C compiler can't be much worse than 10% off of optimalassembly on my code

    This is something some people can only be convinced of by experience. The cogitations and bizarre operator fumbling that you can do in the C language convince many that they are there for the purposes of optimization. Modern C compilers usually unify C's complex operators, semantics and syntax into a much simpler format before proceeding to the optimization and code generation phase. ANSI C only addresses a subset of your computer's capabilities and is in and of itself too generic in specification to take advantage of all of your processor's nuances. Remember that ANSI C does not know the difference between cached and uncached memory. Also, many arithmetic redundancies allow for usage of processor features that C compilers to date have not yet mastered (e.g., there are clever tricks for host based texture mapping if the stride of the source texture is a power of two, or in particular 256, on an x86.)

    Unfortunately, many research based and workstation programmers, as well as professors of higher education, who might even know better, have taken it upon themselves to impress upon newbie programmers that assembly language programming should be avoided at all costs, all in the name of maintainability and portability. A blatant example of this can be seen in the POV-Ray FAQ, which outright claims that there is no benefit to be had in attempting assembly language optimizations. (I wouldn't be surprised if you couldn't simply low-level optimize POV-Ray, change the interface, and turn around and sell the darn thing!) The fact is, low-level optimization has its place, and should only be passed by if there is a conflicting requirement (like portability), there is no need, or there are no resources to do it. For more, see High level vs. Low level below.

  • Using C compilers makes it impossible to optimize code for performance

    Most C compilers come with an "inline assembly" feature that allows you to roll your own opcodes. Most also come with linkers that allow you to link completely external assembly modules. Of course, not all C compilers are created equal, and the effects of mixing C and assembly will vary depending on the compiler implementation. (Example: WATCOM and DJGPP mix ASM in very smoothly, whereas VC++ and Borland do not.)

    Modern C compilers will do a reasonable job if they are given assistance. I usually try to break my inner loops down into the most basic expressions possible, as analogous to low level assembly as possible, without resorting to inline assembly whenever possible. Again, your results will vary from compiler to compiler. (The WATCOM C/C++ compiler can be helped significantly with this sort of approach.)

  • Compiled bitmaps are the fastest way to plot graphics

    This method replaces each bitmap graphics source data word with a specific CPU instruction to store it straight to graphics memory. The problem with it is that it chews up large amounts of instruction cache space. This is to be compared against a data copying routine which needs to read the source data from memory (and typically caches it.) Both use lots of cache space, but the compiled bitmap method uses far more, since it must encode a CPU store command for each source data word.

    Furthermore, CPU performance is usually more sensitive to instruction cache performance than it is to data cache performance. The reason is that data manipulations and resource contentions can be managed by write buffers and the modern CPU's ability to execute instructions out of order. With instruction data, if the instructions are not in the cache, they must be prefetched, paying non-overlapping sequential penalties whenever the pre-fetch buffer runs out.

    On older x86's this method worked well because the instruction prefetch penalties were paid on a per instruction basis regardless (there was no cache to put them into!) But starting with the 486, this was no longer a sensible solution, since short loops paid no instruction prefetch penalties, which rendered the compiled bitmap technique completely useless.

  • Using the register keyword in strategic places in C will improve performance substantially

    This keyword is a complete placebo in most modern C compilers. Keep in mind that K&R and the ANSI committee did not design the C language to embody all of the performance characteristics of your CPU. The bulk of the burden of optimizing your C source is in the hands of your compiler's optimizer, which will typically have its own ideas about which variables should go where. If you are interested in the level of optimization available by hand-assigning register variables, you are better off going to hand rolled assembly, rather than relying on these kinds of language features.

    (Addenda: The only real purpose of "register" is to assert to the compiler that an auto variable is never addressed and therefore can never alias with any pointer. While this might be able to assist the compiler's optimizer, a good optimizing compiler is more than capable of deducing this feature of a local by itself.)

  • Globals are faster than locals

    Most modern C compilers will alias local variables to your CPU's registers or SRAM. Furthermore, if all variables in a given scope are local, then an optimizing compiler can forgo maintaining the variables outside the scope, and therefore has more simplification optimization opportunities than with globals. So, in fact, you should find the opposite tends to be true more of the time.

  • Using smaller data types is faster than larger ones

    The original reason int was put into the C language was so that the fastest data type on each platform remained abstracted away from the programmer. On modern 32 and 64 bit platforms, small data types like chars and shorts actually incur extra overhead when converting to and from the default machine word sized data type.

    On the other hand, one must be wary of cache usage. Usingpacked data (and in this vein, small structure fields) for large dataobjects may pay larger dividends in global cache coherence, than localalgorithmic optimization issues.

  • Fixed point always beats floating point for performance

    Most modern CPUs have a separate floating point unit that will execute in parallel to the main/integer unit. This means that you can simultaneously do floating point and integer calculations. While many processors can perform high throughput multiplies (the Pentium being an exception), general divides and modulos that are not by a power of two are slow to execute (from Cray supercomputers right on down to 6502's; nobody has a really good algorithm to perform them in general.) Parallelism (via the usually undertaxed concurrent floating point units in many processors) and redundancy are often better bets than going to fixed point.

    On the redundancy front, if you are dividing or calculating a modulo, and you know the divisor is fixed, or one of only a few possible fixed values, there are ways to exploit fast integer (aka fixed point) methods.

    On the Pentium, the biggest concern is moving data around to and from the FPU and the main integer unit. Optimizing FPU usage takes careful programming; no x86 compiler I have seen does a particularly good job of this. To exploit maximum optimization potential, you are likely going to have to go to assembly language. As a rule of thumb: if you need many simple results as fast as possible, use fixed point; if you need only a few complicated results, use floating point. See Pentium Optimizations by Agner Fog for more information.

    With the introduction of AMD's 3DNow! SIMD floating point technology, these older rules about floating point performance have been turned upside down. Approximate (14/15 bit) divides or reciprocal square roots can be computed in a single clock. Two multiplies and two adds can also be computed per clock, allowing better than 1 gigaflop of peak performance. We are now at a point in the industry where floating point performance is truly matching integer performance. With such technologies the right answer is to use the data type format that most closely matches its intended meaning.

  • Performance optimization is achieved primarily by counting the cycles of the assembly code

    You usually get a lot more mileage out of optimizing your code at a high level first (not meaning to imply that you need a HLL to do this). At the very least, changes to the high level source will tend to affect more target code at one time than what you will be able to do in assembly language with the same effort. In more extreme cases, such as exponential (usually highly recursive) algorithms, thorough hand optimization often buys you significantly less than good up front design.

    The cycle counts given in processor instruction lists are usually misleading about the real cycle expenditure of your code. They usually ignore the wait states required to access memory or other devices (which usually have their own independent clocks.) They are also typically misleading with regards to the hidden side effects of branch targets, pipelining and parallelism.

    The Pentium can take up to 39 clocks to perform a floating point divide. However, the instruction can be issued in one clock, 39 clocks of integer calculations can then be done in parallel, and the result of the divide can be retrieved in another clock. About 41 clocks in total, but only two of those clocks are actually spent issuing and retrieving the results of the divide itself.

    In fact, all modern x86 processors have internal clock timers that can be used to assist in getting real timing results, and Intel recommends that programmers use them to get accurate timing results. (See the RDTSC instruction as documented in Intel's processor architecture manuals.)

  • Assembly programming is only done in DOS; there's no need under Windows

    The benefits of assembly over C are the same under Windows orLinux as they are under DOS. This delusion doesn't have anything close tologic backing it up, and therefore doesn't deserve much comment.

    See Iczelion's Win32 Assembly Home Page if you don't believe me.

  • Complete optimization is impossible; there is always room left to optimize, thus it is pointless to sustain too much effort in pursuit of it

    This is not a technical belief -- it's a marketing one. It's one often heard from the folks that live in Redmond, WA. Even to the degree that it is true (in a very large software project, for example), it ignores the fact that optimal performance can be approached asymptotically with a finite, and usually acceptable, amount of effort. Using proper profiling and benchmarking one can iteratively grab the "low hanging fruit", which will get most of the available performance.

    Absolute optimization is also not a completely unattainable goal. Understanding the nature of your task, and bounding it by its input/output performance and the best possible algorithm in the middle, is in many cases not an undoable task.

    For example, reading a file from disk, sorting its contents, and writing the result back out ought to be a very doable performance optimization exercise. (The input/output performance is known, and the algorithm in the middle is approachable by considering the nature of the input and going with a standard algorithm such as heap sort or radix sort.)

    Of course, the degree of analysis that you can apply to yourspecific problem will vary greatly depending on its nature. My mainobjection to this misconception is just that it cannot be appliedglobally.




The performance of if-then-else is one taken jump no matter what. However, often the condition has a lop-sided probability, in which case other approaches should be considered. The elimination of branching is an important concern with today's deeply pipelined processor architectures. The reason is that a "mispredicted" branch often costs many cycles.

Before:
    if( Condition ) {
        Case A;
    } else {
        Case B;
    }
After:
    Case B;
    if( Condition ) {
        Undo Case B;
        Case A;
    }


Clearly this only works if Undo Case B; is possible. However, if it is, this technique has the advantage that the taken-jump case can be optimized according to the probability of Condition, and Undo Case B; Case A; might be merged together to be more optimal than executing each separately.

Obviously you would swap cases A and B depending on which way the probability goes. Also, since this optimization sacrifices the performance of one set of circumstances for another, you will need to time it to see if it is really worth it. (On processors such as the ARM or Pentium II, you can also use conditional mov instructions to achieve a similar result.)



Before:
    for(i=0;i<10;i++) {
        printf("%d\n",i*10);
    }
After:
    for(i=0;i<100;i+=10) {
        printf("%d\n",i);
    }

This one should be fairly obvious, use constant increments instead ofmultiplies if this is possible. (Believe it or not, however, some Ccompilers are clever enough to figure this out for you in some simplecases.)



Before:
    char playfield[80][25];
After:
    char playfield[80][32];

The advantage of using powers of two for all but the leftmost arraysize is when accessing the array. Ordinarily the compiled code wouldhave to compute a multiply to get the address of an indexed elementfrom a multidimensional array, but most compilers will replace aconstant multiply with a shift if it can. Shifts are ordinarily muchfaster than multiplies.



Many compilers can't optimize certain loops into the form most suitable for them. Ordinarily, most CPUs have a conditional branch mechanism that works well when counting down from a positive number to zero or negative one. The conditional clause for any for loop must be evaluated on the first pass, as with any other time through the loop, but often it is a trivial TRUE:

Before:
    for(i=0;i<100;i++) {
        map[i].visited = 0;
    }
After:
    i=99;
    do {
        map[i].visited = 0;
        i--;
    } while(i>=0);

Many compilers will see this optimization as well.



Often to conserve on space you will be tempted to mix integer datatypes; chars for small counters, shorts for slightly larger countersand only use longs or ints when you really have to. While this mayseem to make sense from a space utilization point of view, most CPUshave to end up wasting precious cycles to convert from one data typeto another, especially when preserving sign.

Before:
    char x;
    int y;

    y = x;
After:
    int x, y;

    y = x;




Doing so tells the compiler that the function need not be so general as to service arbitrary calls from unrelated modules. If the function is small enough, it may be inlined without having to maintain an external copy. If the function's address is never taken, the compiler can try to simplify and rearrange usage of it within other functions.

Before:
    void swap(int *x, int *y) {
        int t;

        t = *y;
        *y = *x;
        *x = t;
    }
After:
    static void swap(int *x, int *y) {
        int t;

        t = *y;
        *y = *x;
        *x = t;
    }




The language definition for ANSI C does not allow it to take advantage of abelian math operators for better scheduling. This will become increasingly important as CPUs in the future move towards having ALUs with larger latencies (each operation takes longer to complete, which allows for a higher CPU frequency.)

Before:
    float f[100],sum;
    ...
    /* Back to back dependent
       adds force the throughput
       to be equal to the total
       FADD latency. */
    sum = 0;
    for(i=0;i<100;i++) {
        sum += f[i];
    }
After:
    float f[100],sum0,sum1;
    float sum2,sum3,sum;
    ...
    /* Optimized for a 4-cycle
       latency fully pipelined
       FADD unit.  The throughput
       is one add per clock. */
    sum0 = sum1 = sum2 = sum3 = 0;
    for(i=0;i<100;i+=4) {
        sum0 += f[i];
        sum1 += f[i+1];
        sum2 += f[i+2];
        sum3 += f[i+3];
    }
    sum = (sum0+sum1) + (sum2+sum3);

Some compilers have switches to "trade accuracy for speed with respect to floating point". In theory this might allow them to see this sort of translation; however, I have never seen any such thing.

Also, in theory, a SIMD-enabled compiler could direct the above to a SIMD instruction set (such as 3DNow!, SSE or AltiVec.) I definitely have not seen this yet.
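Note that the four-accumulator form adds the same values, just in a different order, so the two sums agree up to floating point rounding. A quick check (using small integer values, for which the float adds happen to be exact, and assuming the count is a multiple of 4 as in the example above):

```c
/* Sequential sum: each add depends on the previous one. */
static float sum_serial(const float *f, int n)
{
    float sum = 0;
    for (int i = 0; i < n; i++)
        sum += f[i];
    return sum;
}

/* Four independent accumulators hide the FADD latency.
   Assumes n is a multiple of 4. */
static float sum_unrolled(const float *f, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += f[i];
        s1 += f[i + 1];
        s2 += f[i + 2];
        s3 += f[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```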



Techniques you might not be aware of if you have not been programming for the past 15 years of your life:


vincent@realistix.com says...

> Hi....I've got a question on optimizing critical loops.....In the context of
> C/C++, which is more expensive?
> IF statements or function calls? I need to choose between the two because of
> the rendering options that I pass to the rendering loop.

The ordering (fastest first) is roughly:

1. Well predicted if() statements (that means comparisons that follow either a very simple short pattern, or are heavily weighted, i.e., 90% or more, to one result over the other.)

2. Fixed function calls.  The address does not change.  A "static" function call will usually be faster, and the fewer the number of parameters the better (though if declared static, sometimes this doesn't matter.)

3. Indirect function calls that have a high probability of going to the same address that they went to last time.  Modern CPUs will predict that the indirect address will be the same as last time...

4. Non-predictable if() statements.  This is when the pattern of the results of the comparison in the if() statement is not cyclical, and sways back and forth in a non-simplistic pattern.  The penalty for this is *MUCH* higher than anything listed to this point so far.  A CPU like the Athlon or P-!!! throws out about 10-18 clocks of work every time it guesses wrong (which you can expect to be about 50% of the time).

5. Changing indirect function calls.  This is when the indirect function call target changes on nearly every call.  The penalty is basically the same as a non-predictable branch, plus a few extra clocks (since it takes longer to fetch an address than to compute a comparison.)

---

The first three have relatively low clock counts and allow the CPU to perform parallel work if there are any opportunities.  The last two lead the CPU to speculatively perform wrong work that needs to be undone (this is usually just a few clocks of additional overhead).  The major point of which is that the CPU waits roughly the length of its pipeline before it can perform real work.

One way to work around a situation such as 4/5 is that if there are only two cases, you actually perform the work for both cases, and use a predicate mask to select between the answers.  For example, to perform the max, min and abs functions on an x86 compiler:

    static int int_max(int a, int b) {
        b = a-b;
        a -= b & (b>>31);
        return a;
    }
    static int int_abs(int a) {
        return a - ((a+a) & (a>>31));
    }
    static int int_min(int a, int b) {
        b = b-a;
        a += b & (b>>31);
        return a;
    }

Notice that there is no branching.  Thus the predictability of whether or not one result or the other is chosen does not factor into the performance (there is no branching calculation performed.)

--
Paul Hsieh
http://www.pobox.com/~qed/optimize.html
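The predicate-mask idea generalizes to a branch-free two-way select. A minimal sketch (my own illustration, assuming two's-complement ints and a condition value of exactly 0 or 1):

```c
/* Branch-free select: returns (c ? a : b), where c must be 0 or 1.
   Negating c yields an all-ones or all-zeros mask, which then
   picks one of the two precomputed answers without a branch. */
static int select_mask(int c, int a, int b)
{
    int mask = -c;              /* c==1 -> 0xFFFFFFFF, c==0 -> 0 */
    return (a & mask) | (b & ~mask);
}
```

As with the max/min/abs helpers, both candidate values must already be computed, so this only pays off when the work per side is small and the branch would be unpredictable.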



Chris Lomont wrote:

: Where do you get your ideas about experienced coders? On RISC chips
: and many of the pipelined, superscalar CPU's a good compiler beats out
: most ASM programmers due to the complex issues for instruction
: pairing ...

Most assembly programmers are neither knowledgeable nor good, but those who are knowledgeable and good have little trouble beating compilers. Even on true RISC CPUs. A friend had an algorithm reimplemented (I'll redundantly stress that the algorithm was not changed) in PowerPC assembly by an Apple engineer. The result was a pixel doubling blitter that ran 3 to 5 times faster. I've personally seen a graphics intensive game get 5 more frames per second (25, up from 20, on a P5-100) by replacing low level graphics code compiled by VC++ 5 with handwritten assembly. Again, same algorithm, different implementation.

: ... Your hand crafted 486 asm will probably run dog slow on later
: processors with different pairing rules, while a simple recompile on a
: decent PII compiler will make that C code surpass your old asm ...

You overstate the risk. Coincidentally, I recently recompiled some code with VC++ 5, optimizing for both Pentium and PentiumPro. Comparing some 3D point transformations coded in C and Pentium assembly, the assembly code ran 30% faster than the Pentium-optimized C code on a P5-100 and 12% faster than the PentiumPro-optimized C code on a PII-300. While the older architecture's assembly code had less of an improvement on the newer architecture, it was still faster, and using such old code would not be a liability as you suggest. Perhaps when spanning more than two architectural generations, as in your 486 to PentiumPro example, there would be such a problem. But given the lifespan of typical products, especially games, that is not worth worrying about.

BTW, I'm not advocating that everything should be done in assembly. Just that claiming C compilers are hard to beat is naive, and that assembly is often a good solution at times.


More recently, I got into a discussion with my brother about how he might optimize a program written entirely in C++ based on the branching and floating point characteristics of modern x86 processors. After I did a brain dump and suggested some non-portable, assembly-motivated techniques, he said he was able to use those suggestions alone to make his program run 10 times as fast! This stuff is not trivial.

For more, see Randall Hyde's Great Debate page.
