Software pipelining

From Wikipedia, the free encyclopedia
Technique in computer programming to optimize loop execution

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand-written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

It is important to distinguish software pipelining, which is a target-code technique for overlapping loop iterations, from modulo scheduling, the currently most effective known compiler technique for generating software-pipelined loops. Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser.[1] Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo variable expansion, is widely used in practice.[2] Gao et al. formulated optimal software pipelining in integer linear programming, culminating in validation of advanced heuristics in an evaluation paper.[3] This paper has a good set of references on the topic.

Example

Consider the following loop:

for i = 1 to bignumber
  A(i)
  B(i)
  C(i)
end

In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.
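
As a concrete illustration (not part of the original example), such a loop might look like this in C, with A as a load, B as an arithmetic operation, and C as a store; the array names src and dst and the particular arithmetic are only placeholders:

/* A minimal C sketch of the abstract A/B/C loop, assuming A is a load,
   B an arithmetic operation, and C a store. The arrays src and dst and
   the operation "t * 2.0 + 1.0" are illustrative assumptions. */
void abc_loop(const double *src, double *dst, int bignumber) {
    for (int i = 0; i < bignumber; i++) {
        double t = src[i];        /* A(i): load from memory into a register */
        t = t * 2.0 + 1.0;        /* B(i): arithmetic on the loaded value   */
        dst[i] = t;               /* C(i): store the result back to memory  */
    }
}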

Without software pipelining, the operations execute in the following sequence:

A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...

Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9 cycles to complete: 3 clock cycles for A(1), 3 clock cycles for B(1), and 3 clock cycles for C(1).

Now consider the following sequence of instructions with software pipelining:

A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...

It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.
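
For instance, under the stated assumptions (3-cycle instruction latency, one dispatch per cycle), the dispatch schedule for these three iterations works out as:

cycle 1: A(1)
cycle 2: A(2)
cycle 3: A(3)
cycle 4: B(1) ; A(1) finished at the end of cycle 3
cycle 5: B(2)
cycle 6: B(3)
cycle 7: C(1) ; B(1) finished at the end of cycle 6
cycle 8: C(2)
cycle 9: C(3)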

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):

for i = 1 to (bignumber - 2) step 3
  A(i)
  A(i+1)
  A(i+2)
  B(i)
  B(i+1)
  B(i+2)
  C(i)
  C(i+1)
  C(i+2)
end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the total number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem, but note that software pipelining prevents the use of Duff's device.[citation needed]
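
One common alternative to Duff's device is to run the unrolled body for as many full groups as fit and finish the leftover iterations in a plain scalar loop. The C sketch below illustrates this for an unroll factor of 3, reusing the hypothetical load/compute/store model from the earlier sketch:

/* A minimal sketch of handling a trip count that is not divisible by the
   unroll factor: an unrolled-by-3 main loop followed by a scalar cleanup
   loop. The stages and array names are the same illustrative assumptions
   as before, not part of the original example. */
void abc_unrolled(const double *src, double *dst, int bignumber) {
    int i = 0;
    for (; i + 3 <= bignumber; i += 3) {
        double t0 = src[i], t1 = src[i + 1], t2 = src[i + 2];   /* A(i..i+2) */
        t0 = t0 * 2.0 + 1.0;                                    /* B(i)      */
        t1 = t1 * 2.0 + 1.0;                                    /* B(i+1)    */
        t2 = t2 * 2.0 + 1.0;                                    /* B(i+2)    */
        dst[i] = t0; dst[i + 1] = t1; dst[i + 2] = t2;          /* C(i..i+2) */
    }
    for (; i < bignumber; i++) {            /* cleanup: at most 2 leftovers */
        dst[i] = src[i] * 2.0 + 1.0;
    }
}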

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code:

for i = 1 to bignumber
  A(i) ; 3 cycle latency
  B(i) ; 3
  C(i) ; 12 (perhaps a floating point operation)
  D(i) ; 3
  E(i) ; 3
  F(i) ; 3
end

would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache performance; see code bloat). Even worse, the prologue (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient, because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization ineffectual.

By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):

prologue
for i = 1 to (bignumber - 6)
  A(i+6)
  B(i+5)
  C(i+4)
  D(i+2) ; note that we skip i+3
  E(i+1)
  F(i)
end
epilogue

Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)
Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)
Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)
Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)
Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)
Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)
Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our example above:

; loop prologue (arranged on lines for clarity)
A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2) ; cannot start D(1) yet
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:

; loop epilogue (arranged on lines for clarity)
B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5)
C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4)
D(bignumber-1), E(bignumber-2), F(bignumber-3)
D(bignumber), E(bignumber-1), F(bignumber-2)
E(bignumber), F(bignumber-1)
F(bignumber)
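
To show how prologue, kernel, and epilogue fit together in real code, here is a C sketch of the same structure applied to the simpler three-stage (load/compute/store) loop from the example section. It assumes bignumber >= 2; the variables t0, t1, t2 stand in for the registers carrying each iteration through the pipeline, and the array names are the same illustrative placeholders as before:

/* A minimal sketch of a software-pipelined load/compute/store loop:
   prologue fills the pipeline, the kernel runs in steady state, and the
   epilogue drains it. Assumes bignumber >= 2; arrays are 0-indexed here. */
void abc_pipelined(const double *src, double *dst, int bignumber) {
    double t0, t1;
    /* prologue: start the first two iterations */
    t0 = src[0];                            /* A: load element 0            */
    t1 = src[1];  t0 = t0 * 2.0 + 1.0;      /* A: element 1, B: element 0   */
    /* kernel: each pass issues A, B and C for three different iterations */
    for (int i = 2; i < bignumber; i++) {
        double t2 = src[i];                 /* A: load element i            */
        t1 = t1 * 2.0 + 1.0;                /* B: compute element i-1       */
        dst[i - 2] = t0;                    /* C: store element i-2         */
        t0 = t1;  t1 = t2;                  /* shift the software pipeline  */
    }
    /* epilogue: drain the two iterations still in flight */
    dst[bignumber - 2] = t0;                /* C: element bignumber-2       */
    t1 = t1 * 2.0 + 1.0;                    /* B: element bignumber-1       */
    dst[bignumber - 1] = t1;                /* C: element bignumber-1       */
}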

Difficulties of implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The epilogue would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.

In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler (1989) (ISBN 0-89838-300-5). She calls it modulo variable expansion. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the former is 2 cycles. The pipelined body could then be:

A(i+2); B(i)

Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) would produce incorrect results.

However, if we replicate the scheduled loop body, the problem is solved:

A(i+2); B(i)
A(i+3); B(i+1)

Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:

r1 = A(i+2); B(i) = r1
r2 = A(i+3); B(i+1) = r2
i = i + 2 // Just to be clear

On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.
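
The following C sketch shows one way this expanded body can be written, again using the illustrative load/store model: A(i) is treated as a 2-cycle load and B(i) simply stores the value loaded two iterations earlier, so the dependence structure matches the r1/r2 schedule above. It assumes n is even and n >= 2; the temporaries t1 and t2 emulate the "read inputs before writing outputs" rule:

/* A minimal sketch of modulo variable expansion: the scheduled body is
   replicated so that each in-flight value of A gets its own variable
   (r1, r2). Array names and the even-trip-count assumption are
   illustrative, not part of the original example. */
void mve_loop(const double *src, double *dst, int n) {
    double r1 = src[0], r2 = src[1];              /* prologue: A(1), A(2)   */
    int i = 0;
    for (; i + 3 < n; i += 2) {
        double t1 = src[i + 2];  dst[i]     = r1; /* A(i+2) ; B(i)          */
        double t2 = src[i + 3];  dst[i + 1] = r2; /* A(i+3) ; B(i+1)        */
        r1 = t1;  r2 = t2;                        /* r1, r2 now hold the two newest loads */
    }
    /* epilogue: drain the two loads still in flight */
    dst[i] = r1;  dst[i + 1] = r2;
}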

Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:

  • A "rotating" register bank; instructions can refer to a register number that is redirected to a different register each iteration of the loop (eventually looping back around to the beginning). This makes the extra instructions[specify] inserted in the previous example unnecessary.
  • Predicates (used to "predicate" instructions;seeBranch predication) that take their value from special looping instructions. These predicates turn on or off certain instructions in the loop, making a separate prologue and epilogue unnecessary.
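
The sketch below is not IA-64 code; it is only a conceptual C analogy for register rotation, in which a small buffer indexed modulo its size plays the role of the rotating register file, so the "same" register name refers to a different slot in each iteration. The size-2 buffer matches the 2-cycle load latency in the modulo-variable-expansion example above; array names are illustrative and it assumes n >= 2:

/* A conceptual sketch of register rotation (not real IA-64 assembly):
   renaming via a rotation base removes the need to replicate the loop
   body as modulo variable expansion does. */
void rotating_register_sketch(const double *src, double *dst, int n) {
    double rot[2];                        /* tiny rotating "register file"        */
    int rrb = 0;                          /* rotation base, bumped each iteration */
    rot[0] = src[0];                      /* prologue: the first two loads        */
    rot[1] = src[1];
    for (int i = 0; i + 2 < n; i++) {
        dst[i] = rot[rrb];                /* B(i): value loaded two iterations ago */
        rot[rrb] = src[i + 2];            /* A(i+2): reuse the slot just freed     */
        rrb = (rrb + 1) % 2;              /* "rotate" the register names           */
    }
    /* epilogue: drain the last two loads still in the rotating file */
    dst[n - 2] = rot[rrb];
    dst[n - 1] = rot[(rrb + 1) % 2];
}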

References

  1. ^ B. R. Rau and C. D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", in Proceedings of the Fourteenth Annual Workshop on Microprogramming (MICRO-14), December 1981, pages 183-198.
  2. ^ M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", in Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI 88), July 1988, pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
  3. ^ J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein, "Software pipelining showdown: optimal vs. heuristic methods in a production compiler", in Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, June 1996, pages 1-11. Also published as ACM SIGPLAN Notices 31(5).