Pipelining
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. It takes advantage of parallelism that exists among the actions needed to execute an instruction. Today pipelining is the key implementation technique used to make fast CPUs.
Each step in a pipeline completes a part of an instruction. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe. Instructions enter at one end, progress through the stages, and exit at the other end.
Pipelining
The time required between moving an instruction one step down the pipeline is a processor cycle. Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage. In a computer this processor cycle is usually 1 clock cycle (sometimes it is 2).
If the stages are perfectly balanced, then the time per instruction on the pipelined processor, assuming ideal conditions, is equal to:

    Time per instruction on unpipelined machine / Number of pipeline stages
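As a small numeric sketch of that formula (the timing values here are hypothetical, not from the slides):

```python
# Ideal pipelining model: time per instruction drops by the number of stages.
unpipelined_time_ns = 5.0   # hypothetical time per instruction, unpipelined
num_stages = 5

pipelined_time_ns = unpipelined_time_ns / num_stages
speedup = unpipelined_time_ns / pipelined_time_ns

print(pipelined_time_ns)  # time per instruction under ideal conditions
print(speedup)            # equals the number of pipe stages
```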
Pipelining
Under ideal conditions the speedup from pipelining equals the number of pipe stages. Practically:
- The stages will not be perfectly balanced.
- Pipelining involves some overhead.
Pipelining yields a reduction in average execution time per instruction. Pipelining is not visible to the programmer.
Basics of a RISC Instruction Set
- All operations on data apply to data in registers.
- The only operations that affect memory are load and store.
- The instruction formats are few in number, with all instructions typically being one size.
- 64-bit instructions are designated by a D at the start: DADD is the 64-bit version of the ADD instruction.
Basics of a RISC Instruction Set
There are 3 classes of instructions:
- ALU instructions
- Load and store instructions
- Branches and jumps
ALU instructions take either 2 registers, or a register and a sign-extended immediate, operate on these, and store the result into a 3rd register. E.g. DADD, DSUB; logical: AND, OR.
Basics of a RISC Instruction Set
Load and store instructions take a register source called the base register and an immediate field called the offset as operands. Their sum is used as a memory address (the effective address).
- LD: load word
- SD: store word
In the case of a load, a 2nd register operand acts as the destination for the data loaded from memory. In the case of a store, the 2nd register operand is the source of the data that is to be stored into memory.
Basics of a RISC Instruction Set
Branches are conditional transfers of control. There are two ways of specifying the branch condition in RISC:
(i) with a set of condition bits (called a condition code), or
(ii) by a limited set of comparisons, either between a pair of registers or between a register and zero.
The branch destination is obtained by adding a sign-extended offset (16 bits in MIPS) to the current PC. Unconditional jumps are also provided in MIPS.
A Simple Pipelined Implementation
Focus on a pipeline for an integer subset of a RISC architecture that consists of:
- Load and store
- Branch
- Integer ALU operations
Every instruction in this subset can be implemented in at most 5 clock cycles. The five pipeline stages are:
- Instruction Fetch cycle (IF)
- Instruction Decode / Register Fetch cycle (ID)
- Execution / Effective Address cycle (EX)
- Memory Access (MEM)
- Write-Back cycle (WB)
Pipeline Stages
Instruction Fetch cycle (IF)
- Send the content of the PC to memory and fetch the current instruction from memory.
- Update the PC to the next sequential instruction by adding 4 to the PC (assuming 4-byte instructions).
Instruction Decode / Register Fetch cycle (ID)
- Decode the instruction and read the registers.
- Decoding is done in parallel with reading the registers. This is possible because the register specifiers are at a fixed location in a RISC architecture. This technique is known as fixed-field decoding.
Pipeline Stages
Execution / Effective Address cycle (EX)
Performs one of 3 functions depending on the instruction type:
- Memory reference: the ALU adds the base register and the offset to form the effective address (EA).
- Register-register ALU instruction: performs the operation.
- Register-immediate ALU instruction: performs the operation on the value from the register and the sign-extended immediate.
Memory Access (MEM)
- If the instruction is a load, memory does a read using the EA computed in the previous cycle.
- If it is a store, memory writes the data to the location specified by the EA.
Pipeline Stages
Write-Back cycle (WB)
- Register-register ALU instructions or load instructions: write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).
In this implementation:
- A branch instruction requires 2 cycles.
- A store instruction requires 4 cycles.
- All other instructions require 5 cycles.
Classic Pipeline Stages
The pipeline starts a new instruction on each cycle. On each clock cycle another instruction is fetched and begins its 5-cycle execution. During each clock cycle, the hardware will be executing some part of five different instructions. A single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time.
Classic Pipeline Stages
Because the register file is used as a source in the ID stage and as a destination in the WB stage, it appears twice. It is read in one part of a stage (clock cycle) and written in another part, represented by a solid line and a dashed line.
- IM: instruction memory
- DM: data memory
- CC: clock cycle
Pipeline Registers
To ensure that instructions in different stages of a pipeline do not interfere with one another, a separation is made by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle.
Pipeline Registers
Pipeline registers prevent interference between two different instructions in adjacent stages of the pipeline. The registers also play the critical role of carrying the data for a given instruction from one stage to the next. The edge-triggered property of the registers is critical (values change instantaneously on a clock edge); otherwise data from one instruction could interfere with the execution of another.
Basic Performance Issues of Pipelining
Pipelining increases the CPU instruction throughput (i.e., the number of instructions completed per unit time). It does not reduce the execution time of an individual instruction; it usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline. The program runs faster even though no single instruction runs faster.
The clock can run no faster than the time needed for the slowest pipeline stage. Pipeline overhead arises from the combination of pipeline register delay and clock skew.
Basic Performance Issues of Pipelining
Pipeline registers add setup time (the time that a register input must be stable before the clock signal that triggers a write occurs), plus propagation delay, to the clock cycle.
Clock skew is a phenomenon in synchronous digital circuit systems (such as computer systems) in which the same sourced clock signal arrives at different components at different times. The instantaneous difference between the readings of any two clocks is called their skew.
Pipeline Hazards
Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are 3 classes of hazards:
- Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
- Data hazards: arise when an instruction depends on the result of a previous instruction.
- Control hazards: arise from the pipelining of branches and other instructions that change the PC.
Pipeline Hazards
Hazards in a pipeline can make it necessary to stall the pipeline. Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, otherwise the hazard will never clear. As a result, no new instructions are fetched during the stall.
Performance of a Pipeline with Stalls
If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two processors can be equal.
Performance of a Pipeline with Stalls
When all instructions take the same number of cycles, that number must also equal the number of pipeline stages (also called the depth of the pipeline). If there are no pipeline stalls, pipelining can improve performance by the depth of the pipeline (the number of pipeline stages).
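A small sketch of the standard textbook relationship behind this slide, speedup = pipeline depth / (1 + stall cycles per instruction), using hypothetical numbers:

```python
# Speedup of a pipelined machine with stalls, assuming balanced stages
# and an ideal CPI of 1 (textbook relationship; numbers are hypothetical).
pipeline_depth = 5
stall_cycles_per_instruction = 0.25

speedup = pipeline_depth / (1 + stall_cycles_per_instruction)
print(speedup)  # with no stalls this would be 5, the pipeline depth
```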
Structural Hazards
Avoiding structural hazards requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard. Structural hazards arise when some functional unit is not fully pipelined, or when some resource has not been duplicated enough to allow all combinations of instructions.
Why would a designer allow structural hazards? The primary reason is to reduce the cost of the unit.
Data Hazards
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing the instructions on an unpipelined processor. Consider the pipelined execution of the following instructions:

    DADD R1,R2,R3
    DSUB R4,R1,R5
    AND  R6,R1,R7
    OR   R8,R1,R9
    XOR  R10,R1,R11
Data Hazards
All the instructions after DADD use the result of the DADD instruction. DADD writes the value of R1 in the WB pipe stage, but DSUB reads the value during its ID stage. This problem is a data hazard. Unless precautions are taken to prevent it, the DSUB instruction will read the wrong value and try to use it.
AND, which reads R1 during CC4, will also receive the wrong value, because R1 will not be updated by DADD until CC5. XOR operates properly because its register read occurs in CC6, after the register write. OR also operates without a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half.
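A small sketch of that timing (assuming the 5-stage schedule described above: an instruction issued in cycle c reads registers in ID at c+1 and DADD, issued in CC1, writes in WB at CC5; same-cycle read/write is safe because of the half-cycle split):

```python
# Which later readers of R1 see the stale value? Instruction issued at
# cycle c has ID at c+1; DADD (issued CC1) writes R1 in WB at CC5.
# An ID read in the same cycle as the WB write is safe (write first half,
# read second half, as on the slide), so only ID cycles < 5 are hazards.
dadd_wb = 1 + 4
results = {}
for issue, instr in enumerate(["DSUB", "AND", "OR", "XOR"], start=2):
    id_cycle = issue + 1
    results[instr] = "hazard" if id_cycle < dadd_wb else "ok"
print(results)
```

This reproduces the slide's conclusion: DSUB and AND are hazards; OR and XOR are not.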
Data Hazards
Minimizing Data Hazard Stalls by Forwarding
The previous problem can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting). The key insight in forwarding is that the result is not really needed by the DSUB until after the DADD actually produces it. If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, then the need for a stall can be avoided.
Data Hazards
Forwarding works as follows:
1) The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
2) If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
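A minimal sketch of that selection logic (the dictionary layout and helper name are illustrative, not a specific machine's datapath): prefer the newest in-flight value over the register file.

```python
# Forwarding mux for one ALU source operand: prefer the newest value.
# ex_mem / mem_wb stand for the pipeline registers named on the slide.
def select_alu_input(src_reg, regfile, ex_mem, mem_wb):
    if ex_mem["writes_reg"] == src_reg:   # most recent ALU result (EX/MEM)
        return ex_mem["alu_result"]
    if mem_wb["writes_reg"] == src_reg:   # next most recent (MEM/WB)
        return mem_wb["value"]
    return regfile[src_reg]               # otherwise the value read in ID

regfile = {"R1": 0, "R2": 7, "R5": 3}            # stale R1 still in the file
ex_mem = {"writes_reg": "R1", "alu_result": 42}  # DADD's result, not yet written back
mem_wb = {"writes_reg": None, "value": None}

print(select_alu_input("R1", regfile, ex_mem, mem_wb))  # forwarded result
```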
Branch Hazards
Control hazards can cause a greater performance loss than data hazards do. When a branch is executed, it may or may not change the PC to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken (untaken). If instruction i is a taken branch, then the PC is normally not changed until the end of ID, after the completion of the address calculation and comparison.
Branch Hazards
Reducing Pipeline Branch Penalties
Software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior. There are four schemes:
1) Freeze or flush the pipeline: hold or delete any instructions after the branch until the branch destination is known.
2) Predicted-not-taken (predicted-untaken) scheme: implemented by continuing to fetch instructions as if the branch were a normal instruction. If the branch is taken, however, we need to turn the fetched instructions into no-ops and restart the fetch at the target address.
Branch Hazards
3) Predicted-taken scheme: there is no advantage in this approach for the 5-stage pipeline.
4) Delayed branch: the instruction sequence is

    branch instruction
    sequential successor1
    branch target if taken

The sequential successor is in the branch delay slot. This instruction is executed whether or not the branch is taken.
Branch Hazards
Performance of Branch Schemes
Pipeline stall cycles from branches = Branch frequency × Branch penalty
The branch frequency and branch penalty can have a component from both unconditional and conditional branches. However, the latter dominate, since they are more frequent.
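A quick sketch of the formula with hypothetical numbers:

```python
# Stall cycles per instruction contributed by branches (numbers hypothetical).
branch_frequency = 0.25  # fraction of instructions that are branches
branch_penalty = 2       # average stall cycles per branch

branch_stall_cpi = branch_frequency * branch_penalty
effective_cpi = 1 + branch_stall_cpi  # ideal CPI of 1 plus branch stalls
print(branch_stall_cpi, effective_cpi)
```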
Instruction Level Parallelism
Pipelining overlaps the execution of instructions to improve performance. Pipelining does not reduce the execution time of an individual instruction, but it reduces the total execution time of the program. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.
Instruction Level Parallelism
There are two main approaches to exploiting ILP:
- An approach that relies on hardware to help discover and exploit the parallelism dynamically. Processors using this approach, such as the Intel Core series, dominate the desktop and server market.
- An approach that relies on software technology to find the parallelism statically at compile time. Most processors for the PMD (personal mobile device) market use static approaches; however, future processors are using dynamic approaches.
Instruction Level Parallelism
The value of the CPI for a pipelined processor is the sum of the base CPI and all contributions from stalls:

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI or, alternatively, increase the IPC (instructions per clock).
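A numeric sketch of that equation (the stall contributions are hypothetical per-instruction averages):

```python
# Pipeline CPI as the ideal CPI plus stall contributions (hypothetical).
ideal_cpi = 1.0
structural_stalls = 0.125
data_hazard_stalls = 0.25
control_stalls = 0.125

pipeline_cpi = ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls
ipc = 1 / pipeline_cpi   # IPC is the reciprocal of CPI
print(pipeline_cpi)
print(round(ipc, 3))
```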
Instruction Level Parallelism
The amount of parallelism available within a basic block is quite small. Since the instructions in a basic block are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
Instruction Level Parallelism
The simplest and most common way to increase the ILP is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Consider a simple example of a loop that adds two 1000-element arrays and is completely parallel:

    for (i=0; i<=999; i=i+1)
        x[i] = x[i] + y[i];

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap.
Instruction Level Parallelism
There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop, either statically by the compiler or dynamically by the hardware.
Data Dependence
Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited. To exploit ILP we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously. If two instructions are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.
Bernstein's Conditions for Detection of Parallelism
Bernstein's conditions are based on the following two sets of variables:
i. The read set or input set Ri, which consists of the variables read by the statement or instruction Ii.
ii. The write set or output set Wi, which consists of the variables written by instruction Ii.
Two instructions I1 and I2 can be executed in parallel if they satisfy the following conditions:
- R1 ∩ W2 = φ
- R2 ∩ W1 = φ
- W1 ∩ W2 = φ
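The three conditions translate directly into set operations. A minimal sketch (the helper name and example statements are illustrative):

```python
# Bernstein's conditions: two instructions may run in parallel iff
# R1∩W2, R2∩W1 and W1∩W2 are all empty (= φ).
def bernstein_parallel(r1, w1, r2, w2):
    return not (r1 & w2) and not (r2 & w1) and not (w1 & w2)

# I1: a = b + c  -> reads {b, c}, writes {a}
# I2: d = e * f  -> reads {e, f}, writes {d}: disjoint, so parallel
print(bernstein_parallel({"b", "c"}, {"a"}, {"e", "f"}, {"d"}))

# I3: g = a + 1  -> reads {a}, which I1 writes: R2 ∩ W1 != φ
print(bernstein_parallel({"b", "c"}, {"a"}, {"a"}, {"g"}))
```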
Data Dependence
There are three different types of dependences:
- data dependences (also called true data dependences),
- name dependences, and
- control dependences.
Data dependences include:
- true data dependence (or flow dependence),
- antidependence, and
- output dependence.
An instruction j is data dependent on instruction i if either of the following holds:
- instruction i produces a result that may be used by instruction j, or
- instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
Data Dependence
Dependences are a property of programs. The pipeline organization determines whether a dependence is detected and whether it causes a stall. A data dependence conveys 3 things:
- the possibility of a hazard,
- the order in which results must be calculated, and
- an upper bound on how much parallelism can possibly be exploited.
A dependence can be overcome in two different ways:
- maintaining the dependence but avoiding a hazard, or
- eliminating the dependence by transforming the code.
Name Dependence
A name dependence occurs when two instructions use the same name but there is no flow of information associated with that name. There are two types of name dependences between an instruction i that precedes instruction j in program order:
1) Antidependence: instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that instruction i reads the correct value.
2) Output dependence: instruction i and instruction j write the same register or memory location. Ordering must be preserved to ensure that the value finally written corresponds to instruction j.
To resolve name dependences, we use renaming techniques (register renaming).
Data Hazards
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence. Because of the dependence, we have to preserve the program order. There are three types of data hazards:
- Read after write (RAW)
- Write after write (WAW)
- Write after read (WAR)
Data Hazards
Read after write (RAW): instruction j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type. It corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i.
Write after write (WAW): instruction j tries to write an operand before it is written by i. The writes end up being performed in the wrong order. This corresponds to an output dependence.
Data Hazards
Write after read (WAR): instruction j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence.
The read after read (RAR) case is not a hazard.
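The three cases can be sketched as a classifier over read/write sets (helper name and precedence order are illustrative; if several overlaps exist, this sketch reports only the first match):

```python
# Classify the data hazard (if any) between instruction i and a later
# instruction j, given the registers each reads and writes.
def classify_hazard(i_reads, i_writes, j_reads, j_writes):
    if i_writes & j_reads:
        return "RAW"   # j reads what i writes: true dependence
    if i_writes & j_writes:
        return "WAW"   # both write the same location: output dependence
    if i_reads & j_writes:
        return "WAR"   # j writes what i reads: antidependence
    return None        # RAR or no overlap: not a hazard

# i: DADD R1,R2,R3   j: DSUB R4,R1,R5 -> j reads R1, which i writes
print(classify_hazard({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))
```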
Control Dependences
A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order and only when it should be. These control dependences must be preserved to preserve program order. One of the simplest examples of a control dependence is the dependence of the statements in the "then" part of an if statement on the branch.
Control Dependences
For example, in the code segment:

    if p1 {s1}
    if p2 {s2}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependences
In general, two constraints are imposed by control dependences:
- An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.
- An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.
Basic Compiler Techniques for Exposing ILP
These techniques are crucial for processors that use static scheduling. The basic compiler techniques include:
- scheduling the code,
- loop unrolling, and
- reducing branch costs with advanced branch prediction.
Basic Pipeline Scheduling
To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
Basic Pipeline Scheduling
A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline. The latencies of the FP operations used are given below; the last column is the number of intervening clock cycles needed to avoid a stall.

    Instruction producing result | Instruction using result | Latency in clock cycles
    FP ALU op                    | Another FP ALU op        | 3
    FP ALU op                    | Store double             | 2
    Load double                  | FP ALU op                | 1
    Load double                  | Store double             | 0
Basic Pipeline Scheduling
We assume:
- the standard five-stage integer pipeline, so that branches have a delay of one clock cycle;
- the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards;
- an integer ALU operation latency of 0.
Basic Pipeline Scheduling
Consider the following code segment, which adds a scalar to a vector:

    for (i=999; i>=0; i--)
        x[i] = x[i] + s;

This loop is parallel, as can be seen by noticing that the body of each iteration is independent. The first step is to translate the above segment to MIPS assembly language. In the following code segment:
- R1 is initially the address of the element in the array with the highest address, and
- F2 contains the scalar value s.
- Register R2 is precomputed, so that 8(R2) is the address of the last element to operate on.
Basic Pipeline Scheduling
The straightforward MIPS code, not scheduled for the pipeline, looks like this:

    Loop: L.D    F0,0(R1)    ;F0 = array element
          ADD.D  F4,F0,F2    ;add scalar in F2
          S.D    F4,0(R1)    ;store result
          DADDUI R1,R1,#-8   ;decrement pointer; 8 bytes (per DW)
          BNE    R1,R2,Loop  ;branch if R1 != R2
Basic Pipeline Scheduling
Without any scheduling, the loop will execute as follows:

                              Clock cycle issued
    Loop: L.D    F0,0(R1)     1
          stall                2
          ADD.D  F4,F0,F2     3
          stall                4
          stall                5
          S.D    F4,0(R1)     6
          DADDUI R1,R1,#-8    7
          stall                8
          BNE    R1,R2,Loop   9
Basic Pipeline Scheduling
We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:

                              Clock cycle issued
    Loop: L.D    F0,0(R1)     1
          DADDUI R1,R1,#-8    2
          ADD.D  F4,F0,F2     3
          stall                4
          stall                5
          S.D    F4,8(R1)     6
          BNE    R1,R2,Loop   7

The two stalls after ADD.D are for its use by the S.D.
Basic Pipeline Scheduling
In the previous example, we complete one loop iteration and store back one array element every seven clock cycles. The actual work of operating on the array element takes just three (the load, add, and store) of those seven clock cycles. The remaining four clock cycles consist of loop overhead (the DADDUI and BNE) and two stalls. To eliminate these four clock cycles, we need to get more operations relative to the number of overhead instructions.
Loop Unrolling
A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code. Loop unrolling can also be used to improve scheduling: because it eliminates the branch, it allows instructions from different iterations to be scheduled together.
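A source-level sketch of the transformation (the slides unroll the MIPS loop; this just illustrates unrolling by 4, assuming the element count is a multiple of 4):

```python
# Original loop: one add plus one trip of loop overhead per element.
# Unrolled by 4: four adds share a single trip of loop overhead.
x = [float(i) for i in range(1000)]
s = 5.0

for i in range(0, 1000, 4):
    x[i]     = x[i]     + s
    x[i + 1] = x[i + 1] + s
    x[i + 2] = x[i + 2] + s
    x[i + 3] = x[i + 3] + s

print(x[0], x[999])
```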
Loop Unrolling
If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop. Thus, we will want to use different registers for each iteration, increasing the required number of registers.
Loop Unrolling without Scheduling
Here we assume that the number of elements is a multiple of 4. Note that R2 must now be set so that 32(R2) is the starting address of the last four elements.
Loop Unrolling without Scheduling
We have eliminated 3 branches and 3 decrements of R1. Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 27 clock cycles:
- each L.D has 1 stall (1 × 4 = 4),
- each ADD.D has 2 stalls (2 × 4 = 8),
- the DADDUI has 1 stall (1 × 1 = 1),
- plus 14 instruction issue cycles,
or 27/4 = 6.75 clock cycles per element. This can be scheduled to improve performance significantly.
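The cycle accounting above can be checked directly:

```python
# Cycle count for the 4x-unrolled, unscheduled loop from the slide.
load_stalls = 1 * 4    # each of the 4 L.D instructions stalls 1 cycle
add_stalls = 2 * 4     # each of the 4 ADD.D instructions stalls 2 cycles
daddui_stalls = 1 * 1  # the single DADDUI stalls 1 cycle
issue_cycles = 14      # 4 L.D + 4 ADD.D + 4 S.D + DADDUI + BNE

total = load_stalls + add_stalls + daddui_stalls + issue_cycles
print(total, total / 4)  # total cycles, cycles per element
```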
Loop Unrolling with Scheduling
The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with:
- 9 cycles per element before any unrolling or scheduling,
- 7 cycles when scheduled but not unrolled, and
- 6.75 cycles with unrolling but no scheduling.
Strip Mining
In real programs we do not usually know the upper bound on the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body. Instead of a single unrolled loop, we generate a pair of consecutive loops:
- The first executes (n mod k) times and has a body that is the original loop.
- The second is the unrolled body surrounded by an outer loop that iterates (n/k) times.
For large values of n, most of the execution time will be spent in the unrolled loop body.
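The pair of loops can be sketched as follows (the function name is illustrative; it applies the slides' add-a-scalar loop with an arbitrary trip count n):

```python
# Strip mining: split a loop of unknown trip count n into a cleanup loop
# of (n mod k) original iterations followed by a loop unrolled by k.
def add_scalar_strip_mined(x, s, k=4):
    n = len(x)
    rem = n % k
    for i in range(rem):        # first loop: n mod k original iterations
        x[i] += s
    for i in range(rem, n, k):  # second loop: iterates n // k times
        x[i] += s               # unrolled body, k copies
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x

print(add_scalar_strip_mined([0.0] * 10, 1.0))  # works for n not a multiple of k
```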
Loop Unrolling
Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively. Three different effects limit the gains from loop unrolling:
(1) A decrease in the amount of overhead amortized with each unroll: if the loop is unrolled twice as many times (2n), the remaining overhead is only 1/2 the overhead of unrolling n times, so each doubling gains less.
(2) Code size limitations: growth in code size may increase the instruction cache miss rate.
(3) Compiler limitations: a shortfall in registers (register pressure).
Branch Prediction
Loop unrolling is one way to reduce the number of branch hazards. We can also reduce the performance losses of branches by predicting how they will behave. Branch prediction schemes are of two types:
- static branch prediction (or compile-time branch prediction), and
- dynamic branch prediction.
Static Branch Prediction
It is the simplest scheme, because it does not rely on information about the dynamic history of the executing code. It relies on information available at compile time and predicts the outcome of a branch based solely on the branch instruction; i.e., it uses information that was gathered before the execution of the program, such as profile information collected from earlier runs.
Dynamic Branch Prediction
Dynamic schemes predict branches based on program behavior, using information about taken and not-taken branches gathered at run time to predict the outcome of a branch. The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory location contains a bit that says whether the branch was recently taken or not.
Dynamic Branch Prediction
Different branch instructions may have the same low-order bits, so with such a buffer we do not know whether the prediction is correct. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back.
This simple 1-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken, since the misprediction causes the prediction bit to be flipped.
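The double-misprediction effect can be seen in a short simulation (the outcome stream is a made-up loop branch that is taken 9 times, not taken once at loop exit, then taken again on re-entry):

```python
# 1-bit predictor: a single not-taken outcome in a run of taken branches
# costs two mispredictions, one when it happens and one on the next taken.
outcomes = [True] * 9 + [False] + [True] * 9
bit = True                  # current prediction: taken
mispredictions = 0
for taken in outcomes:
    if bit != taken:
        mispredictions += 1
        bit = taken         # flip the bit on a miss
print(mispredictions)
```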
Dynamic Branch Prediction
2-bit Prediction Scheme
To overcome the weakness of the 1-bit prediction scheme, 2-bit prediction schemes are often used. In a 2-bit scheme, a prediction must miss twice before it is changed. The figure shows the finite-state diagram for a 2-bit prediction scheme.
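The state machine can be modeled as a 2-bit saturating counter (states 0..3, predict taken when the counter is 2 or more). On the same outcome stream used for the 1-bit case above, the single not-taken result now costs only one misprediction:

```python
# 2-bit saturating counter: a prediction must miss twice before it flips.
counter = 3                 # strongly taken
mispredictions = 0
for taken in [True] * 9 + [False] + [True] * 9:
    predicted_taken = counter >= 2
    if predicted_taken != taken:
        mispredictions += 1
    # saturating update toward the actual outcome
    counter = min(3, counter + 1) if taken else max(0, counter - 1)
print(mispredictions)
```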
Dynamic Branch Prediction
Correlating Branch Predictors
The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch. It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches, rather than just the branch we are trying to predict. Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
Dynamic Branch Prediction
Correlating Branch Predictors
Consider the following code:

    if (aa == 2)      // branch b1
        aa = 0;
    if (bb == 2)      // branch b2
        bb = 0;
    if (aa != bb) {   // branch b3
        ........
    }

The behavior of branch b3 is correlated with the behavior of branches b1 and b2. If branches b1 and b2 are both not taken, then branch b3 will be taken.
Dynamic Branch Prediction
Correlating Branch Predictors
A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior. Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch. For example, a (1,2) predictor uses the behavior of the last branch to choose from among a pair of 2-bit branch predictors in predicting a particular branch.
Dynamic Branch Prediction
Correlating Branch Predictors
In the general case, an (m, n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch. The attraction of this type of correlating branch predictor is that it can yield higher prediction rates than the 2-bit scheme and requires only a trivial amount of additional hardware.
Dynamic Branch Prediction
Correlating Branch Predictors
The global history of the most recent m branches can be recorded in an m-bit shift register, where each bit records whether the branch was taken or not taken. The branch-prediction buffer can then be indexed using a concatenation of the low-order bits from the branch address with the m-bit global history.
Dynamic Branch Prediction
Correlating Branch Predictors
For example, in a (2, 2) buffer with 64 total entries, the 4 low-order address bits of the branch (word address) and the 2 global bits representing the behavior of the two most recently executed branches form a 6-bit index that can be used to index the 64 counters.
The number of bits in an (m, n) predictor is:

    2^m × n × Number of prediction entries selected by the branch address

A 2-bit predictor with no global history is simply a (0, 2) predictor.
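The (2, 2) example and the sizing formula can be sketched numerically (the particular address and history bit values are made up for illustration):

```python
# Index formation and storage size for a correlating (m, n) predictor.
m, n = 2, 2                # (2, 2): 2 bits of global history, 2-bit counters
addr_entries = 16          # entries selected by the 4 low-order address bits

# 4 address bits concatenated with 2 history bits -> 6-bit index, 64 counters
branch_addr_bits = 0b1011  # hypothetical low-order word-address bits
global_history = 0b10      # last two branches: taken, not taken
index = (branch_addr_bits << m) | global_history
print(index)               # a value in 0..63

total_bits = (2 ** m) * n * addr_entries
print(total_bits)          # bits of prediction state
```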
Dynamic Branch Prediction
Tournament Predictors
Tournament predictors use multiple predictors, usually one based on global information and one based on local information, combining them with a selector. Tournament predictors can achieve both better accuracy at medium sizes (8K-32K bits) and effective use of very large numbers of prediction bits.
Dynamic Branch Prediction
Tournament Predictors
Existing tournament predictors use a 2-bit saturating counter per branch to choose among two different predictors, based on which predictor (local, global, or even some mix) was most effective in recent predictions. As in a simple 2-bit predictor, the saturating counter requires two mispredictions before changing the identity of the preferred predictor.
Dynamic Branch Prediction
Tournament Predictors
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch.
Dynamic Branch Prediction
Fig: The misprediction rate for three different predictors on the SPEC89 benchmark as the total number of bits is increased.
Hardware-Based Speculation
Speculation overcomes control dependency by:
- predicting the branch outcome, and
- speculatively executing instructions as if the predictions were correct.
Hardware-based speculation combinesthree key ideas:1) dynamic branch prediction to choose which instructionsto execute2) speculation to allow the execution of instructions beforethe control dependences are resolved (with the abilityto undo the effects of an incorrectly speculatedsequence)3) dynamic scheduling to deal with the scheduling ofdifferent combinations of basic blocks.Hardware Based Speculation
Hardware-Based Speculation
Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions.
This method of executing programs is essentially a data flow execution: operations execute as soon as their operands are available.
Hardware-Based Speculation
The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order, and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits.
Hence, when we add speculation, we need to separate the process of completing execution from instruction commit, since instructions may finish execution considerably before they are ready to commit.
Hardware-Based Speculation
Adding the commit phase to the instruction execution sequence requires an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed.
This hardware buffer, the reorder buffer, is also used to pass results among instructions that may be speculated.
Reorder Buffer (ROB)
• The reorder buffer (ROB) provides additional registers.
• The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.
• Hence, the ROB is a source of operands for instructions.
Reorder Buffer (ROB)
• With speculation, the register file is not updated until the instruction commits;
• thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit.
Reorder Buffer (ROB)
Each entry in the ROB contains four fields: the instruction type, the destination field, the value field, and the ready field.
The instruction type field indicates whether the instruction is a branch (which has no destination result), a store (which has a memory address destination), or a register operation (an ALU operation or load, which has a register destination).
Reorder Buffer (ROB)
The destination field supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written.
The value field holds the value of the instruction result until the instruction commits.
The ready field indicates that the instruction has completed execution and the value is ready.
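The four ROB fields described above can be modeled as a small record; the class and field names are illustrative, not a real hardware definition.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    instr_type: str               # "branch", "store", or "register" operation
    destination: Optional[int]    # register number, or memory address for a store
    value: Optional[int] = None   # result held here until the instruction commits
    ready: bool = False           # set once execution has completed

# An ALU operation destined for register 3: issued first, result filled in later
entry = ROBEntry(instr_type="register", destination=3)
entry.value, entry.ready = 42, True   # execution finishes; commit may proceed
```

A branch entry would carry no destination (`destination=None`), matching the "no destination result" case in the instruction type field description.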
Steps in Execution
There are four steps involved in instruction execution:
Issue
Execute
Write result
Commit
Steps in Execution
Issue:
Get an instruction from the instruction queue.
Issue the instruction if there is an empty reservation station and an empty slot in the ROB; send the operands to the reservation station if they are available in either the registers or the ROB.
Update the control entries to indicate the buffers are in use.
The number of the ROB entry allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB (Common Data Bus).
Steps in Execution
Issue:
If either all reservation stations are full or the ROB is full, then instruction issue is stalled until both have available entries.
Write result:
When the result is available, write it on the CDB (with the ROB tag sent when the instruction issued) and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
Mark the reservation station as available.
Steps in Execution
Write result:
Special actions are required for store instructions.
If the value to be stored is available, it is written into the value field of the ROB entry for the store.
If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the value field of the ROB entry of the store is updated.
Steps in Execution
Commit:
This is the final stage of completing an instruction, after which only its result remains.
There are three different sequences of actions at commit, depending on whether the committing instruction is:
a branch with an incorrect prediction,
a store, or
any other instruction (normal commit).
Steps in Execution
Commit:
The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer; at this point, the processor updates the register with the result and removes the instruction from the ROB.
Committing a store is similar, except that memory is updated rather than a result register.
Steps in Execution
Commit:
When a branch with an incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong.
The ROB is flushed and execution is restarted at the correct successor of the branch.
If the branch was correctly predicted, the branch is finished.
Steps in Execution
Once an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated, eliminating the need for the ROB entry.
If the ROB fills, we simply stop issuing instructions until an entry is made free.
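The in-order commit logic walked through on the preceding slides can be sketched as a loop over the ROB head, with the three cases (mispredicted branch, store, normal commit). This is a hedged toy model: the dict-based ROB entries, the register file and memory as dictionaries, and the return values are all illustrative assumptions.

```python
from collections import deque

regs, mem = {}, {}   # architectural state, updated only at commit
rob = deque()        # entries: {"type", "dest", "value", "ready", ...}

def commit():
    # Retire ready instructions strictly from the head of the ROB, in order
    while rob and rob[0]["ready"]:
        e = rob.popleft()
        if e["type"] == "branch" and e.get("mispredicted"):
            rob.clear()                    # flush; restart at correct successor
            return "flush"
        elif e["type"] == "store":
            mem[e["dest"]] = e["value"]    # store commit updates memory
        else:
            regs[e["dest"]] = e["value"]   # normal commit updates a register
    return "ok"                            # head not ready (or ROB empty): wait
```

Note that a not-yet-ready entry at the head blocks everything behind it, which is exactly what forces in-order commit even though execution finished out of order.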
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
In contrast, a more general method to exploit thread-level parallelism (TLP) is with a multiprocessor, which has multiple independent threads operating at once and in parallel.
Multithreading, however, does not duplicate the entire processor as a multiprocessor does.
Instead, multithreading shares most of the processor core among a set of threads, duplicating only the per-thread state.
contd.
• Duplicating the per-thread state of a processor core means creating a separate register file, a separate PC, and a separate page table for each thread.
• There are three main hardware approaches to multithreading:
1. Fine-grained multithreading switches between threads on each clock, causing the execution of instructions from multiple threads to be interleaved.
2. Coarse-grained multithreading switches threads only on costly stalls, such as level two or three cache misses.
3. Simultaneous multithreading is a variation on fine-grained multithreading that arises naturally when fine-grained multithreading is implemented on top of a multiple-issue, dynamically scheduled processor.
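Fine-grained multithreading (approach 1 above) can be illustrated with a toy round-robin scheduler that issues one instruction from a different thread each cycle. The thread names, instruction labels, and single-issue assumption are all illustrative; real hardware interleaves at the pipeline level, not in software.

```python
def fine_grained_schedule(threads):
    # threads: dict mapping thread name -> list of instruction labels.
    # Returns the per-cycle trace of (thread, instruction) pairs.
    streams = {t: iter(s) for t, s in threads.items()}
    order, trace = list(streams), []   # fixed round-robin order
    while streams:
        for t in order:                # switch thread every clock cycle
            if t not in streams:
                continue               # this thread has already finished
            try:
                trace.append((t, next(streams[t])))  # one instruction/cycle
            except StopIteration:
                del streams[t]         # retire the exhausted thread
    return trace

# Two toy threads: their instructions come out interleaved, T0/T1 alternating
trace = fine_grained_schedule({"T0": ["i0", "i1", "i2"], "T1": ["j0", "j1"]})
```

Coarse-grained multithreading would instead keep issuing from one thread and switch only when that thread hits a costly stall such as an L2 or L3 cache miss.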
Fig.: The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors.