Module 2 of APJ Abdul Kalam University HPC

PipeliningPipelinig is an implementation technique wherebymultiple instructions are overlapped in execution.It takes advantage of parallelism that exist among theactions needed to execute an instruction.Today pipelining is the key implementation techniqueused to make fast CPU.Each step in pipeline completes a part of an instruction.Each of these steps is called a pipe stage or a pipesegment.The stages are connected one to next to form a pipe.Instructions enter at one end, progress through thestages, and exit at the other end.
PipeliningThe time required between moving an instruction one stepdown the pipeline is a processor cycle.Because all stages proceed at the same time, the length ofa processor cycle is determined by the time required for theslowest pipe stage.In a computer this processor cycle is usually 1 clock cycle(some times it is 2).If the stages are perfectly balanced, then the time perinstruction on the pipelined processor- assuming idealconditions- is equal to :Time per instruction on unpipelined machine-----------------------------------------------------------Number of pipeline stages.
PipeliningUnder ideal conditions the speed up from pipelining equalsthe number of pipe stages.PracticallyThe stages will not be perfectly balancedPipelining involves some overhead.Pipelining yields a reduction in average execution time perinstruction.Pipelining is not visible to the programmer.
Basics of a RISC Instruction Set
- All operations on data apply to data in registers.
- The only operations that affect memory are load and store.
- The instruction formats are few in number, with all instructions typically being one size.
- 64-bit instructions are designated by a D prefix: DADD is the 64-bit version of the ADD instruction.

Basics of a RISC Instruction Set
Three classes of instructions: ALU instructions, load and store instructions, and branches and jumps.
ALU instructions:
- Take either two registers, or a register and a sign-extended immediate.
- Operate on these and store the result into a third register.
- E.g. DADD, DSUB; logical: AND, OR.

Basics of a RISC Instruction Set
Load and store instructions:
- Take a register source called the base register and an immediate field called the offset as operands.
- Their sum is used as a memory address (the effective address).
- LD: load word; SD: store word.
- In the case of a load, a second register operand acts as the destination for the data loaded from memory.
- In the case of a store, the second register operand is the source of the data that is to be stored into memory.

Basics of a RISC Instruction Set
Branches and jumps:
- Branches are conditional transfers of control.
- Two ways of specifying the branch condition in RISC:
  (i) with a set of condition bits (called a condition code), or
  (ii) by a limited set of comparisons, either between a pair of registers or between a register and zero.
- The branch destination is obtained by adding a sign-extended offset (16 bits in MIPS) to the current PC.
- Unconditional jumps are also provided in MIPS.
A Simple Pipelined Implementation
Focus on a pipeline for an integer subset of a RISC architecture that consists of load-store, branch, and integer ALU operations. Every instruction in this subset can be implemented in at most 5 clock cycles. The five pipeline stages are:
- Instruction Fetch cycle (IF)
- Instruction Decode / Register Fetch cycle (ID)
- Execution / Effective Address cycle (EX)
- Memory Access (MEM)
- Write-Back cycle (WB)

Pipeline Stages
Instruction Fetch cycle (IF):
- Send the contents of the PC to memory and fetch the current instruction from memory.
- Update the PC to the next sequential instruction by adding 4 to the PC (assuming 4-byte instructions).
Instruction Decode / Register Fetch cycle (ID):
- Decode the instruction and read the registers.
- Decoding is done in parallel with reading the registers. This is possible because the register specifiers are at a fixed location in a RISC architecture. This technique is known as fixed-field decoding.

Pipeline Stages
Execution / Effective Address cycle (EX)
Performs one of three functions depending on the instruction type:
- Memory reference: the ALU adds the base register and the offset to form the effective address.
- Register-register ALU instruction: performs the operation.
- Register-immediate ALU instruction: performs the operation on the value from the register and the sign-extended immediate.
Memory Access (MEM):
- If the instruction is a load, memory does a read using the effective address computed in the previous cycle.
- If it is a store, memory writes the data to the location specified by the effective address.

Pipeline Stages
Write-Back cycle (WB):
- For register-register ALU instructions and load instructions, write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).
In this implementation:
- A branch instruction requires 2 cycles.
- A store instruction requires 4 cycles.
- All other instructions require 5 cycles.
The Classic Five-Stage Pipeline
Classic Pipeline StagesStarts a new instruction on each cycle.On each clock cycle another instrucion is fetched and beginsits 5 cycle execution.During each clock cycle, h/w will be executing some part ofthe five different instructions.A single ALU can not be asked to compute an effectiveaddress and perform a subtract operation at the same time.
Classic Pipeline StagesBecause register file is used as a source in the ID stage andas a destination in the WB stage it appears twice.It is read in one part of a stage( clock cycle) and written inanother part, represented by a solid line and a dashed line.IM – Instruction MemoryDM – Data MemoryCC – clock cycle.
Pipeline Registers between successive pipeline stages
Pipeline RegistersTo ensure that instructions in different states of a pipe linedo not interfere with one another,a separaion is done by introducing pipeline registersbetween successive stages of the pipeline, sothatat the end of a clock cycle all the results from agiven stage are stored into a registerthat is used as the input to the next stage on thenext clock cycle.
Pipeline RegistersPipeline registers prevent interference between two differentinstructions in adjacent stages in the pipeline.The registers also play a critical role of carrying data for agiven instruction from one stage to the other.The edge-triggered property of register is critical. (valuechange instantaneously on a clock edge)Otherwise data from one instruction could interfere with theexecution of another.
Basic Performance Issues of PipeliningPipelining increases the CPU instruction throughput. (ie. thenumber of instructions completed per unit time)It does not reduces the execution time of an individual instr.It usually slightly increases the execution time of eachinstruction due to overhead in the control of the pipeline.Program runs faster, eventhough no single instruction runsfaster.The clock can run no faster than the time needed for theslowest pipeline stage.Pipeline overhead arises from the combination of pipelineregister delay and clock skew.
Basic Performance Issues of PipeliningPipeline registers add set up time – which is the time that aregister input must be stable before the clock signal thattriggers a write occurs, plus propagation delay to the clockcycle.Clock skew is a phenomenon in synchronous digital circuitsystems (such as computer systems) in whichthe same sourced clock signal arrives at differentcomponents at different times.The instantaneous difference between the readings ofany two clocks is called their skew.
Pipeline Hazards
Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:
- Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
- Data hazards: arise when an instruction depends on the results of a previous instruction.
- Control hazards: arise from the pipelining of branches and other instructions that change the PC.

Pipeline Hazards
- Hazards in a pipeline can make it necessary to stall the pipeline.
- Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed.
- When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, otherwise the hazard will never clear.
- As a result, no new instructions are fetched during the stall.
Performance of Pipeline with Stalls
If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two processors (pipelined and unpipelined) can be equal.

Performance of Pipeline with Stalls
When all instructions take the same number of cycles, which must also equal the number of pipeline stages (also called the depth of the pipeline), then, if there are no pipeline stalls, pipelining can improve performance by a factor equal to the depth of the pipeline (the number of pipeline stages).
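Under these assumptions the speedup works out to depth / (1 + stall cycles per instruction), which the following sketch evaluates. The 0.5 stall figure is an assumed number for illustration, not data from the slides.

```python
# Speedup from pipelining with balanced stages and no cycle-time overhead:
# speedup = pipeline depth / (1 + pipeline stall cycles per instruction).
def pipeline_speedup(depth, stalls_per_instruction=0.0):
    return depth / (1.0 + stalls_per_instruction)

print(pipeline_speedup(5))        # 5.0: with no stalls, speedup equals the depth
print(pipeline_speedup(5, 0.5))   # half a stall cycle per instruction cuts it to ~3.33
```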
Structural Hazards
- Avoiding structural hazards requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
- If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard.
- Structural hazards arise when some functional unit is not fully pipelined, or some resource has not been duplicated enough to allow all combinations of instructions.
- Why would a designer allow structural hazards? The primary reason is to reduce the cost of the unit.
Structural Hazards
Structural Hazards
Data Hazards
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Consider the pipelined execution of the following instructions:

    DADD R1,R2,R3
    DSUB R4,R1,R5
    AND  R6,R1,R7
    OR   R8,R1,R9
    XOR  R10,R1,R11
Data Hazards
Data Hazards
- All the instructions after DADD use the result of the DADD instruction.
- DADD writes the value of R1 in the WB pipe stage, but DSUB reads the value during its ID stage. This problem is a data hazard. Unless precautions are taken to prevent it, the DSUB instruction will read the wrong value and try to use it.
- AND reads R1 during CC4 and will receive the wrong value, because R1 is not updated until CC5 by DADD.
- XOR operates properly because its register read occurs in CC6, after the register write.
- OR also operates without a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half.
Data Hazards
Minimizing Data Hazard Stalls by Forwarding
The previous problem can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting). The key insight in forwarding is that the result is not really needed by the DSUB until after the DADD actually produces it. If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, then the need for a stall can be avoided.

Data Hazards
Forwarding works as follows:
1) The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
2) If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
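The forwarding mux selection above can be sketched in software. This is a simplified model, not the actual hardware: the pipeline registers are represented as hypothetical (destination register, result) pairs.

```python
# Sketch of the forwarding check: prefer the newest in-flight result for a
# source register over the (possibly stale) register file value.
def select_alu_input(src_reg, regfile, ex_mem, mem_wb):
    # ex_mem / mem_wb model pipeline registers as (dest_reg, result) or None.
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]        # forward from the EX/MEM pipeline register
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]        # forward from the MEM/WB pipeline register
    return regfile[src_reg]     # no hazard: use the register file value

regs = {"R1": 0, "R5": 7}
# DADD has computed R1 = 42 in EX/MEM but has not yet written it back:
print(select_alu_input("R1", regs, ("R1", 42), None))  # 42, not the stale 0
print(select_alu_input("R5", regs, ("R1", 42), None))  # 7, from the register file
```

Checking EX/MEM before MEM/WB matters: when both hold results for the same register, the EX/MEM entry is the younger instruction's result.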
Data Hazards
Data Hazards
Data Hazards
Data Hazards Requiring Stalls
Not all potential data hazards can be handled by bypassing. Consider:

    LD   R1,0(R2)
    DSUB R4,R1,R5
    AND  R6,R1,R7
    OR   R8,R1,R9
Data Hazards
Data Hazards
Data Hazards
Branch Hazards
- Control hazards can cause a greater performance loss than data hazards.
- When a branch is executed, it may or may not change the PC to something other than its current value plus 4.
- If a branch changes the PC to its target address, it is a taken branch. If it falls through, it is not taken (or untaken).
- If instruction i is a taken branch, then the PC is normally not changed until the end of ID, after the completion of the address calculation and comparison.

Branch Hazards
The following figure shows how a branch causes a 1-cycle stall in the five-stage pipeline.
Branch Hazards
Reducing Pipeline Branch Penalties
Software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior. Four schemes:
1) Freeze or flush the pipeline: hold or delete any instructions after the branch until the branch destination is known.
2) Predicted-not-taken (or predicted-untaken) scheme: implemented by continuing to fetch instructions as if the branch were a normal instruction. If the branch is taken, however, we need to turn the fetched instruction into a no-op and restart the fetch at the target address.

Branch Hazards
3) Predicted-taken scheme: there is no advantage in this approach for the five-stage pipeline, since the target address and the branch outcome are both known at the same point (the end of ID).
4) Delayed branch: the instruction sequence is
    branch instruction
    sequential successor
    branch target if taken
The sequential successor is in the branch delay slot. This instruction is executed whether or not the branch is taken.
Branch Hazards
The predicted-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom).

Branch Hazards
The pipeline behavior of the five-stage pipeline with a branch delay is shown in the figure.
Branch Hazards
Performance of Branch Schemes

    Pipeline stall cycles from branches = Branch frequency × Branch penalty

The branch frequency and branch penalty can have a component from both unconditional and conditional branches. However, the latter dominate since they are more frequent.
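The formula above is a one-liner; the 20% branch frequency below is an assumed figure for illustration, not a measured one.

```python
# Pipeline stall cycles from branches = branch frequency x branch penalty.
def branch_stall_cycles(branch_frequency, branch_penalty):
    return branch_frequency * branch_penalty

# Assumed numbers: 20% of instructions are branches, each costing 1 stall cycle,
# adding 0.2 stall cycles per instruction on average.
print(branch_stall_cycles(0.20, 1))
```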
Instruction Level Parallelism Pipelining overlaps the execution of instructions to improveperformance. Pipelining does not reduce the execution time of aninstruction. But it reduces the total execution time of the program. This potential overlap among instructions is called“Instruction Level Parallelism”(ILP), since the instructions canbe evaluated in parallel.
Instruction Level Parallelism There are two main approaches to exploit ILP: An approach that relies on Hardware to help discover andexploit parallelism dynamically. Used in Intel Core series dominate in the desktop andserver market. An approach that relies on software technology to findparallelism, statically at Compiler time. Most processors for the PMD(Personal Mobile Device)market use static approaches. However, future processors are using dynamicapproaches
Instruction Level Parallelism The value of CPI for a pipeline processor is the sum of thebase CPI and all contributions from stalls. Pipeline CPI = Ideal pipeline CPI +Structural stalls +Data hazard stalls +Control stalls. Ideal pipeline CPI is a measure of the maximum performanceattainable by the implementation. By reducing each of the terms of the right hand side, weminimize the overall pipeline CPI or alternatively, increasethe IPC ( Instructions Per Clock)
Instruction Level Parallelism The amount of parallelism available within a basic block isquite small. Since these instructions are likely to depend upon oneanother, the amount of overlap we can exploit within a basicblock is likely to be less than the average basic blocksize. To obtain substantial performance enhancements, wemust exploit ILP across multiple basic blocks.
Instruction Level Parallelism The simplest and most common way to increase the ILP isto exploit parallelism among iterations of a loop. This type of parallelism is often called loop-levelparallelism. Consider a simple example of a loop that adds two 1000-element arrays and is completely parallel: for (i=0; i<=999; i=i+1)x[i] = x[i] + y[i]; Every iteration of the loop can overlap with any otheriteration Within each loop iteration there is little or noopportunity for overlap.
Instruction Level Parallelism There are number of techniques for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop either statically by the compiler or dynamically by the hardware
Data Dependence Determining how one instruction depends on another iscritical to determine How much parallelism exists in a program How that parallelism can be exploited. To exploit ILP we must determine which instructions canbe executed in parallel. If two instructions are parallel, they can executesimultaneously. If two instructions are dependent, they are not parallel andmust be executed in order, although they may often bepartially overlapped
Bernstein's Conditions for Detection of Parallelism
Bernstein's conditions are based on the following two sets of variables:
i. the read set (or input set) Ri, which consists of the variables read by instruction Ii;
ii. the write set (or output set) Wi, which consists of the variables written by instruction Ii.
Two instructions I1 and I2 can be executed in parallel if they satisfy the following conditions:
    R1 ∩ W2 = φ
    R2 ∩ W1 = φ
    W1 ∩ W2 = φ
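Bernstein's conditions translate directly into set intersections. A minimal sketch, using Python sets for the read and write sets of two statements:

```python
# Two instructions are parallel iff R1∩W2, R2∩W1 and W1∩W2 are all empty.
def bernstein_parallel(r1, w1, r2, w2):
    r1, w1, r2, w2 = map(set, (r1, w1, r2, w2))
    return not (r1 & w2) and not (r2 & w1) and not (w1 & w2)

# I1: a = b + c   I2: d = e + f   -> disjoint sets, can run in parallel
print(bernstein_parallel({"b", "c"}, {"a"}, {"e", "f"}, {"d"}))   # True
# I1: a = b + c   I2: d = a + f   -> I2 reads what I1 writes (R2 ∩ W1 != φ)
print(bernstein_parallel({"b", "c"}, {"a"}, {"a", "f"}, {"d"}))   # False
```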
Data Dependence Three different types of dependences: data dependences (also called true data dependences), name dependences and control dependences. Data Dependences True data dependence ( or flow dependence) Anti dependence Output dependence An instruction j is data dependent on instruction i if either ofthe following holds: Instruction i produces a result that may be used byinstruction j, or Instruction j is data dependent on instruction k andinstruction k is data dependent on instruction i
Data Dependence Dependences are a property of programs Pipeline organization determines if dependence is detected and if it causes a stall Data dependence conveys 3 things: Possibility of a hazard Order in which results must be calculated An upper bound on howmuch parallelism can possiblybe exploited. A dependence can be overcome in two different ways: Maintaining the dependence but avoiding a hazard Eliminating the dependence by transforming the code.
Name Dependence Two instructions use the same name but no flow ofinformation associated with that name. Two types of Name Dependences between an instruction ithat precedes instruction j in program order 1) Antidependence: instruction j writes a register ormemory location that instruction i reads. The original ordering must be preserved to ensure thatinstruction i reads the correct value. 2) Output dependence: instruction i and instruction j writethe same register or memory location Ordering must be preserved to ensure that the valuefinally written corresponds to instruction j. To resolve name dependences, we use renaming techniques(register renaming)
Data Hazards A hazard is created whenever there is a dependencebetween instructions, and they are close enough that the overlap during executionwould change the order of access to the operand involvedin the dependence. Because of the dependence, we have to preserve theprogram order. Three types of Data Hazards Read after write (RAW) Write after write (WAW) Write after read (WAR)
Data Hazards Read after write (RAW) Instruction j tries to read a source before i writes it, so jincorrectly gets the old value. This hazard is the most common type It corresponds to a true data dependentce. Pgm order must be preserved to ensure that j receivesthe value from i . Write After Write (WAW) Instruction j tries to write an operand before it is writtenby i. The writes end up being performed in the wrong order. This corresponds to an output dependence
Data Hazards Write After Read (WAR) Instruction j tries to write a destination before it is readby i, so i incorrectly gets the new value. This hazard arises from an antidependence. Read After Read (RAR) case is not a hazard.
Control Dependences A control dependence determines the ordering of aninstruction, i, w.r.t. a branch instruction so that the instructioni executed in correct program order and only when it shouldbe. These control dependences must be preserved to preserveprogram order. One of the simplest examples of a control dependence is thedependence of the statements in the “then” part of an ifstatement on the branch.
Control Dependences For example, in the code segment :if p1 {s1}if p2 {s2} S1 is control dependent on p1, and S2 is control dependenton p2 but not on p1.
Control Dependences In general, two constraints are imposed by controldependences: An instruction that is control dependent on a branchcannot be moved before the branch so that its executionis no longer controlled by the branch. For example, we cannot take an instruction fromthe then portion of an if statement and move itbefore the if statement. An instruction that is not control dependent on a branchcannot be moved after the branch so that its executionis controlled by the branch. For example, we cannot take a statement beforethe if statement and move it into the then portion.
Basic Compiler Techniques for Exposing ILP
➢ These techniques are crucial for processors that use static scheduling.
➢ The basic compiler techniques include:
    ➢ scheduling the code,
    ➢ loop unrolling, and
    ➢ reducing branch costs with advanced branch prediction.

Basic Pipeline Scheduling
➢ To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
➢ To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

Basic Pipeline Scheduling
➢ A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.
➢ The latencies of the FP operations used are given below. The last column is the number of intervening clock cycles needed to avoid a stall.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0

Basic Pipeline Scheduling
➢ We assume:
    ➢ the standard five-stage integer pipeline, so that branches have a delay of one clock cycle;
    ➢ the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards;
    ➢ an integer ALU operation latency of 0.

Basic Pipeline Scheduling
➢ Consider the following code segment, which adds a scalar to a vector:

    for (i=999; i>=0; i--)
        x[i] = x[i] + s;

➢ This loop is parallel: the body of each iteration is independent of the others.
➢ The first step is to translate the above segment to MIPS assembly language. In the following code segment:
    ➢ R1 is initially the address of the element in the array with the highest address,
    ➢ F2 contains the scalar value s, and
    ➢ register R2 is precomputed, so that 8(R2) is the address of the last element to operate on.

Basic Pipeline Scheduling
➢ The straightforward MIPS code, not scheduled for the pipeline, looks like this:

    Loop: L.D    F0,0(R1)    ;F0=array element
          ADD.D  F4,F0,F2    ;add scalar in F2
          S.D    F4,0(R1)    ;store result
          DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (per DW)
          BNE    R1,R2,Loop  ;branch if R1!=R2

Basic Pipeline Scheduling
➢ Without any scheduling, the loop will execute as follows:

                             Clock cycle issued
    Loop: L.D    F0,0(R1)    1
          stall              2
          ADD.D  F4,F0,F2    3
          stall              4
          stall              5
          S.D    F4,0(R1)    6
          DADDUI R1,R1,#-8   7
          stall              8
          BNE    R1,R2,Loop  9

Basic Pipeline Scheduling
➢ We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:

                             Clock cycle issued
    Loop: L.D    F0,0(R1)    1
          DADDUI R1,R1,#-8   2
          ADD.D  F4,F0,F2    3
          stall              4
          stall              5
          S.D    F4,8(R1)    6
          BNE    R1,R2,Loop  7

➢ The two stalls after ADD.D are for the use of its result by the S.D.

Basic Pipeline Scheduling
➢ In the previous example, we complete one loop iteration and store back one array element every seven clock cycles.
➢ The actual work of operating on the array element takes just three of those seven clock cycles (the load, add, and store).
➢ The remaining four clock cycles consist of loop overhead (the DADDUI and BNE) and two stalls.
➢ To eliminate these four clock cycles, we need to get more operations relative to the number of overhead instructions.
Loop Unrolling
➢ A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
➢ Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
➢ Loop unrolling can also be used to improve scheduling: because it eliminates the branch, it allows instructions from different iterations to be scheduled together.

Loop Unrolling
➢ If we simply replicated the instructions when we unrolled the loop, the resulting reuse of the same registers could prevent us from effectively scheduling the loop.
➢ Thus, we will want to use different registers for each iteration, increasing the required number of registers.

Loop Unrolling without Scheduling
➢ Here we assume that the number of elements is a multiple of 4.
➢ Note that R2 must now be set so that 32(R2) is the starting address of the last four elements.

Loop Unrolling without Scheduling
➢ We have eliminated three branches and three decrements of R1.
➢ Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall.
➢ This loop will run in 27 clock cycles:
    ➢ each L.D has 1 stall (1 × 4 = 4),
    ➢ each ADD.D has 2 stalls (2 × 4 = 8),
    ➢ the DADDUI has 1 stall (1 × 1 = 1),
    ➢ plus 14 instruction issue cycles,
➢ or 27/4 = 6.75 clock cycles per element.
➢ This can be scheduled to improve performance significantly.
Loop Unrolling with scheduling
Loop Unrolling with Scheduling
➢ The execution time of the unrolled, scheduled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with:
    ➢ 9 cycles per element before any unrolling or scheduling,
    ➢ 7 cycles when scheduled but not unrolled, and
    ➢ 6.75 cycles with unrolling but no scheduling.
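The per-element figures above are just the total cycles per pass divided by the elements processed per pass, which a quick check confirms:

```python
# Cycles per array element for the four loop variants discussed above.
def cycles_per_element(total_cycles, elements_per_pass):
    return total_cycles / elements_per_pass

print(cycles_per_element(9, 1))    # 9.0  : original loop, unscheduled
print(cycles_per_element(7, 1))    # 7.0  : scheduled, not unrolled
print(cycles_per_element(27, 4))   # 6.75 : unrolled 4x, unscheduled
print(cycles_per_element(14, 4))   # 3.5  : unrolled 4x and scheduled
```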
Strip Mining
➢ In real programs we do not usually know the upper bound on the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body.
➢ Instead of a single unrolled loop, we generate a pair of consecutive loops:
    ➢ the first executes (n mod k) times and has a body that is the original loop;
    ➢ the second is the unrolled body surrounded by an outer loop that iterates (n/k) times.
➢ For large values of n, most of the execution time will be spent in the unrolled loop body.
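The two-loop structure can be sketched directly. This is an illustration of the transformation (adding a scalar s, unroll factor k = 4), not compiler output:

```python
# Strip mining sketch: a cleanup loop runs (n mod k) times with the original
# body, then an outer loop iterates n//k times over a body unrolled k-fold.
def add_scalar_strip_mined(x, s, k=4):
    n = len(x)
    i = 0
    for _ in range(n % k):      # first loop: the leftover (n mod k) iterations
        x[i] += s
        i += 1
    for _ in range(n // k):     # second loop: n/k trips through the
        x[i] += s               # unrolled body, k elements per trip
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += k
    return x

print(add_scalar_strip_mined([0, 1, 2, 3, 4, 5, 6], 10))
# [10, 11, 12, 13, 14, 15, 16]
```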
Loop Unrolling
➢ Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
➢ Three different effects limit the gains from loop unrolling:
    (1) a decrease in the amount of overhead amortized with each unroll: if the loop is unrolled 2n times instead of n times, the overhead per iteration is only halved;
    (2) code size limitations: growth in code size may increase the instruction cache miss rate;
    (3) compiler limitations: a shortfall in registers (register pressure).
Branch Prediction
- Loop unrolling is one way to reduce the number of branch hazards.
- We can also reduce the performance losses of branches by predicting how they will behave.
- Branch prediction schemes are of two types:
  - static branch prediction (compile-time branch prediction)
  - dynamic branch prediction

Static Branch Prediction
- It is the simplest scheme, because it does not rely on information about the dynamic history of the executing code.
- It relies on information available at compile time: it predicts the outcome of a branch based solely on the branch instruction, i.e., it uses information gathered before the execution of the program.
- It may use profile information collected from earlier runs.

Dynamic Branch Prediction
- Branches are predicted dynamically based on program behavior: information about taken or not-taken branches gathered at run time is used to predict the outcome of a branch.
- The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table.
- A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory location contains a bit that says whether the branch was recently taken or not.

Dynamic Branch Prediction
- Different branch instructions may have the same low-order bits, so with such a buffer we do not know whether the prediction is correct.
- The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back.
- This simple 1-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken, since the misprediction causes the prediction bit to be flipped.
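The double-misprediction behavior is easy to reproduce with a small simulation of a single 1-bit entry on a loop branch that is taken nine times and then exits:

```python
# 1-bit predictor sketch: one bit per entry, inverted on every misprediction.
def run_1bit(outcomes, initial=True):
    bit, misses = initial, 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
            bit = taken            # wrong hint: invert the stored bit
    return misses

# One loop pass (9 taken, 1 not taken), then the first branch of the next pass:
# the exit mispredicts AND flips the bit, so the re-entry mispredicts too.
print(run_1bit([True] * 9 + [False] + [True]))   # 2 misses
```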
Dynamic Branch Prediction
2-bit prediction scheme:
- To overcome the weakness of the 1-bit prediction scheme, 2-bit prediction schemes are often used.
- In a 2-bit scheme, a prediction must miss twice before it is changed.
- The figure shows the finite-state diagram for a 2-bit prediction scheme.
Dynamic Branch Prediction
Correlating branch predictors:
- The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.
- It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches, rather than just the branch we are trying to predict.
- Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.

Dynamic Branch Prediction
Correlating branch predictors:
Consider the following code:

    if (aa == 2)     // branch b1
        aa = 0;
    if (bb == 2)     // branch b2
        bb = 0;
    if (aa != bb) {  // branch b3
        ....
    }

The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if branches b1 and b2 are both not taken, then branch b3 will be taken.

Dynamic Branch Prediction
Correlating branch predictors:
- A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.
- Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch.
- For example, a (1,2) predictor uses the behavior of the last branch to choose from among a pair of 2-bit branch predictors in predicting a particular branch.

Dynamic Branch Prediction
Correlating branch predictors:
- In the general case, an (m, n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
- The attraction of this type of correlating branch predictor is that it can yield higher prediction rates than the 2-bit scheme and requires only a trivial amount of additional hardware.

Dynamic Branch Prediction
Correlating branch predictors:
- The global history of the most recent m branches can be recorded in an m-bit shift register, where each bit records whether the branch was taken or not taken.
- The branch-prediction buffer can then be indexed using a concatenation of the low-order bits of the branch address with the m-bit global history.

Dynamic Branch Prediction
Correlating branch predictors:
- For example, in a (2,2) buffer with 64 total entries, the 4 low-order address bits of the branch (word address) and the 2 global bits representing the behavior of the two most recently executed branches form a 6-bit index that can be used to index the 64 counters.
- The total number of bits in an (m, n) predictor is:

    2^m × n × Number of prediction entries selected by the branch address

- A 2-bit predictor with no global history is simply a (0,2) predictor.
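The sizing formula is a one-line computation; the (2,2) example above checks out:

```python
# Total bits in an (m, n) predictor:
# 2^m * n * (number of prediction entries selected by the branch address).
def predictor_bits(m, n, entries):
    return (2 ** m) * n * entries

print(predictor_bits(2, 2, 16))    # the (2,2) buffer above: 4 address bits
                                   # select 16 entries -> 128 total bits
print(predictor_bits(0, 2, 4096))  # a plain (0,2) predictor with 4K entries: 8192 bits
```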
Dynamic Branch Prediction
Correlating branch predictors (figure):
Dynamic Branch PredictionTournament Predictors :-Tournament predictors usesmultiple predictors,usually one based on global information andone based on local information, andcombining them with a selector.Tournament predictors can achieve bothbetter accuracy at medium sizes (8K–32K bits) andalso make use of very large numbers of prediction bitseffectively.
Dynamic Branch PredictionTournament Predictors :-Existing tournament predictors use a 2-bit saturating counterper branchto choose among two different predictors based onwhich predictor (local, global, or even some mix) wasmost effective in recent predictions.As in a simple 2-bit predictor,the saturating counter requires two mispredictionsbefore changing the identity of the preferred predictor.
Dynamic Branch Prediction
Tournament Predictors
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch.
Dynamic Branch Prediction
Fig: The misprediction rate for three different predictors on SPEC89 (benchmark) as the total number of bits is increased.
Hardware Based Speculation
Speculation overcomes control dependence by:
• predicting the branch outcome, and
• speculatively executing instructions as if the predictions were correct.
Hardware Based Speculation
Hardware-based speculation combines three key ideas:
1) dynamic branch prediction to choose which instructions to execute,
2) speculation to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence), and
3) dynamic scheduling to deal with the scheduling of different combinations of basic blocks.
Hardware Based Speculation
Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions. This method of executing programs is essentially a data flow execution: operations execute as soon as their operands are available.
Hardware Based Speculation
The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order, and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits. Hence, when we add speculation, we need to separate the process of completing execution from instruction commit, since instructions may finish execution considerably before they are ready to commit.
Hardware Based Speculation
Adding the commit phase to the instruction execution sequence requires an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed. This hardware buffer, the reorder buffer, is also used to pass results among instructions that may be speculated.
Reorder Buffer (ROB)
• The reorder buffer (ROB) provides additional registers.
• The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.
• Hence, the ROB is a source of operands for instructions.
Reorder Buffer (ROB)
• With speculation, the register file is not updated until the instruction commits;
• thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit.
Reorder Buffer (ROB)
Each entry in the ROB contains four fields: the instruction type, the destination field, the value field, and the ready field.
The instruction type field indicates whether the instruction is:
• a branch (which has no destination result),
• a store (which has a memory address destination), or
• a register operation (ALU operation or load, which has a register destination).
Reorder Buffer (ROB)
The destination field supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written.
The value field is used to hold the value of the instruction result until the instruction commits.
The ready field indicates that the instruction has completed execution and the value is ready.
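The four fields above can be sketched as a small record type. This is only an illustration of the slide's description; the field types and the sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    """One reorder-buffer entry with the four fields named on the slide."""
    instr_type: str                 # "branch", "store", or "register op"
    destination: Optional[int]      # register number, or memory address for a store
    value: Optional[int] = None     # result, held here until commit
    ready: bool = False             # True once execution has completed

# When an instruction finishes execution, its result is recorded in the
# entry, which then serves as an operand source for later instructions:
entry = ROBEntry(instr_type="register op", destination=1)
entry.value, entry.ready = 42, True
```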
Basic Structure with H/W Based Speculation
Steps in Execution
There are four steps involved in instruction execution:
1) Issue
2) Execute
3) Write result
4) Commit
Steps in Execution
Issue
• Get an instruction from the instruction queue.
• Issue the instruction if there is an empty reservation station and an empty slot in the ROB; send the operands to the reservation station if they are available in either the registers or the ROB.
• Update the control entries to indicate the buffers are in use.
• The number of the ROB entry allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB (Common Data Bus).
Steps in Execution
Issue
• If either all reservation stations are full or the ROB is full, instruction issue is stalled until both have available entries.
Execute
• If one or more operands are not yet available, monitor the CDB while waiting for them; when all operands are available at a reservation station, execute the operation.
Write Result
• When the result is available, write it on the CDB (with the ROB tag sent when the instruction issued) and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
• Mark the reservation station as available.
Steps in Execution
Write Result
• Special actions are required for store instructions.
• If the value to be stored is available, it is written into the Value field of the ROB entry for the store.
• If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry of the store is updated.
Steps in Execution
Commit
• This is the final stage of completing an instruction, after which only its result remains.
• There are three different sequences of actions at commit, depending on whether the committing instruction is:
• a branch with an incorrect prediction,
• a store, or
• any other instruction (normal commit).
Steps in Execution
Commit
• The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer; at this point, the processor updates the register with the result and removes the instruction from the ROB.
• Committing a store is similar, except that memory is updated rather than a result register.
Steps in Execution
Commit
• When a branch with an incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong. The ROB is flushed and execution is restarted at the correct successor of the branch.
• If the branch was correctly predicted, the branch is finished.
Steps in Execution
• Once an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated, eliminating the need for the ROB entry.
• If the ROB fills, we simply stop issuing instructions until an entry is made free.
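The commit rules above can be sketched as a single function that processes the head of the ROB. This is a toy model, not the hardware: the dictionary keys, return strings, and flush-everything behavior are simplifying assumptions for illustration.

```python
from collections import deque

def commit_step(rob, regfile, memory):
    """Process the entry at the head of the ROB, if it is ready.

    rob     : deque of dicts with keys "type", "dest", "value", "ready"
              (branches also carry "mispredicted")
    regfile : dict mapping register number -> value
    memory  : dict mapping address -> value
    """
    if not rob or not rob[0]["ready"]:
        return None                            # head not finished: wait
    head = rob.popleft()                       # in-order: only the head commits
    if head["type"] == "branch":
        if head["mispredicted"]:
            rob.clear()                        # flush all wrong-path instructions
            return "flush"
        return "branch done"
    if head["type"] == "store":
        memory[head["dest"]] = head["value"]   # memory updated only at commit
    else:
        regfile[head["dest"]] = head["value"]  # register updated only at commit
    return "committed"
```

A correctly predicted branch simply retires, while a mispredicted one discards every younger entry, matching the flush-and-restart behavior described above.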
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. In contrast, a more general method to exploit thread-level parallelism (TLP) is with a multiprocessor that has multiple independent threads operating at once and in parallel. Multithreading, however, does not duplicate the entire processor as a multiprocessor does. Instead, multithreading shares most of the processor core among a set of threads, duplicating only the per-thread state.
contd..
• Duplicating the per-thread state of a processor core means creating a separate register file, a separate PC, and a separate page table for each thread.
• There are three main hardware approaches to multithreading:
1. Fine-grained multithreading switches between threads on each clock, causing the execution of instructions from multiple threads to be interleaved.
2. Coarse-grained multithreading switches threads only on costly stalls, such as level two or three cache misses.
3. Simultaneous multithreading is a variation on fine-grained multithreading that arises naturally when fine-grained multithreading is implemented on top of a multiple-issue, dynamically scheduled processor.
Fig: The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors.
End of Module 2

Basics of a RISC Instruction Set
Load and Store Instructions
• Take a register source called a base register and an immediate field called the offset as operands.
• Their sum is used as a memory address (effective address).
• LD – Load Word; SD – Store Word.
• In the case of a load, a 2nd register operand acts as a destination for the data loaded from memory.
• In the case of a store, the 2nd register operand is the source of the data that is to be stored into memory.
Basics of a RISC Instruction Set
Branches and Jumps
• Branches are conditional transfers of control.
• Two ways of specifying the branch condition in RISC:
(i) with a set of condition bits (called a condition code), or
(ii) by a limited set of comparisons: between a pair of registers, or between a register and zero.
• The branch destination is obtained by adding a sign-extended offset (16 bits in MIPS) to the current PC.
• Unconditional jumps are also provided in MIPS.
A Simple Pipelined Implementation
Focus on a pipeline for an integer subset of a RISC architecture that consists of:
• Load – Store
• Branch
• Integer ALU operations
Every instruction in this subset can be implemented in at most 5 clock cycles. The five pipeline stages are as follows:
1. Instruction Fetch cycle (IF)
2. Instruction Decode / Register Fetch cycle (ID)
3. Execution / Effective Address cycle (EX)
4. Memory Access (MEM)
5. Write Back cycle (WB)
Pipeline Stages
Instruction Fetch Cycle (IF)
• Send the content of the PC to memory and fetch the current instruction from memory.
• Update the PC to the next sequential instruction by adding 4 to the PC (assuming 4-byte instructions).
Instruction Decode / Register Fetch Cycle (ID)
• Decode the instruction and read the registers.
• Decoding is done in parallel with reading the registers. This is possible because the register specifiers are at a fixed location in a RISC architecture. This technique is known as fixed-field decoding.
Pipeline Stages
Execution / Effective Address Cycle (EX)
Performs one of three functions depending on the instruction type:
• Memory reference: the ALU adds the base register and the offset to form the effective address.
• Register–Register ALU instruction: performs the operation.
• Register–Immediate ALU instruction: performs the operation on the value from the register and the sign-extended immediate.
Memory Access (MEM)
• If the instruction is a load, memory does a read using the effective address computed in the previous cycle.
• If it is a store, memory writes the data to the location specified by the effective address.
Pipeline Stages
Write Back Cycle (WB)
• Register–Register ALU instructions or load instructions: write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).
In this implementation:
• Branch instructions require 2 cycles.
• Store instructions – 4 cycles.
• All other instructions – 5 cycles.
Classic Pipeline Stages
• Starts a new instruction on each cycle.
• On each clock cycle another instruction is fetched and begins its 5-cycle execution.
• During each clock cycle, the hardware will be executing some part of five different instructions.
• A single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time.
Classic Pipeline Stages
• Because the register file is used as a source in the ID stage and as a destination in the WB stage, it appears twice.
• It is read in one part of a stage (clock cycle) and written in another part, represented by a solid line and a dashed line.
• IM – Instruction Memory; DM – Data Memory; CC – clock cycle.
Pipeline Registers between successive pipeline stages
Pipeline Registers
To ensure that instructions in different stages of the pipeline do not interfere with one another, a separation is made by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle.
Pipeline Registers
• Pipeline registers prevent interference between two different instructions in adjacent stages of the pipeline.
• The registers also play the critical role of carrying data for a given instruction from one stage to the next.
• The edge-triggered property of the registers is critical (values change instantaneously on a clock edge). Otherwise data from one instruction could interfere with the execution of another.
Basic Performance Issues of Pipelining
• Pipelining increases the CPU instruction throughput (i.e., the number of instructions completed per unit time).
• It does not reduce the execution time of an individual instruction. It usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline.
• The program runs faster, even though no single instruction runs faster.
• The clock can run no faster than the time needed for the slowest pipeline stage.
• Pipeline overhead arises from the combination of pipeline register delay and clock skew.
Basic Performance Issues of Pipelining
• Pipeline registers add setup time (the time that a register input must be stable before the clock signal that triggers a write occurs) plus propagation delay to the clock cycle.
• Clock skew is a phenomenon in synchronous digital circuit systems (such as computer systems) in which the same sourced clock signal arrives at different components at different times.
• The instantaneous difference between the readings of any two clocks is called their skew.
Pipeline Hazards
• Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
• Hazards reduce the performance from the ideal speedup gained by pipelining.
There are 3 classes of hazards:
• Structural hazards – arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
• Data hazards – arise when an instruction depends on the results of a previous instruction.
• Control hazards – arise from the pipelining of branches and other instructions that change the PC.
Pipeline Hazards
• Hazards in the pipeline can make it necessary to stall the pipeline.
• Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed.
• When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.
• Instructions issued earlier than the stalled instruction must continue; otherwise the hazard will never clear.
• As a result, no new instructions are fetched during the stall.
Performance of Pipeline with Stalls
If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two processors can be equal.
Performance of Pipeline with Stalls
• When all instructions take the same number of cycles, this must also equal the number of pipeline stages (also called the depth of the pipeline).
• If there are no pipeline stalls, pipelining can improve performance by the depth of the pipeline (the number of pipeline stages).
Structural Hazards
• Pipelining requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
• If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard.
• Structural hazards arise when some functional unit is not fully pipelined, or when some resource has not been duplicated enough to allow all combinations of instructions.
• Why would a designer allow structural hazards? The primary reason is to reduce the cost of the unit.
Data Hazards
• Occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
• Consider the pipelined execution of the following instructions:
DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11
Data Hazards
• All the instructions after DADD use the result of the DADD instruction.
• DADD writes the value of R1 in the WB pipe stage, but DSUB reads the value during its ID stage. This problem is a data hazard.
• Unless precautions are taken to prevent it, the DSUB instruction will read the wrong value and try to use it.
• AND, which reads R1 during CC4, will receive the wrong value because R1 will be updated at CC5 by DADD.
• XOR operates properly because its register read occurs in CC6, after the register write.
• OR also operates without a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half.
Data Hazards
Minimizing Data Hazard Stalls by Forwarding
• The previous problem can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting).
• The key insight in forwarding is that the result is not really needed by the DSUB until after the DADD actually produces it.
• If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, then the need for a stall can be avoided.
Data Hazards
Forwarding works as follows:
1) The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
2) If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
Data Hazards
Data Hazards Requiring Stalls
• Not all potential data hazards can be handled by bypassing:
LD   R1,0(R2)
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
Branch Hazards
• Control hazards can cause a greater performance loss than data hazards do.
• When a branch is executed, it may or may not change the PC to something other than its current value plus 4.
• If a branch changes the PC to its target address, it is a taken branch. If it falls through, it is not taken, or untaken.
• If instruction i is a taken branch, then the PC is normally not changed until the end of ID, after the completion of address calculation and comparison.
Branch Hazards
The following figure shows that a branch causes a 1-cycle stall in the five-stage pipeline.
Branch Hazards
Reducing Pipeline Branch Penalties
• Software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior.
Four schemes:
1) Freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known.
2) Predicted-not-taken (or predicted-untaken) scheme – implemented by continuing to fetch instructions as if the branch were a normal instruction. If the branch is taken, however, we need to turn the fetched instruction into a no-op and restart the fetch at the target address.
Branch Hazards
3) Predicted-taken scheme – no advantage in this approach for the 5-stage pipeline.
4) Delayed branch:
branch instruction
sequential successor 1
branch target if taken
The sequential successor is in the branch delay slot. This instruction is executed whether or not the branch is taken.
Branch Hazards
Fig: The predicted-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom).
Branch Hazards
The pipeline behavior of the five-stage pipeline with a branch delay is shown in the figure.
Branch Hazards
Performance of Branch Schemes
Pipeline stall cycles from branches = Branch frequency × Branch penalty
The branch frequency and branch penalty can have a component from both unconditional and conditional branches. However, the latter dominate since they are more frequent.
Instruction Level Parallelism
• Pipelining overlaps the execution of instructions to improve performance.
• Pipelining does not reduce the execution time of an instruction, but it reduces the total execution time of the program.
• This potential overlap among instructions is called "Instruction Level Parallelism" (ILP), since the instructions can be evaluated in parallel.
Instruction Level Parallelism
There are two main approaches to exploit ILP:
• An approach that relies on hardware to help discover and exploit parallelism dynamically.
  • Used in the Intel Core series; dominates in the desktop and server market.
• An approach that relies on software technology to find parallelism statically at compile time.
  • Most processors for the PMD (Personal Mobile Device) market use static approaches.
  • However, future processors are using dynamic approaches.
Instruction Level Parallelism
The value of CPI for a pipelined processor is the sum of the base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Ideal pipeline CPI is a measure of the maximum performance attainable by the implementation.
• By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI or, alternatively, increase the IPC (Instructions Per Clock).
Instruction Level Parallelism
• The amount of parallelism available within a basic block is quite small.
• Since these instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be less than the average basic block size.
• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
Instruction Level Parallelism
• The simplest and most common way to increase ILP is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism.
• Consider a simple example of a loop that adds two 1000-element arrays and is completely parallel:
for (i=0; i<=999; i=i+1)
    x[i] = x[i] + y[i];
• Every iteration of the loop can overlap with any other iteration.
• Within each loop iteration there is little or no opportunity for overlap.
Instruction Level Parallelism
• There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism.
• Basically, such techniques work by unrolling the loop, either:
  • statically by the compiler, or
  • dynamically by the hardware.
Data Dependence
• Determining how one instruction depends on another is critical to determining:
  • how much parallelism exists in a program, and
  • how that parallelism can be exploited.
• To exploit ILP we must determine which instructions can be executed in parallel.
• If two instructions are parallel, they can execute simultaneously.
• If two instructions are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.
Bernstein's Conditions for Detection of Parallelism
Bernstein's conditions are based on the following two sets of variables:
i. The read set or input set Ri, which consists of the variables read by instruction Ii.
ii. The write set or output set Wi, which consists of the variables written into by instruction Ii.
Two instructions I1 and I2 can be executed in parallel if they satisfy the following conditions:
R1 ∩ W2 = φ
R2 ∩ W1 = φ
W1 ∩ W2 = φ
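The three conditions above map directly onto set intersections. A minimal sketch (the function name and the register-number encoding of operands are our assumptions):

```python
def can_run_in_parallel(r1, w1, r2, w2):
    """Bernstein's conditions: two instructions may run in parallel only if
    R1 ∩ W2, R2 ∩ W1, and W1 ∩ W2 are all empty."""
    return not (r1 & w2) and not (r2 & w1) and not (w1 & w2)
```

For example, DADD R1,R2,R3 (reads {2, 3}, writes {1}) and DSUB R4,R1,R5 (reads {1, 5}, writes {4}) fail the R2 ∩ W1 test because both involve R1, so they cannot run in parallel.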
Data Dependence
Three different types of dependences:
• data dependences (also called true data dependences),
• name dependences, and
• control dependences.
Data dependences:
• true data dependence (or flow dependence)
• antidependence
• output dependence
An instruction j is data dependent on instruction i if either of the following holds:
• instruction i produces a result that may be used by instruction j, or
• instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
Data Dependence
• Dependences are a property of programs.
• The pipeline organization determines:
  • if a dependence is detected, and
  • if it causes a stall.
• A data dependence conveys 3 things:
  • the possibility of a hazard,
  • the order in which results must be calculated, and
  • an upper bound on how much parallelism can possibly be exploited.
• A dependence can be overcome in two different ways:
  • maintaining the dependence but avoiding a hazard, or
  • eliminating the dependence by transforming the code.
Name Dependence
• Two instructions use the same name but there is no flow of information associated with that name.
• Two types of name dependences between an instruction i that precedes instruction j in program order:
1) Antidependence: instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that instruction i reads the correct value.
2) Output dependence: instruction i and instruction j write the same register or memory location. Ordering must be preserved to ensure that the value finally written corresponds to instruction j.
• To resolve name dependences, we use renaming techniques (register renaming).
Data Hazards
• A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence.
• Because of the dependence, we have to preserve the program order.
• Three types of data hazards:
  • Read after write (RAW)
  • Write after write (WAW)
  • Write after read (WAR)
  • 62.
    Data Hazards Readafter write (RAW) Instruction j tries to read a source before i writes it, so jincorrectly gets the old value. This hazard is the most common type It corresponds to a true data dependentce. Pgm order must be preserved to ensure that j receivesthe value from i . Write After Write (WAW) Instruction j tries to write an operand before it is writtenby i. The writes end up being performed in the wrong order. This corresponds to an output dependence
  • 63.
    Data Hazards WriteAfter Read (WAR) Instruction j tries to write a destination before it is readby i, so i incorrectly gets the new value. This hazard arises from an antidependence. Read After Read (RAR) case is not a hazard.
  • 64.
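The three hazard types follow mechanically from the read/write sets of the two instructions. A minimal sketch (the register names are illustrative):

```python
def hazard(i_reads, i_writes, j_reads, j_writes):
    """Classify hazards between earlier instruction i and later
    instruction j from their sets of read and written locations."""
    kinds = []
    if i_writes & j_reads:
        kinds.append("RAW")   # j reads what i writes (true dependence)
    if i_writes & j_writes:
        kinds.append("WAW")   # both write the same location (output dep.)
    if i_reads & j_writes:
        kinds.append("WAR")   # j writes what i reads (antidependence)
    return kinds or ["none (RAR is not a hazard)"]

# i: ADD.D F0,F2,F2   j: MUL.D F4,F0,F0  -> j reads i's result
print(hazard({"F2"}, {"F0"}, {"F0"}, {"F4"}))   # -> ['RAW']
```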
    Control Dependences Acontrol dependence determines the ordering of aninstruction, i, w.r.t. a branch instruction so that the instructioni executed in correct program order and only when it shouldbe. These control dependences must be preserved to preserveprogram order. One of the simplest examples of a control dependence is thedependence of the statements in the “then” part of an ifstatement on the branch.
  • 65.
    Control Dependences Forexample, in the code segment :if p1 {s1}if p2 {s2} S1 is control dependent on p1, and S2 is control dependenton p2 but not on p1.
  • 66.
    Control Dependences Ingeneral, two constraints are imposed by controldependences: An instruction that is control dependent on a branchcannot be moved before the branch so that its executionis no longer controlled by the branch. For example, we cannot take an instruction fromthe then portion of an if statement and move itbefore the if statement. An instruction that is not control dependent on a branchcannot be moved after the branch so that its executionis controlled by the branch. For example, we cannot take a statement beforethe if statement and move it into the then portion.
  • 67.
Basic Compiler Techniques for Exposing ILP
➢ These techniques are crucial for processors that use static scheduling.
➢ The basic compiler techniques include:
  ➢ Scheduling the code
  ➢ Loop unrolling
  ➢ Reducing branch costs with advanced branch prediction
Basic Pipeline Scheduling
➢ To keep a pipeline full,
  ➢ parallelism among instructions must be exploited by
  ➢ finding sequences of unrelated instructions that can be overlapped in the pipeline.
➢ To avoid a pipeline stall,
  ➢ the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to
  ➢ the pipeline latency of that source instruction.
Basic Pipeline Scheduling
➢ A compiler's ability to perform this scheduling depends both on
  ➢ the amount of ILP available in the program, and
  ➢ on the latencies of the functional units in the pipeline.

  Instruction producing result | Instruction using result | Latency in clock cycles
  FP ALU op                    | Another FP ALU op        | 3
  FP ALU op                    | Store double             | 2
  Load double                  | FP ALU op                | 1
  Load double                  | Store double             | 0

➢ Latencies of the FP operations used are given above.
➢ The last column is the number of intervening clock cycles needed to avoid a stall.
Basic Pipeline Scheduling
➢ We assume
  ➢ the standard five-stage integer pipeline, so that branches have a delay of one clock cycle;
  ➢ the functional units are fully pipelined or replicated (as many times as the pipeline depth),
    ➢ so that an operation of any type can be issued on every clock cycle, and
    ➢ there are no structural hazards;
  ➢ an integer ALU operation latency of 0.
Basic Pipeline Scheduling
➢ Consider the following code segment, which adds a scalar to a vector:
  for (i=999; i>=0; i--)
      x[i] = x[i] + s;
➢ This loop is parallel: the body of each iteration is independent.
➢ The first step is to translate the above segment to MIPS assembly language.
➢ In the following code segment,
  ➢ R1 is initially the address of the element in the array with the highest address, and
  ➢ F2 contains the scalar value s.
  ➢ Register R2 is precomputed, so that 8(R2) is the address of the last element to operate on.
Basic Pipeline Scheduling
➢ The straightforward MIPS code, not scheduled for the pipeline, looks like:

  Loop: L.D    F0,0(R1)    ;F0 = array element
        ADD.D  F4,F0,F2    ;add scalar in F2
        S.D    F4,0(R1)    ;store result
        DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (per DW)
        BNE    R1,R2,Loop  ;branch if R1 != R2
Basic Pipeline Scheduling
➢ Without any scheduling, the loop will execute as follows:

                            Clock cycle issued
  Loop: L.D    F0,0(R1)     1
        stall               2
        ADD.D  F4,F0,F2     3
        stall               4
        stall               5
        S.D    F4,0(R1)     6
        DADDUI R1,R1,#-8    7
        stall               8
        BNE    R1,R2,Loop   9
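The 9-cycle figure can be reproduced from the latency table, as a sketch. One added assumption, consistent with the stall shown before BNE: a branch needs one intervening cycle after the integer ALU op that produces its condition.

```python
# Intervening cycles required between producer and consumer
latency = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,   # assumption: branch condition needed early
}

# (name, type, producer it depends on) for the unscheduled loop body
loop = [
    ("L.D",    "load",    None),
    ("ADD.D",  "fp_alu",  "L.D"),
    ("S.D",    "store",   "ADD.D"),
    ("DADDUI", "int_alu", None),
    ("BNE",    "branch",  "DADDUI"),
]

cycle = 0
issued = {}
for name, typ, dep in loop:
    cycle += 1                            # next issue slot
    if dep is not None:
        prod_cycle, prod_type = issued[dep]
        need = prod_cycle + latency[(prod_type, typ)] + 1
        cycle = max(cycle, need)          # stall until the result is ready
    issued[name] = (cycle, typ)
    print(f"{name:7s} issues in cycle {cycle}")
```

Running this gives issue cycles 1, 3, 6, 7, 9, matching the table above.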
Basic Pipeline Scheduling
➢ We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:

                            Clock cycle issued
  Loop: L.D    F0,0(R1)     1
        DADDUI R1,R1,#-8    2
        ADD.D  F4,F0,F2     3
        stall               4
        stall               5
        S.D    F4,8(R1)     6
        BNE    R1,R2,Loop   7

➢ The two stalls after ADD.D are for its use by the S.D.
Basic Pipeline Scheduling
➢ In the previous example, we complete one loop iteration and store back one array element every seven clock cycles.
➢ The actual work of operating on the array element takes just three (the load, add, and store) of those seven clock cycles.
➢ The remaining four clock cycles consist of
  ➢ loop overhead (the DADDUI and BNE), and
  ➢ two stalls.
➢ To eliminate these four clock cycles, we need to get more operations relative to the number of overhead instructions.
Loop Unrolling
➢ A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
➢ Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
➢ Loop unrolling can also be used to improve scheduling.
➢ Because it eliminates the branch, it allows instructions from different iterations to be scheduled together.
Loop Unrolling
➢ If we simply replicated the instructions when we unrolled the loop,
  ➢ the resulting use of the same registers could prevent us from effectively scheduling the loop.
➢ Thus, we will want to use different registers for each iteration,
  ➢ increasing the required number of registers.
Loop Unrolling without Scheduling
➢ Here we assume that the number of elements is a multiple of 4.
➢ Note that R2 must now be set so that 32(R2) is the starting address of the last four elements.
Loop Unrolling without Scheduling
➢ We have eliminated three branches and three decrements of R1.
➢ Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall.
➢ This loop will run in 27 clock cycles:
  ➢ each L.D has 1 stall (1 x 4 = 4),
  ➢ each ADD.D has 2 stalls (2 x 4 = 8),
  ➢ the DADDUI has 1 stall (1 x 1 = 1),
  ➢ plus 14 instruction issue cycles.
➢ That is 27/4 = 6.75 clock cycles per element.
➢ This can be scheduled to improve performance significantly.
Loop Unrolling with Scheduling
➢ The execution time of the unrolled loop has dropped to a total of 14 clock cycles,
  ➢ or 3.5 clock cycles per element,
➢ compared with
  ➢ 9 cycles per element before any unrolling or scheduling,
  ➢ 7 cycles when scheduled but not unrolled, and
  ➢ 6.75 cycles with unrolling but no scheduling.
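The per-element figures quoted above follow directly from the cycle counts, recomputed here as a quick check:

```python
# total cycles per loop body / elements processed per loop body
cycles_per_element = {
    "unscheduled, not unrolled": 9 / 1,
    "scheduled, not unrolled":   7 / 1,
    "unrolled x4, unscheduled":  27 / 4,
    "unrolled x4, scheduled":    14 / 4,
}
for case, cpe in cycles_per_element.items():
    print(f"{case:27s} {cpe:5.2f} cycles/element")
```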
Strip Mining
➢ In real programs we do not usually know the upper bound on the loop.
➢ Suppose it is n, and we would like to unroll the loop to make k copies of the body.
➢ Instead of a single unrolled loop, we generate a pair of consecutive loops:
  ➢ the first executes (n mod k) times and has a body that is the original loop;
  ➢ the second is the unrolled body surrounded by an outer loop that iterates (n/k) times.
➢ For large values of n, most of the execution time will be spent in the unrolled loop body.
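A sketch of strip mining in Python terms, for the earlier loop `x[i] = x[i] + s` with the unroll factor fixed at k = 4 (to match the 4-way unrolled body):

```python
def add_scalar(x, s):
    """Strip-mined x[i] += s: a cleanup loop of n mod k original
    bodies, then n // k iterations of the 4-way unrolled body."""
    k = 4
    n = len(x)
    i = 0
    for _ in range(n % k):   # first loop: n mod k original bodies
        x[i] += s
        i += 1
    for _ in range(n // k):  # second loop: unrolled body, k copies
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += k
    return x

print(add_scalar([1.0] * 10, 2.0))   # 10 = 2 cleanup + 2 unrolled passes
```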
Loop Unrolling
➢ Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
➢ Three different effects limit the gains from loop unrolling:
  (1) a decrease in the amount of overhead amortized with each unroll:
    ➢ if the loop is unrolled 2n times instead of n times, the overhead per element is only halved;
  (2) code size limitations:
    ➢ growth in code size may increase the instruction cache miss rate;
  (3) compiler limitations: a shortfall in registers (register pressure).
Branch Prediction
Loop unrolling is one way to reduce the number of branch hazards.
We can also reduce the performance losses of branches by predicting how they will behave.
Branch prediction schemes are of two types:
  static branch prediction (or compile-time branch prediction), and
  dynamic branch prediction.
Static Branch Prediction
It is the simplest scheme, because
  it does not rely on information about the dynamic history of code executing;
  it relies on information available at compile time.
It predicts the outcome of a branch based solely on the branch instruction,
  i.e., it uses information that was gathered before the execution of the program,
  such as profile information collected from earlier runs.
Dynamic Branch Prediction
Predicts branches dynamically based on program behavior.
It uses information about taken or not-taken branches gathered at run time to predict the outcome of a branch.
The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table.
A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction.
  The memory location contains a bit that says whether the branch was recently taken or not.
Dynamic Branch Prediction
Different branch instructions may have the same low-order bits, so with such a buffer we don't know whether the prediction is correct.
The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction.
If the hint turns out to be wrong, the prediction bit is inverted and stored back.
This simple 1-bit prediction scheme has a performance shortcoming:
  even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken,
  since the misprediction causes the prediction bit to be flipped.
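The double-misprediction effect is easy to see in a small simulation. Assuming the bit starts as "taken", a 10-iteration loop branch (taken 9 times, then not taken on exit) mispredicts at the exit and again on re-entry:

```python
def run_1bit(outcomes):
    """1-bit predictor: the bit simply remembers the last outcome."""
    state = 1                      # 1 = predict taken (assumed start)
    miss = 0
    for taken in outcomes:
        if (state == 1) != taken:
            miss += 1
        state = 1 if taken else 0  # flip the bit on a misprediction
    return miss

pattern = ([True] * 9 + [False]) * 2   # two executions of the loop
print(run_1bit(pattern))               # -> 3 misses in 20 branches
```

One miss at the first exit, then two per later loop execution (on re-entry and at the exit), even though the branch is taken 90% of the time.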
Dynamic Branch Prediction
2-bit prediction scheme:
To overcome the weakness of the 1-bit prediction scheme, 2-bit prediction schemes are often used.
In a 2-bit scheme, a prediction must miss twice before it is changed.
The figure shows the finite-state diagram for a 2-bit prediction scheme.
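The finite-state machine can be sketched as a saturating counter: states 0 and 1 predict not taken, states 2 and 3 predict taken, so one wrong outcome nudges the counter without flipping the direction. (The starting state of 3 is an assumption.)

```python
def run_2bit(outcomes, state=3):
    """2-bit saturating counter: a prediction must miss twice in a
    row before the predicted direction changes."""
    miss = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            miss += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss

pattern = ([True] * 9 + [False]) * 2
print(run_2bit(pattern))   # -> 2 misses, vs. 3 for the 1-bit scheme
```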
Dynamic Branch Prediction
Correlating branch predictors:
The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.
It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches, rather than just the branch we are trying to predict.
Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
Dynamic Branch Prediction
Correlating branch predictors:
Consider the following code:
  if (aa == 2)      // branch b1
      aa = 0;
  if (bb == 2)      // branch b2
      bb = 0;
  if (aa != bb) {   // branch b3
      ...
  }
The behavior of branch b3 is correlated with the behavior of branches b1 and b2.
If branches b1 and b2 are both not taken, then branch b3 will be taken.
Dynamic Branch Prediction
Correlating branch predictors:
A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.
Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch.
For example, a (1,2) predictor uses the behavior of the last branch to choose from among a pair of 2-bit branch predictors in predicting a particular branch.
Dynamic Branch Prediction
Correlating branch predictors:
In the general case, an (m, n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors,
  each of which is an n-bit predictor for a single branch.
The attraction of this type of correlating branch predictor is that it can yield higher prediction rates than the 2-bit scheme and requires only a trivial amount of additional hardware.
Dynamic Branch Prediction
Correlating branch predictors:
The global history of the most recent m branches can be recorded in an m-bit shift register,
  where each bit records whether the branch was taken or not taken.
The branch-prediction buffer can then be indexed using a concatenation of the low-order bits from the branch address with the m-bit global history.
Dynamic Branch Prediction
Correlating branch predictors:
For example, in a (2, 2) buffer with 64 total entries,
  the 4 low-order address bits of the branch (word address) and
  the 2 global bits representing the behavior of the two most recently executed branches
  form a 6-bit index that can be used to index the 64 counters.
The number of bits in an (m, n) predictor is:
  2^m x n x (number of prediction entries selected by the branch address)
A 2-bit predictor with no global history is simply a (0,2) predictor.
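The sizing formula and the index construction can be sketched directly (function names here are illustrative):

```python
def predictor_bits(m, n, addr_entries):
    """2^m n-bit predictors for each entry selected by the address."""
    return (2 ** m) * n * addr_entries

def predictor_index(branch_addr, history, addr_bits, m):
    """Concatenate the low-order branch address bits with the
    m-bit global history shift register."""
    low = branch_addr & ((1 << addr_bits) - 1)
    return (low << m) | (history & ((1 << m) - 1))

# The (2,2) example: 4 address bits select 16 entries, so the
# predictor holds 2^2 * 2 * 16 = 128 bits, and the index is 6 bits
# wide (0..63), selecting one of the 64 counters.
print(predictor_bits(2, 2, 16))               # -> 128
print(predictor_index(0b101101, 0b11, 4, 2))  # -> 55
```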
Dynamic Branch Prediction
Tournament predictors:
Tournament predictors use multiple predictors,
  usually one based on global information and one based on local information,
  combining them with a selector.
Tournament predictors can achieve both
  better accuracy at medium sizes (8K to 32K bits), and
  effective use of very large numbers of prediction bits.
Dynamic Branch Prediction
Tournament predictors:
Existing tournament predictors use a 2-bit saturating counter per branch
  to choose among two different predictors, based on which predictor (local, global, or even some mix) was most effective in recent predictions.
As in a simple 2-bit predictor,
  the saturating counter requires two mispredictions before changing the identity of the preferred predictor.
Dynamic Branch Prediction
Tournament predictors:
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch.
Dynamic Branch Prediction
Fig: The misprediction rate for three different predictors on SPEC89 (benchmark) as the total number of bits is increased.
Hardware-Based Speculation
Speculation overcomes control dependency by
  predicting the branch outcome, and
  speculatively executing instructions as if the predictions were correct.
Hardware-Based Speculation
Hardware-based speculation combines three key ideas:
  1) dynamic branch prediction, to choose which instructions to execute;
  2) speculation, to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence); and
  3) dynamic scheduling, to deal with the scheduling of different combinations of basic blocks.
Hardware-Based Speculation
Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions.
This method of executing programs is essentially a data flow execution: operations execute as soon as their operands are available.
Hardware-Based Speculation
The key idea behind implementing speculation is to
  allow instructions to execute out of order,
  but force them to commit in order, and
  prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits.
Hence, when we add speculation,
  we need to separate the process of completing execution from instruction commit,
  since instructions may finish execution considerably before they are ready to commit.
Hardware-Based Speculation
Adding the commit phase to the instruction execution sequence
  requires an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed.
This hardware buffer, the reorder buffer, is also used to pass results among instructions that may be speculated.
Reorder Buffer (ROB)
• The reorder buffer (ROB) provides additional registers.
• The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.
• Hence, the ROB is a source of operands for instructions.
Reorder Buffer (ROB)
• With speculation, the register file is not updated until the instruction commits;
• thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit.
Reorder Buffer (ROB)
Each entry in the ROB contains four fields:
  the instruction type,
  the destination field,
  the value field, and
  the ready field.
The instruction type field indicates whether the instruction is
  a branch (which has no destination result),
  a store (which has a memory address destination), or
  a register operation (an ALU operation or load, which has a register destination).
Reorder Buffer (ROB)
The destination field supplies
  the register number (for loads and ALU operations) or
  the memory address (for stores) where the instruction result should be written.
The value field is used to hold the value of the instruction result until the instruction commits.
The ready field indicates that the instruction has completed execution, and the value is ready.
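The four ROB fields map naturally onto a small record; a sketch with illustrative field names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    instr_type: str             # "branch", "store", or "register"
    destination: Optional[str]  # register number or memory address
    value: Optional[int] = None # result, held here until commit
    ready: bool = False         # execution finished, value is valid

# At issue: type and destination are known, value is not yet ready.
entry = ROBEntry("register", "F4")
# At write result: the value arrives (e.g. from the CDB) and ready is set.
entry.value, entry.ready = 42, True
print(entry)
```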
Basic Structure with H/W-Based Speculation
Steps in Execution
There are four steps involved in instruction execution:
  Issue
  Execute
  Write result
  Commit
Steps in Execution
Issue:
  Get an instruction from the instruction queue.
  Issue the instruction if there is an empty reservation station and an empty slot in the ROB;
    send the operands to the reservation station if they are available in either the registers or the ROB.
  Update the control entries to indicate the buffers are in use.
  The number of the ROB entry allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB (Common Data Bus).
Steps in Execution
Issue:
  If either all reservation stations are full or the ROB is full,
    then instruction issue is stalled until both have available entries.
Write result:
  When the result is available,
    write it on the CDB (with the ROB tag sent when the instruction issued), and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
  Mark the reservation station as available.
Steps in Execution
Write result:
  Special actions are required for store instructions.
  If the value to be stored is available,
    it is written into the Value field of the ROB entry for the store.
  If the value to be stored is not available yet,
    the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry of the store is updated.
Steps in Execution
Commit:
  This is the final stage of completing an instruction, after which only its result remains.
  There are three different sequences of actions at commit, depending on whether the committing instruction is
    a branch with an incorrect prediction,
    a store, or
    any other instruction (normal commit).
Steps in Execution
Commit:
  The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer;
    at this point, the processor updates the register with the result and removes the instruction from the ROB.
  Committing a store is similar, except that memory is updated rather than a result register.
Steps in Execution
Commit:
  When a branch with an incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong.
  The ROB is flushed, and execution is restarted at the correct successor of the branch.
  If the branch was correctly predicted, the branch is finished.
Steps in Execution
Once an instruction commits,
  its entry in the ROB is reclaimed, and
  the register or memory destination is updated, eliminating the need for the ROB entry.
If the ROB fills, we simply stop issuing instructions until an entry is made free.
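The three commit cases can be sketched as a single function operating on the ROB head. This is a simplified model, not the full hardware protocol: entries are dicts with the four fields described earlier, plus a mispredict flag for branches.

```python
from collections import deque

def commit_one(rob, regs, mem):
    """Commit the head entry if it is ready; flush on a mispredicted
    branch; otherwise update the register file or memory."""
    if not rob or not rob[0]["ready"]:
        return                        # head not finished: nothing commits
    entry = rob.popleft()             # reclaim the head ROB entry
    if entry["type"] == "branch":
        if entry["mispredicted"]:
            rob.clear()               # discard all speculative work
    elif entry["type"] == "store":
        mem[entry["dest"]] = entry["value"]   # memory updated at commit
    else:                             # register operation (normal commit)
        regs[entry["dest"]] = entry["value"]

regs, mem = {}, {}
rob = deque([
    {"type": "register", "dest": "F4", "value": 7, "ready": True},
    {"type": "store", "dest": 0x1000, "value": 7, "ready": True},
])
commit_one(rob, regs, mem)
commit_one(rob, regs, mem)
print(regs, mem)
```

Note that architectural state (regs, mem) changes only at commit, which is what makes the speculation undoable: flushing the ROB discards results that were never written back.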
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
In contrast, a more general method to exploit thread-level parallelism (TLP) is with a multiprocessor that has multiple independent threads operating at once and in parallel.
Multithreading, however, does not duplicate the entire processor as a multiprocessor does.
Instead, multithreading shares most of the processor core among a set of threads, duplicating only the per-thread state.
contd..
• Duplicating the per-thread state of a processor core means creating a separate register file, a separate PC, and a separate page table for each thread.
• There are three main hardware approaches to multithreading:
  1. Fine-grained multithreading switches between threads on each clock, causing the execution of instructions from multiple threads to be interleaved.
  2. Coarse-grained multithreading switches threads only on costly stalls, such as level two or three cache misses.
  3. Simultaneous multithreading is a variation on fine-grained multithreading that arises naturally when fine-grained multithreading is implemented on top of a multiple-issue, dynamically scheduled processor.
The horizontal dimension represents the instruction execution capability in each clock cycle.
The vertical dimension represents a sequence of clock cycles.
An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle.
The shades of gray and black correspond to four different threads in the multithreading processors.
