Out-of-order execution

From Wikipedia, the free encyclopedia
Computing paradigm to improve computational efficiency
"OOE" redirects here. For other uses, seeOoe (disambiguation).

In computer engineering, out-of-order execution (or more formally dynamic execution) is an instruction scheduling paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units,[1] rather than by their original order in a program.[2][3] In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.[4]
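To illustrate the idea, the following minimal Python sketch (values and variable names are arbitrary, chosen only for illustration) marks which of three instructions depend on one another; an out-of-order core could execute the independent multiplication while the load is still outstanding.

```python
# Hypothetical three-instruction fragment (values are arbitrary). The point is
# the dependence structure, not the arithmetic.
memory = {0x10: 7}       # stand-in for main memory
addr, c = 0x10, 3

a = memory[addr]         # (1) load: on a real core this may miss in cache
b = a + 1                # (2) needs the result of (1), so it must wait
d = c * 2                # (3) independent of (1) and (2): an out-of-order core
                         #     can execute it while the load is still pending
print(b, d)              # 8 6
```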

History

Out-of-order execution is a restricted form of dataflow architecture, which was a major research area in computer architecture in the 1970s and early 1980s.

Early use in supercomputers

Arguably the first machine to use out-of-order execution is the CDC 6600 (1964), which used a scoreboard to resolve conflicts. The 6600, however, lacked WAW conflict handling, choosing instead to stall; this situation was termed a "First Order Conflict" by Thornton.[5] While it had both RAW conflict resolution (termed "Second Order Conflict"[6]) and WAR conflict resolution (termed "Third Order Conflict"[7]), which together are sufficient to declare it capable of full out-of-order execution, the 6600 did not have precise exception handling. An early and limited form of branch prediction was possible as long as the branch targeted locations on what was termed the "Instruction Stack", which was limited to a depth of seven words from the program counter.[8]
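The three conflict classes can be stated mechanically. The following Python sketch is a simplification for illustration only; representing an instruction as a destination register plus a set of source registers is an assumption made here, not Thornton's notation.

```python
def hazard(older, newer):
    """Classify the dependency between two instructions, each given as
    (destination_register, set_of_source_registers)."""
    kinds = []
    if older[0] in newer[1]:
        kinds.append("RAW")   # read-after-write  (Thornton: second-order conflict)
    if newer[0] in older[1]:
        kinds.append("WAR")   # write-after-read  (Thornton: third-order conflict)
    if older[0] == newer[0]:
        kinds.append("WAW")   # write-after-write (Thornton: first-order conflict,
                              #  the case the 6600 scoreboard stalled on)
    return kinds or ["none"]

# r3 = r1 + r2  followed by  r1 = r3 * r4  -> RAW on r3 and WAR on r1
print(hazard(("r3", {"r1", "r2"}), ("r1", {"r3", "r4"})))  # ['RAW', 'WAR']
```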

About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm,[9] which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction that writes to a register rn can be executed before an earlier instruction that uses rn by actually writing into an alternative (renamed) register alt-rn; alt-rn is turned into the normal register rn only once all the earlier instructions addressing rn have executed, and until then rn is given to the earlier instructions while alt-rn serves the later ones addressing rn.
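A rough sketch of that renaming idea in Python, using hypothetical register and mapping names rather than the Model 91's actual hardware structures:

```python
# Each architectural register rn currently maps to some physical register;
# a new write allocates a fresh "alt-rn" so that a later writer need not
# wait for earlier readers of the old rn.
free_list = ["p4", "p5", "p6", "p7"]
rename_map = {"r1": "p1", "r2": "p2", "r3": "p3"}

def rename(dest, sources):
    srcs = [rename_map[r] for r in sources]   # earlier value still visible
    new_phys = free_list.pop(0)               # allocate the "alt-rn"
    rename_map[dest] = new_phys               # later readers see the new name
    return new_phys, srcs

print(rename("r1", ["r1", "r2"]))   # ('p4', ['p1', 'p2']): the write to r1 gets
                                    # p4, so an instruction still reading the old
                                    # r1 (p1) is unaffected -> WAR/WAW removed
```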

In the Model 91 the register renaming is implemented by a bypass termed the Common Data Bus (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out-of-order within the same execution unit, not just between the units as in the 6600. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue feeding each execution unit of the 6600. The Model 91 is also capable of reordering loads and stores to execute before preceding loads and stores,[10] unlike the 6600, which only has a limited ability to move loads past loads and stores past stores, but not loads past stores or stores past loads.[11] Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers.

Precise exceptions

To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches had been developed, as described by James E. Smith and Andrew R. Pleszkun.[12] The CDC Cyber 205 was a precursor: upon a virtual memory interrupt the entire state of the processor (including information on the partially executed instructions) is saved into an invisible exchange package, so that it can resume from the same point of execution.[13] However, to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers, to be restored when an exception necessitates the reverting of instructions.[12] Through simulation, Smith determined that adding a reorder buffer (or history buffer or equivalent) to the Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%.[12] Important academic research in this subject was led by Yale Patt with his HPSm simulator.[14]
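The history-buffer mechanism can be sketched in a few lines. The following Python fragment is illustrative only (the data layout is an assumption here, not the Cyber 990's design): each overwritten register value is saved, and on an exception the entries from the faulting instruction onward are undone in reverse order.

```python
registers = {"r1": 10, "r2": 20}
history = []   # list of (sequence_number, register, old_value)

def execute_write(seq, reg, new_value):
    history.append((seq, reg, registers[reg]))   # save the old value first
    registers[reg] = new_value

def recover(faulting_seq):
    """Restore the precise state as of the faulting instruction."""
    while history and history[-1][0] >= faulting_seq:
        _, reg, old = history.pop()
        registers[reg] = old

execute_write(1, "r1", 11)
execute_write(2, "r2", 21)   # suppose instruction 2 raises an exception
recover(2)
print(registers)             # {'r1': 11, 'r2': 20}: instruction 1 kept,
                             # instruction 2 undone -> the exception is precise
```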

In the 1980s many early RISC microprocessors had out-of-order writeback to the registers, invariably resulting in imprecise exceptions. The Motorola 88100 was one of the few early microprocessors that did not suffer from imprecise exceptions despite out-of-order writes, although it did allow both precise and imprecise floating-point exceptions.[15] Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.

Decoupling

Smith also researched how to make different execution units operate more independently of each other and of the memory, front end, and branching.[16] He implemented those ideas in the Astronautics ZS-1 (1988), featuring a decoupling of the integer/load/store pipeline from the floating-point pipeline, allowing inter-pipeline reordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper, he opined that enforcing precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, as it even permits virtual memory. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent the stalling of the front end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming.[10] A similar decoupled architecture had been used a bit earlier in the Culler 7.[17] The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches.

Research comes to fruition

With the POWER1 (1990), IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only of the floating-point registers) with precise exceptions. It uses a physical register file (i.e. a dynamically remapped file holding both uncommitted and committed values) instead of a reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named the program counter stack by IBM) to undo changes to the count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; because POWER1 cannot reorder floating-point arithmetic instructions (results become available in order), their destination registers are not renamed. POWER1 also does not have the reservation stations needed for out-of-order use of the same execution unit.[18][19] The next year IBM's ES/9000 model 900 added register renaming for the general-purpose registers. It also has reservation stations with six entries for the dual integer unit (each cycle, up to two of the six instructions can be selected and executed) and six entries for the FPU. Other units have simple FIFO queues. The reordering distance is up to 32 instructions.[20] The A19 of Unisys' A-series of mainframes was also released in 1991 and was claimed to have out-of-order execution; one analyst called the A19's technology three to five years ahead of the competition.[21][22]

Wide adoption

The first superscalar single-chip processors (Intel i960CA in 1989) used simple scoreboard scheduling like the CDC 6600 had a quarter of a century earlier. In 1992–1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw out-of-order execution proliferate down to personal computers. The Motorola 88110 (1992) used a history buffer to revert instructions.[23] Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance.[24][25][26] The PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched-instruction queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count registers could be renamed.[32] In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution and featured a reordering distance of up to 14 micro-operations.[33] The PowerPC 603 renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry reorder buffer lets no more than four instructions overtake an unexecuted instruction. Due to a store buffer, a load can access the cache ahead of a preceding store.[34][35]

PowerPC 604 (1995) was the first single-chip processor with execution-unit-level reordering, as three of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The reorder buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the reordering of loads and stores upon cache misses.[36] HAL SPARC64 (1995) exceeded the reordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for the integer, floating-point, and address generation units, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a reordered state at a time.[37][38] Pentium Pro (1995) introduced a unified reservation station, whose 20-micro-op capacity permitted very flexible reordering, backed by a 40-entry reorder buffer. Loads can be reordered ahead of both loads and stores.[39]

The practically attainable per-cycle rate of execution rose further as full out-of-order execution was adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year Cyrix 6x86 and AMD K5 brought advanced reordering techniques into mainstream personal computers. Since DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/Intel Itanium 2 and IBM POWER6, though the latter had an out-of-order floating-point unit.[40] The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III/IV and IBM's mainframes, which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively.[citation needed]

Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First, Qualcomm's Scorpion (reordering distance of 32) shipped in Snapdragon,[41] and a bit later Arm's A9 succeeded the A8. For low-end x86 personal computers, the in-order Bonnell microarchitecture in early Intel Atom processors was first challenged by AMD's Bobcat microarchitecture, and in 2013 was succeeded by the out-of-order Silvermont microarchitecture.[42] Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost, and size, in-order execution is still prevalent in microcontrollers and embedded systems, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations.

Basic concept

Background

Out-of-order execution is more sophisticated than the baseline of in-order execution. In pipelined in-order processors, the execution of instructions overlaps in a pipelined fashion, with each instruction requiring multiple clock cycles to complete. The consequence is that results from a previous instruction will lag behind where they may be needed by the next. In-order execution still has to keep track of these dependencies, but its approach is quite unsophisticated: stall, every time. Out-of-order execution uses much more sophisticated data-tracking techniques, as described below.

In-order processors

In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:

  1. Instruction fetch.
  2. If input operands are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they must be fetched from memory), the processor stalls until they are available.
  3. The instruction is executed by the appropriate functional unit.
  4. The functional unit writes the results back to the register file.

Often, an in-order processor has a bit vector recording which registers will be written to by a pipeline.[43] If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.
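A toy version of this bit-vector check in Python, with hypothetical register names rather than any particular processor's scoreboard format:

```python
# One pending-write bit per architectural register.
pending_write = {f"r{i}": False for i in range(8)}

def must_stall(sources):
    # Stall if any source register still has its pending-write bit set (RAW hazard).
    return any(pending_write[r] for r in sources)

pending_write["r1"] = True               # an earlier instruction will write r1
print(must_stall(["r1", "r2"]))          # True: must wait for r1
print(must_stall(["r4", "r5"]))          # False: could proceed, but an in-order
                                         # core still cannot issue it ahead of
                                         # the stalled instruction above
```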

Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:[44]

  1. Instruction fetch.
  2. Instruction decoding.
  3. Instruction renaming.
  4. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
  5. The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
  6. The instruction is issued to the appropriate functional unit and executed by that unit.
  7. The results are queued.
  8. The result is written back to the register file only after all older instructions have had their results written back. This is called the graduation or retire stage.

The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the processor avoids the stall that occurs in step 2 of the in-order processor when the instruction is not completely ready to be processed due to missing data.

Out-of-order processors fill these slots in time with other instructions that are ready, then either reorder the results at the end to make it appear that the instructions were processed as normal, record and later apply the original program order, or commit in uninterruptible batches where the order will not cause data corruption. The way the instructions are ordered in the original computer code is known as program order; in the processor they are handled in data order, the order in which the data become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output.
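The contrast between the two orders can be shown with a deliberately simplified Python scheduler sketch; the latencies, instruction format, and single shared "ready" set are assumptions for illustration, not a model of any real core. Instructions issue in data order, while retirement would follow program order.

```python
program = [                      # (name, destination, sources, latency)
    ("load r1", "r1", [],     3),    # long-latency load (e.g. cache miss)
    ("add r2",  "r2", ["r1"], 1),    # depends on the load
    ("mul r3",  "r3", ["r4"], 1),    # independent of the load
]
ready = {"r4"}                   # operand values available at the start
in_flight = {}                   # name -> cycles until the result is ready
issue_order, done = [], set()

while len(done) < len(program):
    # results returning from the execution units
    for name in list(in_flight):
        in_flight[name] -= 1
        if in_flight[name] == 0:
            dest = next(i[1] for i in program if i[0] == name)
            ready.add(dest)
            done.add(name)
            del in_flight[name]
    # issue any waiting instruction whose sources are all available (data order)
    for name, dest, srcs, lat in program:
        if name not in issue_order and all(s in ready for s in srcs):
            issue_order.append(name)
            in_flight[name] = lat

print(issue_order)               # ['load r1', 'mul r3', 'add r2']  (data order)
print([i[0] for i in program])   # retirement would follow this program order
```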

The benefit of out-of-order processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have theoretically processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was decoupled architecture. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.

The fetch and decode stages are separated from the execute stage in a pipelined processor by using a buffer. The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grained parallelism between the two.[45] In doing so, it effectively hides all memory latency from the processor's perspective.
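A toy illustration of such a decoupling buffer, with an arbitrary capacity of four entries (the sizes and policies here are assumptions, not taken from any particular design): the front end keeps fetching until the buffer fills, and the back end drains it when it is ready.

```python
from collections import deque

FETCH_BUFFER_SIZE = 4            # arbitrary size, for illustration only
buffer = deque()

def fetch(instruction):
    """Front end: keep fetching until the buffer is full."""
    if len(buffer) < FETCH_BUFFER_SIZE:
        buffer.append(instruction)
        return True
    return False                 # the front end stalls only on a full buffer

def issue():
    """Back end: drain the buffer whenever an execution slot is free."""
    return buffer.popleft() if buffer else None

for i in range(6):
    print(f"fetch I{i}:", fetch(f"I{i}"))   # I4 and I5 are refused (buffer full)
print("issued:", issue())                   # 'I0' leaves, freeing a slot
```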

A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing the effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favor a multi-threaded design approach.

Decoupled architectures are generally thought of as not useful for general-purpose computing as they do not handle control-intensive code well.[46] Control-intensive code includes such things as nested branches that occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word (VLIW) architectures.[47]

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions. The results queue allows programs to be restarted after an exception and for the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions. The ability to issue instructions past branches that have yet to be resolved is known as speculative execution.
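A minimal sketch of that discard step, with a hypothetical entry format (sequence number, destination, value): queued results younger than a mispredicted branch are dropped before they can reach the register file.

```python
results_queue = [        # (sequence_number, destination_register, value)
    (10, "r1", 42),      # older than the branch: safe to keep
    (11, "r2", 7),       # issued speculatively past the branch
    (12, "r3", 99),
]

def squash_younger_than(branch_seq):
    """Discard queued results of instructions younger than the mispredicted
    branch (or faulting instruction) before they reach the register file."""
    return [entry for entry in results_queue if entry[0] <= branch_seq]

print(squash_younger_than(10))   # [(10, 'r1', 42)] -- the architectural state
                                 # never sees r2 or r3, so the program can be
                                 # restarted down the correct path
```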

Micro-architectural choices

Are the instructions dispatched to a centralized queue or to multiple distributed queues?

IBM PowerPC processors use queues that are distributed among the different functional units, while other out-of-order processors use a centralized queue. IBM uses the term reservation stations for their distributed queues.

Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight.

Early Intel out-of-order processors use a results queue called a reorder buffer,[a] while most later out-of-order processors use register maps.[b]
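A minimal sketch of the register-map alternative, using hypothetical names: results go straight into physical registers, and the "queueing" reduces to remembering, per in-flight instruction, the previous mapping of its destination so that the map can be rolled back.

```python
rename_map = {"r1": "p1", "r2": "p2"}   # architectural -> physical register
inflight = []            # (sequence_number, dest, previous_physical_register)

def dispatch(seq, dest, new_phys):
    inflight.append((seq, dest, rename_map[dest]))   # remember the old mapping
    rename_map[dest] = new_phys

def roll_back(to_seq):
    """Undo mappings of instructions younger than `to_seq` (e.g. after a
    misprediction), restoring the committed view of the registers."""
    while inflight and inflight[-1][0] > to_seq:
        _, dest, old_phys = inflight.pop()
        rename_map[dest] = old_phys

dispatch(20, "r1", "p7")
dispatch(21, "r2", "p8")
roll_back(20)            # undo instruction 21 only
print(rename_map)        # {'r1': 'p7', 'r2': 'p2'}
```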

See also

The Wikibook Microprocessor Design has a page on the topic of: Out Of Order Execution

Notes

  1. ^ Intel P6 family microprocessors have both a reorder buffer (ROB) and a register alias table (RAT). The ROB was motivated mainly by branch misprediction recovery. The Intel P6 family was among the earliest out-of-order microprocessors but was supplanted by the NetBurst architecture. Years later, NetBurst proved to be a dead end due to its long pipeline, which assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues, and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the Core and Nehalem microarchitectures.
  2. ^ The succeeding Sandy Bridge, Ivy Bridge, and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ reordering techniques from the EV6 and the P4, but with a somewhat shorter pipeline.[48][49]

References

  1. ^ Kukunas, Jim (2015). Power and Performance: Software Analysis and Optimization. Morgan Kaufman. p. 37. ISBN 9780128008140.
  2. ^ "Out-of-order execution" (PDF). cs.washington.edu. 2006. Retrieved 2014-01-17. "don't wait for previous instructions to execute if this instruction does not depend on them"
  3. ^ "The Centennial Celebration". Regis High School. 2011-03-14. Retrieved 2022-06-25. The algorithm "allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially" (also known as out of order execution).[dead link]
  4. ^ Kozierok, Charles M. (April 17, 2001). "Out-of-order Execution". The PC Guide. Archived from the original on 2019-02-18. Retrieved 2014-01-17. "This flexibility improves performance since it allows execution with less 'waiting' time."
  5. ^ Thornton (1970), p. 125.
  6. ^ Thornton (1970), p. 126.
  7. ^ Thornton (1970), p. 127.
  8. ^ Thornton (1970), pp. 112, 123.
  9. ^ Tomasulo, Robert Marco (1967). "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" (PDF). IBM Journal of Research and Development. 11 (1): 25–33. CiteSeerX 10.1.1.639.7540. doi:10.1147/rd.111.0025. S2CID 8445049. Archived from the original (PDF) on 2018-06-12.
  10. ^ a b Smith, James E. (July 1989). "Dynamic Instruction Scheduling and the Astronautics ZS-1" (PDF). Computer. 22 (7): 21–35. doi:10.1109/2.30730. S2CID 329170.
  11. ^ Thornton (1970), pp. 48–50.
  12. ^ a b c Smith, James E.; Pleszkun, Andrew R. (June 1985). "Implementation of precise interrupts in pipelined processors". 12th ISCA.
     (Expanded version published in May 1988 as Implementing Precise Interrupts in Pipelined Processors.)
  13. ^ Moudgill, Mayan; Vassiliadis, Stamatis (January 1996). "On Precise Interrupts". p. 18. CiteSeerX 10.1.1.33.3304. Archived from the original (PDF) on 13 October 2022.
  14. ^ Hwu, W.; Patt, Yale N. (1986). HPSm, a high performance restricted data flow architecture having minimal functionality. ACM. pp. 297–306. ISBN 978-0-8186-0719-6. Retrieved 2013-12-06.
  15. ^ "MC88100 RISC Microprocessor - User's Manual, Second Edition" (PDF). www.bitsavers.org.
  16. ^ Smith, James E. (November 1984). "Decoupled Access/Execute Computer Architectures" (PDF). ACM Transactions on Computer Systems. 2 (4): 289–308. doi:10.1145/357401.357403. S2CID 13903321.
  17. ^ Smotherman, Mark. "Culler-7". Clemson University.
  18. ^ Grohoski, Gregory F. (January 1990). "Machine organization of the IBM RISC System/6000 processor" (PDF). IBM Journal of Research and Development. 34 (1): 37–58. doi:10.1147/rd.341.0037. Archived from the original (PDF) on January 9, 2005.
  19. ^ Smith, James E.; Sohi, Gurindar S. (December 1995). "The Microarchitecture of Superscalar Processors" (PDF). Proceedings of the IEEE. 83 (12): 1617. doi:10.1109/5.476078.
  20. ^ Liptay, John S. (July 1992). "Design of the IBM Enterprise System/9000 high-end processor" (PDF). IBM Journal of Research and Development. 36 (4): 713–731. doi:10.1147/rd.364.0713. Archived from the original (PDF) on January 17, 2005.
  21. ^ Ziegler, Bart (March 7, 1991). "Unisys Unveils 'Top Gun' Mainframe Computers". AP News.
  22. ^ "Unisys' New Mainframe Leaves Big Blue In The Dust". Bloomberg. March 25, 1991. "The new A19 relies on 'super-scalar' techniques from scientific computers to execute many instructions concurrently. The A19 can overlap as many as 140 operations, more than 10 times as many as conventional mainframes can."
  23. ^ Ullah, Nasr; Holle, Matt (March 1993). "The MC88110 Implementation of Precise Exceptions in a Superscalar Architecture" (PDF). ACM SIGARCH Computer Architecture News. 21. Motorola Inc.: 15–25. doi:10.1145/152479.152482. S2CID 7036627.
  24. ^ Smotherman, Mark (29 April 1994). "Motorola MC88110 Overview".
  25. ^ Diefendorff, Keith; Allen, Michael (April 1992). "Organization of the Motorola 88110 superscalar RISC microprocessor" (PDF). IEEE Micro. 12 (2): 40–63. doi:10.1109/40.127582. S2CID 25668727. Archived from the original (PDF) on 2022-10-21.
  26. ^ Smotherman, Mark; Chawla, Shuchi; Cox, Stan; Malloy, Brian (December 1993). "Instruction scheduling for the Motorola 88110". Proceedings of the 26th Annual International Symposium on Microarchitecture. pp. 257–262. doi:10.1109/MICRO.1993.282761. ISBN 0-8186-5280-2. S2CID 52806289.
  27. ^ "PowerPC™ 601 RISC Microprocessor Technical Summary" (PDF). Retrieved 23 October 2022.
  28. ^ Moore, Charles R.; Becker, Michael C.; et al. (September 1993). "The PowerPC 601 microprocessor". IEEE Micro. 13 (5).
  29. ^ Diefendorff, Keith (August 1993). "PowerPC 601 Microprocessor" (PDF). Hot Chips.
  30. ^ Smith, James E.; Weiss, Shlomo (June 1994). "PowerPC 601 and Alpha 21064: A Tale of Two RISCs" (PDF). IEEE Computer. 27 (6): 46–58. doi:10.1109/2.294853. S2CID 1114841.
  31. ^ Sima, Dezsö (September–October 2000). "The design space of register renaming techniques". IEEE Micro. 20 (5): 70–83. CiteSeerX 10.1.1.387.6460. doi:10.1109/40.877952. S2CID 11012472.
  32. ^ [27][28][29][30][31]
  33. ^ Gwennap, Linley (28 March 1994). "NexGen Enters Market with 66-MHz Nx586" (PDF). Microprocessor Report. Archived from the original (PDF) on 2 December 2021.
  34. ^ Burgess, Brad; Ullah, Nasr; Van Overen, Peter; Ogden, Deene (June 1994). "The PowerPC 603 microprocessor". Communications of the ACM. 37 (6): 34–42. doi:10.1145/175208.175212. S2CID 34385975.
  35. ^ "PowerPC™ 603 RISC Microprocessor Technical Summary" (PDF). Retrieved 27 October 2022.
  36. ^ Song, S. Peter; Denman, Marvin; Chang, Joe (October 1994). "The PowerPC 604 RISC microprocessor" (PDF). IEEE Micro. 14 (5): 8. doi:10.1109/MM.1994.363071. S2CID 11603864.
  37. ^ "SPARC64+: HAL's Second Generation 64-bit SPARC Processor" (PDF). Hot Chips.
  38. ^ "Le Sparc64". Research Institute of Computer Science and Random Systems (in French).
  39. ^ Gwennap, Linley (16 February 1995). "Intel's P6 Uses Decoupled Superscalar Design" (PDF). Microprocessor Report.
  40. ^ Le, Hung Q.; et al. (November 2007). "IBM POWER6 microarchitecture" (PDF). IBM Journal of Research and Development. 51 (6).
  41. ^ Mallia, Lou. "Qualcomm High Performance Processor Core and Platform for Mobile Applications" (PDF). Archived from the original (PDF) on 29 October 2013.
  42. ^ Anand Lal Shimpi (2013-05-06). "Intel's Silvermont Architecture Revealed: Getting Serious About Mobile". AnandTech. Archived from the original on December 22, 2016.
  43. ^ "Inside the Minor CPU model: Scoreboard". 2017-06-09. Retrieved 2023-01-09.
  44. ^ González, Antonio; Latorre, Fernando; Magklis, Grigorios (2011). "Processor Microarchitecture". Synthesis Lectures on Computer Architecture. doi:10.1007/978-3-031-01729-2. ISBN 978-3-031-00601-2. ISSN 1935-3235.
  45. ^ Smith, J. E. (1984). "Decoupled access/execute computer architectures". ACM Transactions on Computer Systems. 2 (4): 289–308. CiteSeerX 10.1.1.127.4475. doi:10.1145/357401.357403. S2CID 13903321.
  46. ^ Kurian, L.; Hulina, P. T.; Coraor, L. D. (1994). "Memory latency effects in decoupled architectures" (PDF). IEEE Transactions on Computers. 43 (10): 1129–1139. doi:10.1109/12.324539. S2CID 6913858. Archived from the original (PDF) on 2018-06-12.
  47. ^ Dorojevets, M. N.; Oklobdzija, V. (1995). "Multithreaded decoupled architecture". International Journal of High Speed Computing. 7 (3): 465–480. doi:10.1142/S0129053395000257.
  48. ^ Kanter, David (2010-09-25). "Intel's Sandy Bridge Microarchitecture".
  49. ^ "The Haswell Front End - Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel". Archived from the original on October 7, 2012.
