
Simultaneous multithreading

From Wikipedia, the free encyclopedia

Efficiency improving technique for superscalar CPUs

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

Details

The term multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.). Although running on the same core, they are completely separated from each other. Multithreading is similar in concept to preemptive multitasking but is implemented at the thread level of execution in modern superscalar processors.

Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading (also known as super-threading). In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads is decided by the chip designers. Two concurrent threads per CPU core are common, but some processors support many more.^[1]

Because SMT inevitably increases contention for shared resources, measuring or agreeing on its effectiveness can be difficult. However, measurements of the energy efficiency of SMT with parallel native and managed workloads on historical 130 nm to 32 nm Intel SMT (Hyper-Threading) implementations found that in the 45 nm and 32 nm implementations, SMT is extremely energy efficient, even with in-order Atom processors.^[2] In modern systems, SMT effectively exploits concurrency with very little additional dynamic power. That is, even when performance gains are minimal, the power consumption savings can be considerable.^[2]

Some researchers have shown that the extra threads can be used proactively to seed a shared resource such as a cache in order to improve the performance of another single thread, and claim this shows that SMT does more than increase efficiency. Others use SMT to provide redundant computation, for some level of error detection and recovery.^[citation needed]

Nevertheless, in most current cases, SMT is about hiding stalls during high-latency activities such as memory access. By making more use of existing resources, it increases both efficiency and the throughput of computations per amount of hardware used.
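The stall-hiding effect can be illustrated with a toy model. The following is a minimal sketch, not a description of any real processor: the function name `slot_utilization` and all of its parameters (issue width, per-thread instruction-level parallelism, burst length, stall length) are assumptions chosen for illustration. Each thread issues a short burst of instructions and then stalls, as if waiting on memory; the core fills its issue slots from whichever threads are ready.

```python
def slot_utilization(num_threads, issue_width=4, ilp=2, burst=4, stall=6, cycles=1000):
    """Return the fraction of issue slots used in a toy SMT core.

    Hypothetical model: every thread can supply at most `ilp` instructions
    per cycle, and stalls for `stall` cycles after issuing `burst`
    instructions (for example, on a cache miss).
    """
    ready_at = [0] * num_threads          # first cycle each thread may issue again
    issued_in_burst = [0] * num_threads   # progress through the current burst
    used = 0
    for cycle in range(cycles):
        free = issue_width
        for t in range(num_threads):
            if free == 0:
                break
            if cycle < ready_at[t]:
                continue                  # thread is stalled this cycle
            n = min(ilp, free, burst - issued_in_burst[t])
            free -= n
            used += n
            issued_in_burst[t] += n
            if issued_in_burst[t] == burst:
                # Burst finished: the thread now waits on a long-latency event.
                issued_in_burst[t] = 0
                ready_at[t] = cycle + 1 + stall
    return used / (issue_width * cycles)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "thread(s):", slot_utilization(n))
```

In this model, a single thread leaves most issue slots empty, both because it cannot fill the full issue width and because it spends most cycles stalled. Adding hardware threads fills those slots without adding execution resources, which is the efficiency argument made above.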

Taxonomy

In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to exploit instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).

Superscalar means executing multiple instructions at the same time while thread-level parallelism (TLP) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:

Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grained multithreading or coarse-grained multithreading depending on the frequency of interleaved issues. Fine-grained multithreading, such as in a barrel processor, issues instructions for different threads after every cycle, while coarse-grained multithreading only switches to issue instructions from another thread when the currently executing thread causes a long-latency event (such as a page fault). Coarse-grained multithreading is the more common of the two because it requires fewer context switches between threads. For example, Intel's Montecito processor uses coarse-grained multithreading, while Sun's UltraSPARC T1 uses fine-grained multithreading. For processors that have only one pipeline per core, interleaved multithreading is the only possible way, because such a core can issue at most one instruction per cycle.
Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so.
Chip-level multiprocessing (CMP or multicore): Integrates two or more processors into one chip, each executing threads independently.
Any combination of multithreaded/SMT/CMP.

The key to distinguishing them is how many instructions the processor can issue in one cycle and how many threads those instructions come from. For example, Sun Microsystems' UltraSPARC T1 is a multicore processor combined with fine-grained multithreading rather than simultaneous multithreading, because each core can only issue one instruction at a time.
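The distinguishing rule can be written down as a small decision function. This is only a sketch of the taxonomy as described here; the function name `classify` and its parameters are hypothetical, and the example figures passed to it are illustrative rather than taken from any datasheet.

```python
def classify(cores, threads_per_core, issue_width, threads_issuing_per_cycle):
    """Classify a design by the taxonomy above (illustrative only).

    cores                     -- processor cores on the chip
    threads_per_core          -- hardware thread contexts per core
    issue_width               -- instructions a core can issue per cycle
    threads_issuing_per_cycle -- distinct threads that can issue in one cycle
    """
    if threads_per_core == 1:
        per_core = "single-threaded"
    elif threads_issuing_per_cycle > 1:
        # Issuing from several threads in one cycle needs a superscalar core.
        if issue_width < 2:
            raise ValueError("SMT requires a superscalar core")
        per_core = "SMT"
    else:
        # Several contexts, but only one thread issues in any given cycle.
        per_core = "interleaved (temporal) multithreading"
    return f"CMP + {per_core}" if cores > 1 else per_core


# A multicore chip whose cores each issue one instruction per cycle:
print(classify(cores=8, threads_per_core=4, issue_width=1, threads_issuing_per_cycle=1))
# A single wide core issuing from two threads in the same cycle:
print(classify(cores=1, threads_per_core=2, issue_width=4, threads_issuing_per_cycle=2))
```

The first call describes a design in the style of the UltraSPARC T1 and is classified as CMP combined with interleaved multithreading, not SMT, because only one thread can issue per cycle.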

Historical implementations

While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project.^[3] The first major commercial microprocessor developed with SMT was the Alpha 21464 (EV8). This microprocessor was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Henry Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP acquired Compaq, which had in turn acquired DEC. Dean Tullsen's work was also used to develop the hyper-threaded versions of the Intel Pentium 4 microprocessors, such as the "Northwood" and "Prescott".

Modern commercial implementations

x86/x86-64

The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002, and Intel has since introduced it into a number of its processors. Intel calls the functionality Hyper-Threading Technology, and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement^[4] compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent; when running two programs that each require the full attention of the processor, one or both of the programs can actually appear to slow down slightly when Hyper-Threading is turned on.^[5] This is due to the replay system of the Pentium 4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, TLBs, and re-order buffer entries, and equalizing the processor resources between the two programs, which adds a varying amount of execution time. The Pentium 4 Prescott core gained a replay queue, which reduces the execution time needed for the replay system. This was enough to completely overcome that performance hit.^[6]

The Intel Atom, first released in 2008, is the first Intel product to feature 2-way SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.

Intel Xeon Phi (2010–2020) has 4-way SMT (with time-multiplexed multithreading) with hardware-based threads which cannot be disabled, unlike regular Hyper-Threading.^[7]

The AMD Bulldozer microarchitecture (2011) uses two-thread "modules". In each module there are two separate integer cores, but the FlexFPU and L2 cache are shared, so it is only a partial SMT implementation.^[8]^[9]

AMD's Zen family of microarchitectures has 2-way SMT. Most resources in a Zen 5 core are competitively shared in SMT, allowing the active thread to take all resources (or "most" in the case of watermarked resources). The statically partitioned exceptions are the micro-op queue, the retirement queue, and the FPU non-scheduling queue.^[10]

MIPS

The latest Imagination Technologies MIPS architecture designs include an SMT system known as "MIPS MT".^[11] MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, was the first MIPS vendor to provide a processor SoC based on eight cores, each of which runs four threads. The threads can be run in fine-grained mode, where a different thread can be executed each cycle. The threads can also be assigned priorities. Imagination Technologies MIPS CPUs have two SMT threads per core.

POWER/PowerPC/Power ISA

IBM's Blue Gene/Q has 4-way SMT.

The IBM POWER5, announced in May 2004, comes as either a dual-core dual-chip module (DCM), or a quad-core or oct-core multi-chip module (MCM), with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with eight cores, each having four Simultaneous Intelligent Threads. Each core switches its threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput. IBM POWER8 has 8 intelligent simultaneous threads per core (SMT8).
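The mode switching described for POWER7 can be pictured with a very small sketch. The policy below, which picks the smallest threading mode that covers the currently runnable software threads, is an assumption made for illustration; it is not IBM's documented algorithm, and the name `pick_smt_mode` is hypothetical.

```python
def pick_smt_mode(runnable_threads, modes=(1, 2, 4)):
    """Pick a per-core threading mode for the given number of runnable threads.

    Assumed policy (illustrative only): use the smallest supported mode that
    can hold all runnable threads, so that fewer threads each get more of the
    core, and fall back to the widest mode when there are more threads than
    hardware contexts.
    """
    for mode in sorted(modes):
        if runnable_threads <= mode:
            return mode
    return max(modes)


for n in (1, 2, 3, 6):
    print(n, "runnable ->", "SMT" + str(pick_smt_mode(n)))
```

With one runnable thread the sketch selects single-thread mode, favouring response time; with three or more it selects four-thread mode, favouring throughput.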

IBM Z

IBM Z, starting with the z13 processor in 2015, has two threads per core (SMT-2).

SPARC

Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock" (originally announced in 2005, but after many delays cancelled in 2010) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara does not actually use SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has eight cores, but each core has only one pipeline, so it actually uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round-robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different: it has more complex cores that have more than one pipeline.
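The round-robin policy of issuing from the next active thread each cycle can be sketched as follows. This is a simplified illustration, not Sun's implementation: the function name `round_robin_issue` is hypothetical, and the sets of active threads per cycle are supplied by the caller rather than derived from a pipeline model.

```python
def round_robin_issue(active_by_cycle, num_threads):
    """Return, for each cycle, the thread that issues (or None if none can).

    active_by_cycle -- one set of active (non-stalled) thread ids per cycle
    num_threads     -- number of hardware thread contexts in the core
    """
    schedule = []
    last = num_threads - 1            # so that thread 0 is considered first
    for active in active_by_cycle:
        chosen = None
        # Scan forward from the thread after the last one that issued.
        for step in range(1, num_threads + 1):
            candidate = (last + step) % num_threads
            if candidate in active:
                chosen = candidate
                break
        if chosen is not None:
            last = chosen
        schedule.append(chosen)       # at most one instruction per cycle
    return schedule


# Four contexts; thread 1 is stalled for the whole window.
print(round_robin_issue([{0, 2, 3}] * 5, num_threads=4))
```

At most one thread issues per cycle, which is what separates this scheme from SMT: stalled threads are skipped, but two threads never share the same issue cycle.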

The Oracle Corporation SPARC T3 has eight fine-grained threads per core; SPARC T4, SPARC T5, SPARC M5, M6 and M7 have eight fine-grained threads per core, of which two can be executed simultaneously.

Fujitsu SPARC64 VI has coarse-grained Vertical Multithreading (VMT); SPARC64 VII and newer have 2-way SMT.

Other instruction set architectures

In the Intel Itanium line (IA-64, 2001–2020), Montecito uses coarse-grained multithreading, while Tukwila and newer processors use 2-way SMT (with dual-domain multithreading).

The VISC architecture (2016)^[12]^[13]^[14]^[15] uses the Virtual Software Layer (translation layer) to dispatch a single thread of instructions to the Global Front End, which splits instructions into virtual hardware threadlets that are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single-threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic at a near-single-cycle latency level (1–4 cycles, depending on the change in allocation and on individual application needs). Therefore, if two virtual cores are competing for resources, there are appropriate algorithms in place to determine what resources are to be allocated where.

Disadvantages

SMT introduces sharing of resources. Some sharing schemes (e.g. static division) do not allow one thread to take more of one resource even when another thread is not using it, creating a potential bottleneck for performance compared to the non-SMT case.^[16] There is also a potential fairness issue. Newer SMT implementations try to minimize the occurrence of these problems with a combination of static partitioning, competitive sharing, and competitive sharing with watermarking, the latter being a combination of both.^[17]
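The difference between the three sharing schemes can be made concrete with a small sketch. It assumes a two-thread core and a single queue-like resource; the function name `grant`, the capacity of 64 entries and the watermark of 56 entries are hypothetical values chosen for illustration, not figures from any real design.

```python
def grant(scheme, capacity, demand, other_in_use, num_threads=2, watermark=None):
    """Return how many entries of a shared structure one thread may hold.

    scheme       -- "static", "competitive" or "watermarked"
    capacity     -- total entries in the structure
    demand       -- entries this thread would like to use
    other_in_use -- entries currently held by the other thread
    watermark    -- per-thread cap, used by the "watermarked" scheme only
    """
    if scheme == "static":
        # Fixed equal share, even if the other thread is idle.
        limit = capacity // num_threads
    elif scheme == "competitive":
        # Anything the other thread is not using is available.
        limit = capacity - other_in_use
    elif scheme == "watermarked":
        if watermark is None:
            raise ValueError("watermarked sharing needs a watermark")
        # Competitive, but one thread can never take more than the cap.
        limit = min(watermark, capacity - other_in_use)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return max(0, min(demand, limit))


# One busy thread, other thread idle, 64-entry structure:
for scheme in ("static", "competitive", "watermarked"):
    print(scheme, grant(scheme, capacity=64, demand=64, other_in_use=0, watermark=56))
```

With the other thread idle, static division leaves half the structure unused, competitive sharing lets the busy thread take all of it, and the watermarked variant lets it take most but not all.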

Critics argue that it is a considerable burden on software developers to have to test whether simultaneous multithreading is good or bad for their application in various situations and to insert extra logic to turn it off if it decreases performance. As of 2009, operating systems lacked convenient API calls for this purpose and for preventing processes with different priorities from taking resources from each other.^[18] The cross-platform hwloc library is available to detect the presence of SMT as well as NUMA setups, both of which often require consideration from the programmer. Prime95 is one program that uses hwloc: by default it uses the additional SMT threads when doing integer trial division, but only uses one thread per core for its usual floating-point-heavy operations.^[19]
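As an illustration of what detecting SMT involves, the sketch below reads the CPU topology files that Linux exposes under sysfs. It does not use hwloc, it is Linux-specific, and the helper names `parse_cpu_list` and `max_threads_per_core` are hypothetical; on other systems, or where the files are absent, it simply reports that the answer is unknown.

```python
import glob


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,4" or "0-3" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def max_threads_per_core(
    pattern="/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list",
):
    """Return the largest number of hardware threads sharing one core.

    Returns None when no topology files can be read (for example, on a
    non-Linux system). A result greater than 1 suggests SMT is active.
    """
    sizes = []
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                sizes.append(len(parse_cpu_list(handle.read())))
        except OSError:
            continue
    return max(sizes) if sizes else None
```

A program could call `max_threads_per_core()` at start-up and, in the manner described for Prime95, choose between one worker per logical CPU and one worker per physical core depending on the workload.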

Security concern

There are also security concerns with certain simultaneous multithreading implementations, arising from bugs and side-channel information leaks. Intel's Hyper-Threading in NetBurst-based processors has a vulnerability through which one application can steal a cryptographic key from another application running on the same processor by monitoring its cache use.^[20] Sophisticated machine-learning-based exploits of the Hyper-Threading implementation were also presented at Black Hat 2018.^[21]

References

1. "The First Direct Mesh-to-Mesh Photonic Fabric" (PDF). Archived from the original (PDF) on 2024-02-08. Retrieved 2024-02-08.
2. ASPLOS'11 (see Esmaeilzadeh et al. 2011 under General).
3. Smotherman, Mark (25 May 2011). "End of IBM ACS Project". School of Computing, Clemson University. Retrieved January 19, 2013.
4. Marr, Deborah (February 14, 2002). "Hyper-Threading Technology Architecture and Microarchitecture" (PDF). Intel Technology Journal. 6 (1): 4. doi:10.1535/itj. Archived from the original (PDF) on 24 October 2016. Retrieved 25 September 2015.
5. "CPU performance evaluation Pentium 4 2.8 and 3.0". Archived from the original on 2021-02-24. Retrieved 2011-04-22.
6. "Replay: Unknown Features of the NetBurst Core. Page 15". Replay: Unknown Features of the NetBurst Core. xbitlabs.com. Archived from the original on 14 May 2011. Retrieved 24 April 2011.
7. Barth, Michaela; Byckling, Mikko; Ilieva, Nevena; Saarinen, Sami; Schliephake, Michael (18 February 2014). Weinberg, Volker (ed.). "Best Practice Guide Intel Xeon Phi v1.1". Partnership for Advanced Computing in Europe. Archived from the original on 3 May 2017. Retrieved 22 November 2016.
8. "AMD Bulldozer Family Module Multithreading". wccftech. July 2013. Archived from the original on 2013-10-17. Retrieved 2013-07-22.
9. Halfacree, Gareth (28 October 2010). "AMD unveils Flex FP". bit-tech.
10. "NO COMPROMISE: DRIVING SERVER PERFORMANCE AND EFFICIENCY WITH AMD EPYC™ AND SMT" (PDF). April 2025.
11. "MIPS MT ASE description". Imagination Technologies.
12. "Soft Machines unveils VISC virtual chip architecture | bit-tech.net".
13. Cutress, Ian (12 February 2016). "Examining Soft Machines' Architecture: An Element of VISC to Improving IPC". AnandTech. Archived from the original on February 13, 2016.
14. "Next Gen Processor Performance Revealed". VR World. February 4, 2016. Archived from the original on 2017-01-13.
15. "Architectural Waves". Soft Machines. 2017. Archived from the original on 2017-03-29.
16. "Replay: Unknown Features of the NetBurst Core. Page 15". Replay: Unknown Features of the NetBurst Core. xbitlabs.com. Archived from the original on 14 May 2011. Retrieved 24 April 2011.
17. "Simultaneous Multithreading: Driving Performance and Efficiency on AMD EPYC CPUs". 2025.
18. "How good is hyperthreading?". 2009. Archived from the original on 2025-06-17.
19. Prime95 version 30.19, program dialog "Option/Resource Limits/Advanced".
20. Hyper-Threading Considered Harmful.
21. TLBleed: When Protecting Your CPU Caches is Not Enough.

General

Shar, Leonard E.; Davidson, Edward S. (February 1974). "A multiminiprocessor system implemented through pipelining". Computer. 7 (2): 42–51. Bibcode:1974Compr...7b..42S. doi:10.1109/MC.1974.6323457. S2CID 27957358.
Tullsen, D.M.; Eggers, S.J.; Levy, H.M. (1995). "Simultaneous multithreading: Maximizing on-chip parallelism". 22nd Annual International Symposium on Computer Architecture. IEEE. pp. 392–403. ISBN 978-0-89791-698-1.
Tullsen, D.M.; Eggers, S.J.; Emer, J.S.; Levy, H.M.; Lo, J.L.; Stamm, R.L. (1996). "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor". 23rd Annual International Symposium on Computer Architecture. IEEE. p. 191. doi:10.1145/232973.232993. ISBN 978-0-89791-786-5. S2CID 1402376.
Esmaeilzadeh, H.; Cao, T.; Yang, X.; Blackburn, S.M.; McKinley, K.S. (2011). "Looking back on the language and hardware revolutions: measured power, performance, and scaling" (PDF). ASPLOS XVI Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems. ACM. pp. 319–332. doi:10.1145/1950365.1950402. ISBN 978-1-4503-0266-1. S2CID 6845129.

External links

SMT news articles and academic papers
SMT research at the University of Washington
Smotherman, Mark (November 2007). "Timeline of multithreading technologies". School of Computing, Clemson University.
