In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in its Pentium III series of central processing units (CPUs) shortly after the appearance of Advanced Micro Devices' (AMD's) 3DNow!. SSE contains 70 new instructions (65 unique mnemonics[1] using 70 encodings), most of which work on single-precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.
Intel's first IA-32 SIMD effort was the MMX instruction set. MMX had two main problems: it re-used the existing x87 floating-point registers, making the CPUs unable to work on both floating-point and SIMD data at the same time, and it only worked on integers. SSE floating-point instructions operate on a new independent register set, the XMM registers, and SSE also adds a few integer instructions that work on MMX registers.
SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3 and SSE4. Because SSE supports floating-point math, it had wider applications than MMX and became more popular. The addition of integer support in SSE2 made MMX largely redundant, though further performance increases can be attained in some situations by using MMX in parallel with SSE operations.
SSE was originally called Katmai New Instructions (KNI), Katmai being the code name for the first Pentium III core revision. During the Katmai project Intel sought to distinguish it from its earlier product line, particularly its flagship Pentium II. It was later renamed Internet Streaming SIMD Extensions (ISSE[2]), then SSE.
Shortly afterwards, with the release of the original Athlon in August 1999, AMD added a subset of SSE consisting of 19 instructions, called new MMX instructions[3] and also known under several variants and combinations of the names SSE and MMX, or otherwise as Integer SSE (ISSE, not to be confused with Internet Streaming SIMD Extensions, an early name for SSE) (see 3DNow! extensions). AMD eventually added full support for SSE instructions (sometimes referred to as 3DNow! Professional) starting with its Athlon XP (Corvette and Palomino cores) and Duron (Morgan core) processors.
SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD added a further eight registers, XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.
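The MXCSR register can be inspected from C through compiler intrinsics. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler on x86 with the `<xmmintrin.h>` header; the function names are illustrative and not part of any standard API. It reads the register and checks the exception-mask and rounding-control fields.

```c
#include <xmmintrin.h>

/* Returns the raw 32-bit MXCSR value (compiles to STMXCSR). */
unsigned int read_mxcsr(void)
{
    return _mm_getcsr();
}

/* Returns nonzero if all six SSE floating-point exceptions are masked. */
int sse_exceptions_all_masked(void)
{
    return _MM_GET_EXCEPTION_MASK() == _MM_MASK_MASK;
}

/* Returns nonzero if the rounding-control field selects round-to-nearest. */
int sse_rounds_to_nearest(void)
{
    return _MM_GET_ROUNDING_MODE() == _MM_ROUND_NEAREST;
}
```

In a process that has not modified MXCSR, both checks are expected to succeed, since the register's reset value masks every exception and selects round-to-nearest.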

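SSE used only a single data type for XMM registers: four 32-bit single-precision floating-point numbers.
SSE2 would later expand the usage of the XMM registers to include two 64-bit double-precision floating-point numbers, two 64-bit integers, four 32-bit integers, eight 16-bit short integers, or sixteen 8-bit bytes or characters.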
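As a rough illustration of the original SSE data type, the sketch below packs four single-precision floats into one 128-bit value and copies the lanes back to memory. It assumes a GCC- or Clang-compatible compiler on x86; `pack_and_unpack` is a hypothetical helper name chosen for this example.

```c
#include <xmmintrin.h>

/* Packs four floats into one 128-bit XMM value, then stores the
   four lanes back to memory in lane order. */
void pack_and_unpack(float x, float y, float z, float w, float out[4])
{
    /* _mm_set_ps takes the highest lane first: w | z | y | x */
    __m128 v = _mm_set_ps(w, z, y, x);

    /* Unaligned store: out[0] receives the lowest lane (x),
       out[3] the highest lane (w). */
    _mm_storeu_ps(out, v);
}
```

The `__m128` type occupies 16 bytes, matching the 128-bit width of an XMM register.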
Because these 128-bit registers are additional machine state that the operating system must preserve across task switches, they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the FXSAVE and FXRSTOR instructions, an extended pair of instructions that can save and restore all x87, MMX and SSE register state at once. This support was quickly added to all major IA-32 operating systems.
The first CPU to support SSE, the Pentium III, shared execution resources between SSE and the floating-point unit (FPU).[2] While a compiled application can interleave FPU and SSE instructions side-by-side, the Pentium III will not issue an FPU and an SSE instruction in the same clock cycle. This limitation reduces the effectiveness of pipelining, but the separate XMM registers do allow SIMD and scalar floating-point operations to be mixed without the performance hit from explicit MMX/floating-point mode switching.
SSE introduced both scalar and packed floating-point instructions.
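The difference between the scalar and packed forms can be sketched with the corresponding C intrinsics. This example assumes a GCC- or Clang-compatible compiler on x86; `add_scalar_vs_packed` is a hypothetical name used only for illustration.

```c
#include <xmmintrin.h>

/* Adds a and b twice: once with the scalar form (ADDSS) and once
   with the packed form (ADDPS), storing both results. */
void add_scalar_vs_packed(const float a[4], const float b[4],
                          float scalar_out[4], float packed_out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* ADDSS: only lane 0 is added; lanes 1..3 are copied from va. */
    _mm_storeu_ps(scalar_out, _mm_add_ss(va, vb));

    /* ADDPS: all four lanes are added independently. */
    _mm_storeu_ps(packed_out, _mm_add_ps(va, vb));
}
```

With inputs {1, 2, 3, 4} and {10, 20, 30, 40}, the scalar result is {11, 2, 3, 4} and the packed result is {11, 22, 33, 44}.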
Floating-point operations are IEEE 754-1985 compliant, with the exception of RSQRTSS, which is not specified in the standard.
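RSQRTSS returns an approximation of the reciprocal square root rather than a correctly rounded result; Intel documents a maximum relative error of 1.5 × 2^-12. The sketch below, assuming a GCC- or Clang-compatible compiler on x86, compares the approximation with a value computed from SQRTSS and DIVSS. The function names are hypothetical.

```c
#include <xmmintrin.h>

/* Approximate 1/sqrt(x) using RSQRTSS. */
float rsqrt_approx(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* Reference 1/sqrt(x) using the IEEE-compliant SQRTSS and DIVSS. */
float rsqrt_reference(float x)
{
    __m128 one = _mm_set_ss(1.0f);
    return _mm_cvtss_f32(_mm_div_ss(one, _mm_sqrt_ss(_mm_set_ss(x))));
}

/* Relative error of the approximation against the reference, for x > 0. */
float rsqrt_relative_error(float x)
{
    float ref = rsqrt_reference(x);
    float diff = rsqrt_approx(x) - ref;
    if (diff < 0.0f) {
        diff = -diff;
    }
    return diff / ref;
}
```

The exact bits returned by RSQRTSS can differ between processor models, so code that needs reproducible results across machines may prefer the slower reference form.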
Floating-point instructions
Data movement, scalar: MOVSS
Data movement, packed: MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS, MOVMSKPS
Arithmetic, scalar: ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
Arithmetic, packed: ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
Compare, scalar: CMPSS, COMISS, UCOMISS
Compare, packed: CMPPS
Data shuffle and unpacking, packed: SHUFPS, UNPCKHPS, UNPCKLPS
Data-type conversion, scalar: CVTSI2SS, CVTSS2SI, CVTTSS2SI
Data-type conversion, packed: CVTPI2PS, CVTPS2PI, CVTTPS2PI
Bitwise logical operations, packed: ANDPS, ORPS, XORPS, ANDNPS

Integer instructions
Arithmetic: PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW
Data movement: PEXTRW, PINSRW
Other: PMOVMSKB, PSHUFW

Other instructions
MXCSR management: LDMXCSR, STMXCSR
Cache and memory management: MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA, SFENCE

The following simple example demonstrates the advantage of using SSE. Consider an operation like vector addition, which is used very often in computer graphics applications. Adding two single-precision, four-component vectors together using x86 requires four floating-point addition instructions.
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
This corresponds to four x86 FADD instructions in the object code. On the other hand, as the following pseudo-code shows, a single 128-bit 'packed-add' instruction can replace the four scalar addition instructions.
movaps xmm0, [v1]       ; xmm0 = v1.w | v1.z | v1.y | v1.x
addps  xmm0, [v2]       ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0  ; vec_res = xmm0
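The same packed addition can be written in C with compiler intrinsics instead of assembly. This is a minimal sketch, assuming a GCC- or Clang-compatible compiler on x86; the `vec4` type and `vec4_add` function are hypothetical names introduced for this example. It uses unaligned loads and stores (MOVUPS) rather than MOVAPS, so the vectors do not have to be 16-byte aligned.

```c
#include <xmmintrin.h>

/* Hypothetical four-component vector; c[0..3] hold x, y, z, w. */
typedef struct {
    float c[4];
} vec4;

/* Adds two four-component vectors with a single packed add (ADDPS). */
vec4 vec4_add(const vec4 *v1, const vec4 *v2)
{
    vec4 res;
    __m128 a = _mm_loadu_ps(v1->c);  /* load x, y, z, w of v1 */
    __m128 b = _mm_loadu_ps(v2->c);  /* load x, y, z, w of v2 */
    _mm_storeu_ps(res.c, _mm_add_ps(a, b));
    return res;
}
```

If the data is known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` correspond to the MOVAPS form used in the assembly listing.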
popcnt instruction (population count: count the number of bits set to 1, used extensively e.g. in cryptography), and more.
The following programs can be used to determine which, if any, versions of SSE are supported on a system.
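One way to perform such a check from C is to query the CPUID instruction directly. The sketch below assumes GCC or Clang on x86, whose `<cpuid.h>` header provides `__get_cpuid`; the `sse_support` struct and `detect_sse` function are hypothetical names. It reads the feature bits of CPUID leaf 1, which report what the processor implements, not whether the operating system has enabled saving of the XMM state.

```c
#include <cpuid.h>

/* Hypothetical result record: each field is 1 if the feature is reported. */
typedef struct {
    int sse, sse2, sse3, ssse3, sse4_1, sse4_2, popcnt;
} sse_support;

/* Queries CPUID leaf 1 and decodes the SSE-related feature bits. */
sse_support detect_sse(void)
{
    sse_support s = {0, 0, 0, 0, 0, 0, 0};
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is unavailable. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        s.sse    = (edx >> 25) & 1;  /* EDX bit 25: SSE    */
        s.sse2   = (edx >> 26) & 1;  /* EDX bit 26: SSE2   */
        s.sse3   = (ecx >> 0)  & 1;  /* ECX bit 0:  SSE3   */
        s.ssse3  = (ecx >> 9)  & 1;  /* ECX bit 9:  SSSE3  */
        s.sse4_1 = (ecx >> 19) & 1;  /* ECX bit 19: SSE4.1 */
        s.sse4_2 = (ecx >> 20) & 1;  /* ECX bit 20: SSE4.2 */
        s.popcnt = (ecx >> 23) & 1;  /* ECX bit 23: POPCNT */
    }
    return s;
}
```

On any x86-64 processor the `sse` and `sse2` fields are expected to be 1, because both extensions are part of the 64-bit baseline.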