SIMD within a register (SWAR), also known as "packed SIMD",[1] is a technique for performing parallel operations on data contained in a processor register. SIMD stands for single instruction, multiple data.
Many modern general-purpose computer processors have some provisions for SIMD, in the form of a group of registers and instructions to make use of them. SWAR refers to the use of those registers and instructions, as opposed to using specialized processing engines designed to be better at SIMD operations. It also refers to the use of SIMD with general-purpose registers and instructions that were not meant to do it at the time, by way of various novel software tricks.[3]
A SWAR architecture is one that includes instructions explicitly intended to perform parallel operations across data that is stored in the independent subwords or fields of a register. A SWAR-capable architecture is one that includes a set of instructions that is sufficient to allow data stored in these fields to be treated independently even though the architecture does not include instructions that are explicitly intended for that purpose.
One of the first historic examples, operational in 1958, was the Lincoln Laboratory TX-2, which had 36-bit wide memory and instructions that could operate in the ALU on one 36-bit word, or on two 18-bit or four 9-bit sub-words. The TX-2 pre-dates the invention of the term SIMD.[4]
An early well-known example of a SWAR architecture was the Intel Pentium with MMX, which implemented the MMX extension set. The original Intel Pentium, by contrast, did not include such instructions, but could still act as a SWAR-capable architecture through careful hand-coding or compiler techniques.
Early SWAR architectures include DEC Alpha MVI, Hewlett-Packard's PA-RISC MAX, Silicon Graphics Incorporated's MIPS MDMX, and Sun's SPARC V9 VIS. Like MMX, many of the SWAR instruction sets are intended for faster video coding.[5]
Wesley A. Clark introduced partitioned subword data operations in the 1950s.[citation needed] This can be seen as a very early predecessor to SWAR. Leslie Lamport presented SWAR techniques in his paper titled "Multiple byte processing with full-word instructions"[6] in 1975.
With the introduction of Intel's MMX multimedia instruction set extensions in 1996, desktop processors with SIMD parallel processing capabilities became common. Early on, these instructions could only be used via hand-written assembly code.
In the fall of 1996, Professor Hank Dietz was the instructor for the undergraduate Compiler Construction course at Purdue University's School of Electrical and Computer Engineering. For this course, he assigned a series of projects in which the students would build a simple compiler targeting MMX. The input language was a subset dialect of MasPar's MPL called NEMPL (Not Exactly MPL).
During the course of the semester, it became clear to the course teaching assistant, Randall (Randy) Fisher, that there were a number of issues with MMX that would make it difficult to build the back-end of the NEMPL compiler. For example, MMX has an instruction for multiplying 16-bit data but not multiplying 8-bit data. The NEMPL language did not account for this problem, allowing the programmer to write programs that required 8-bit multiplies.
Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's VIS, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called new media applications. These extensions had significant differences in the precision of data and types of instructions supported.
Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programmer to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Hank's office in the MSEE building at Purdue University.[7] It refers to this form of parallel processing, architectures that are designed to natively perform this type of processing, and the general-purpose programming model that is Fisher's dissertation.
The problem of compiling for these widely varying architectures was discussed in a paper presented at LCPC98.[5]
SWAR processing has been used in image processing,[8] cryptographic pairings,[9] raster processing,[10] computational fluid dynamics,[11] and communications.[12]
SWAR techniques can be used even on systems without special hardware support. Logical operations act bitwise, so they operate on each bit of a register independently. Using addition and subtraction is more difficult, but can be useful if care is taken to avoid unwanted carry propagation between lanes. Except for this carry propagation, one 64-bit addition or subtraction is the same as performing eight 8-bit additions or subtractions.
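As an illustration (a sketch added here, with names chosen for this example rather than taken from the sources), the following C function adds eight 8-bit lanes held in two 64-bit words. The high bit of each lane is handled separately so that a carry can never cross a lane boundary:

#include <stdint.h>

/* Sketch: add eight packed 8-bit lanes without carries crossing lanes.
   A carry out of bit 6 stays inside its own lane because both high bits
   have been cleared; the high bits are then folded back in without carry. */
uint64_t swar_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080;   /* high bit of each lane */
    uint64_t low = (a & ~H) + (b & ~H);      /* add the low 7 bits of each lane */
    return low ^ ((a ^ b) & H);              /* add the high bits modulo 2 */
}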
Probably the archetypical example of SWAR techniques is finding the population count of (number of bits set in) a register. The register is treated successively as a series of 1-bit, 2-bit, 4-bit, etc. fields.
To begin with, note that the population count of a 1-bit field is simply the field itself. To find the population count of a 2-bit field, sum the population counts of its two constituent 1-bit fields. This can be done in parallel for 32 2-bit fields in a 64-bit value x:
x2 := (x & 0x5555555555555555) + ((x >> 1) & 0x5555555555555555);
The hexadecimal constant 0x5 is binary 0101₂, which isolates the even-numbered bits. The addition cannot overflow each 2-bit field, as the maximum possible sum is 2.
This can be repeated to combine 2-bit fields into 4-bit fields. Here, we use a mask of binary 0011₂, or hexadecimal 0x3, to isolate pairs of bits:
x4 := (x2 & 0x3333333333333333) + ((x2 >> 2) & 0x3333333333333333);
Now each 4-bit field contains a count from 0 to 4. Because a 4-bit field can contain a value up to 15, no overflow is possible when adding two 4-bit population counts, allowing the masking to be done after the addition, rather than once per addend:
x8 := (x4 + (x4 >> 4)) & 0x0f0f0f0f0f0f0f0f;
At this point, the 8-bit fields can hold values up to 255, so there is no need for further masking until the very end:
x16 := x8 + (x8 >> 8);
x32 := x16 + (x16 >> 16);
x64 := x32 + (x32 >> 32);
population_count := x64 & 0xff;
There are several well-known variants on this. In particular, the last three shift-and-add steps can be combined into
population_count = (x8 * 0x0101010101010101) >> 56;
Three stages of shifting and adding require 6 instructions, each with a data dependency on the previous one, and so take at least 6 clock cycles. A multiplication can usually be performed faster. When operating on 32-bit words, only two shift-and-add stages remain, so the advantage is less clear, as a 3-cycle multiply is common.
A second variant is a change to the first step. Rather than combining the two bits b1 and b0 in each 2-bit field by adding them, consider the initial value of the 2-bit field as 2·b1 + b0. Subtracting b1 from this will produce the desired sum, with only one masking operation:
x2 := x - ((x >> 1) & 0x5555555555555555);
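Putting the steps together (a sketch; the function name and the choice of the multiply variant are illustrative, not prescribed by the technique):

#include <stdint.h>

/* Sketch: 64-bit population count assembled from the SWAR steps above. */
unsigned popcount64(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555;                               /* 2-bit counts */
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);  /* 4-bit counts */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;                         /* 8-bit counts */
    return (unsigned)((x * 0x0101010101010101) >> 56);               /* sum of all byte counts */
}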
It is common to search a character string for a null terminator. Doing this one byte at a time is inefficient when a 64-bit processor can operate on 8 bytes at a time.
The same technique can be used to search for pathname separators or other delimiters, by exclusive-oring with the target byte value first.
Some architectures include special instructions for byte comparisons; for example, the DEC Alpha included a CMPBGE instruction for performing 8 byte compares at once. However, searching for a zero byte can be done without any special support.
One way would be to OR together the 8 bits of each byte, in a manner much like the bit-counting example above:
x2 = x | x << 1;
x4 = x2 | x2 << 2;
x8 = x4 | x4 << 4;
byte_map = ~x8 & 0x8080808080808080;
This results in a byte_map with a 1 bit in the most significant bit of any byte that was originally zero.
However, this can be done more quickly by taking advantage of carry propagation using arithmetic operations. Adding 0x7f (binary 01111111₂) to each byte causes a carry into bit 7 if the low 7 bits are non-zero. The challenge is to ensure the carry propagation stops at bit 7 and does not affect other bytes. This can be done by working on the low 7 bits and the high bit of each byte separately. First, extract the low 7 bits of each byte by ANDing with 0x7f before adding 0x7f:
x7 = (x & 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f;
Then combine with the most-significant bits:
x8 = x7 | x;
This value will have the msbit of each 8-bit field set to 1 if that byte is non-zero. Finally:
byte_map = ~(x8 | 0x7f7f7f7f7f7f7f7f);
will set all the unwanted low bits in each byte, then complement everything, leaving only 1 bits anywhere the corresponding input byte is zero. (This is equivalent to ~x8 & 0x80...80, but uses the same constant value.) If there are no 1 bits, the search can continue with the following word. If there are any 1 bits, the length of the string can be computed from their positions.
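Assembled into a single routine (a sketch; the name zero_byte_map is chosen here for illustration):

#include <stdint.h>

/* Sketch: return a word with the most significant bit set in every byte
   of x that is zero, using the carry technique described above. */
uint64_t zero_byte_map(uint64_t x)
{
    const uint64_t L7 = 0x7f7f7f7f7f7f7f7f;
    uint64_t x7 = (x & L7) + L7;   /* msbit set where the low 7 bits are non-zero */
    uint64_t x8 = x7 | x;          /* msbit set where the whole byte is non-zero */
    return ~(x8 | L7);             /* keep only the msbits of bytes that were zero */
}

To search for a delimiter rather than a zero byte, the word can first be exclusive-ored with the delimiter replicated into every byte (for example, delim * 0x0101010101010101), as noted above.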
If the goal is limited to finding the first zero byte on a little-endian processor, it is possible to find the least significant zero byte in fewer operations, using two different constants:[13]
x7 = x - 0x0101010101010101;
byte_map = x7 & ~x & 0x8080808080808080;
For each byte b, this sets its msbit in byte_map if the msbit of b − 1 is set and the msbit of b is clear, something that only happens if b = 0.
The preceding statement is only true if there is no borrow in; if there is a borrow, the condition will also be true if b = 1. However, such a borrow can only be generated by a less significant zero byte, so the least significant zero byte will be correctly identified, as desired.
Not only does this save one binary operation, but the operations are not all sequentially dependent, so it can be performed in two cycles, assuming the existence of an "and not" (bit clear) instruction.
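A strlen-style loop might use this test as sketched below, assuming a little-endian machine, the GCC/Clang __builtin_ctzll intrinsic, and that the string's buffer is padded so whole 64-bit words can be read safely (real implementations typically align to word boundaries instead):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: length of a NUL-terminated string, one 64-bit word at a time. */
size_t swar_strlen(const char *s)
{
    const uint64_t ONES = 0x0101010101010101;
    const uint64_t HIGH = 0x8080808080808080;
    const char *p = s;
    for (;;) {
        uint64_t x;
        memcpy(&x, p, sizeof x);                   /* word load; assumes readable padding */
        uint64_t byte_map = (x - ONES) & ~x & HIGH;
        if (byte_map != 0)                         /* lowest set msbit marks the first zero byte */
            return (size_t)(p - s) + (__builtin_ctzll(byte_map) >> 3);
        p += sizeof x;
    }
}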
As a generalization of a bitmap, it is possible to store very small lookup tables in a single register. For example, the number of days in a month varies from 28 to 31, a range of 4 values. This can be stored in 12 × 2 = 24 bits:
days_table = 0xeefbb3 + (is_leap_year << 2);
days_in_month = 28 + ((days_table >> 2*month) & 3);
(This is assuming a 0-based month number. A 1-based month number can be accommodated by shifting the days_table.)
The fact that the table fits neatly into one register makes it easy to modify forleap years.
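Written out as a self-contained function (a sketch; the function and parameter names are chosen here for illustration):

#include <stdint.h>

/* Sketch: days in a month from the packed 12 x 2-bit table above.
   month is 0-based (0 = January); is_leap_year is 0 or 1. */
int days_in_month(int month, int is_leap_year)
{
    uint32_t days_table = 0xeefbb3u + ((uint32_t)is_leap_year << 2);
    return 28 + (int)((days_table >> (2 * month)) & 3);
}

For example, days_in_month(1, 1) yields 29 for February in a leap year.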