Minifloat

From Wikipedia, the free encyclopedia
Floating-point values coded as few bits

In computing, minifloats are floating-point values represented with very few bits. This reduced precision makes them ill-suited for general-purpose numerical calculations, but they are useful for special purposes such as:

  • Computer graphics, where iterations are small and precision has aesthetic effects.[1]
  • Machine learning, which can be relatively insensitive to numeric precision. bfloat16 and fp8 are common formats.[2]

Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Minifloats with 16 bits are half-precision numbers (as opposed to single and double precision). There are also minifloats with 8 bits or even fewer.[2]

Minifloats can be designed following the principles of the IEEE 754 standard. In this case they must obey the (not explicitly written) rules for the boundary between subnormal and normal numbers and must have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The 2008 revision of the standard, IEEE 754-2008, includes 16-bit binary minifloats.

Notation

A minifloat is usually described using a tuple of four numbers, (S,E,M,B):

  • S is the length of the sign field. It is usually either 0 or 1.
  • E is the length of the exponent field.
  • M is the length of the mantissa (significand) field.
  • B is the exponent bias.

A minifloat format denoted by (S, E, M, B) is, therefore, S + E + M bits long. The (S, E, M, B) notation can be converted to a (B, P, L, U) format as (2, M + 1, B + 1, 2^S − B) (with IEEE use of exponents).
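As a minimal sketch of this notation, the tuple can be modelled as a small Python record. The class name `MiniFloatFormat` and its `split` helper are hypothetical names chosen for illustration, and the sketch assumes the fields are stored in the order sign, exponent, mantissa from the most significant bit down.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MiniFloatFormat:
    """Hypothetical container for the (S, E, M, B) tuple."""
    sign_bits: int      # S
    exponent_bits: int  # E
    mantissa_bits: int  # M
    bias: int           # B

    @property
    def total_bits(self) -> int:
        # The format is S + E + M bits long; the bias occupies no storage.
        return self.sign_bits + self.exponent_bits + self.mantissa_bits

    def split(self, pattern: int) -> Tuple[int, int, int]:
        """Split a raw bit pattern into (sign, exponent, mantissa) fields."""
        mantissa = pattern & ((1 << self.mantissa_bits) - 1)
        exponent = (pattern >> self.mantissa_bits) & ((1 << self.exponent_bits) - 1)
        sign = (pattern >> (self.mantissa_bits + self.exponent_bits)) & ((1 << self.sign_bits) - 1)
        return sign, exponent, mantissa


fmt = MiniFloatFormat(sign_bits=1, exponent_bits=4, mantissa_bits=3, bias=7)
print(fmt.total_bits)           # 8
print(fmt.split(0b0_0111_001))  # (0, 7, 1)
```

Keeping the bias outside the bit count mirrors the text: B changes how the exponent field is interpreted, not how many bits are stored.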

Example 8-bit float (1.4.3)

Layout of an example 8-bit minifloat (1.4.3):

sign | exponent | significand
0 | 0000 | 000

A minifloat in 1 byte (8 bits) with 1 sign bit, 4 exponent bits and 3 significand bits (in short, a 1.4.3 minifloat) is demonstrated here. The exponent bias is defined as 7 to center the values around 1 to match other IEEE 754 floats,[3][4] so (for most values) the actual multiplier for exponent x is 2^(x−7). All IEEE 754 principles should be valid.[5]

Numbers in a different base are marked as ..._base; for example, 101_2 = 5. The bit patterns have spaces to visualize their parts.

Representation of zero

Zero is represented as a zero exponent with a zero mantissa. The zero exponent means zero is a subnormal number with a leading "0." prefix, and with the zero mantissa all bits after the binary point are zero, meaning this value is interpreted as 0.000_2 × 2^(−6) = 0. Floating-point numbers use a signed zero, so −0 is also available and is equal to positive 0.

0 0000 000 = 0
1 0000 000 = −0

Subnormal numbers

The significand is extended with "0." and the exponent value is treated as 1 higher, like that of the least normalized number:

0 0000 001 = 0.001_2 × 2^(1−7) = 0.125 × 2^(−6) = 0.001953125 (least subnormal number)
...
0 0000 111 = 0.111_2 × 2^(1−7) = 0.875 × 2^(−6) = 0.013671875 (greatest subnormal number)

Normalized numbers

The significand is extended with "1.":

0 0001 000 = 1.000_2 × 2^(1−7) = 1 × 2^(−6) = 0.015625 (least normalized number)
0 0001 001 = 1.001_2 × 2^(1−7) = 1.125 × 2^(−6) = 0.017578125
...
0 0111 000 = 1.000_2 × 2^(7−7) = 1 × 2^0 = 1
0 0111 001 = 1.001_2 × 2^(7−7) = 1.125 × 2^0 = 1.125 (least value above 1)
...
0 1110 000 = 1.000_2 × 2^(14−7) = 1.000 × 2^7 = 128
0 1110 001 = 1.001_2 × 2^(14−7) = 1.125 × 2^7 = 144
...
0 1110 110 = 1.110_2 × 2^(14−7) = 1.750 × 2^7 = 224
0 1110 111 = 1.111_2 × 2^(14−7) = 1.875 × 2^7 = 240 (greatest normalized number)
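The subnormal and normalized rules for the 1.4.3 example can be sketched in a few lines of Python. The function name `decode_finite` is a hypothetical choice for illustration, and the sketch deliberately leaves out the all-ones exponent, which is reserved for infinity and NaN.

```python
def decode_finite(pattern: int, bias: int = 7) -> float:
    """Decode a finite 1.4.3 pattern; exponent field 1111 is not handled here."""
    sign = -1.0 if (pattern >> 7) & 1 else 1.0
    exponent = (pattern >> 3) & 0b1111
    mantissa = pattern & 0b111
    if exponent == 0b1111:
        raise ValueError("infinity/NaN patterns are outside this sketch")
    if exponent == 0:
        # Subnormal: implicit "0." prefix, exponent treated as 1.
        return sign * (mantissa / 8) * 2.0 ** (1 - bias)
    # Normalized: implicit "1." prefix.
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - bias)


print(decode_finite(0b0_0000_001))  # 0.001953125
print(decode_finite(0b0_0001_000))  # 0.015625
print(decode_finite(0b0_0111_001))  # 1.125
print(decode_finite(0b0_1110_111))  # 240.0
```

All of these values are exactly representable as Python floats, so the printed results match the hand calculations without rounding noise.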

Infinity

Infinity values have the highest exponent, with the mantissa set to zero. The sign bit can be either positive or negative.

0 1111 000 = +infinity
1 1111 000 = −infinity

Not a number

NaN values have the highest exponent, with a non-zero value for the mantissa. A float with 1-bit sign and 3-bit mantissa has 2 × (2^3 − 1) = 14 NaN values.

s 1111 mmm = NaN (if mmm ≠ 000)
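The NaN count can be checked by brute force over all 256 bit patterns. The following is a small sketch; `classify` is a hypothetical helper name, and the category labels are chosen here for illustration only.

```python
from collections import Counter


def classify(pattern: int) -> str:
    """Name the category of an 8-bit 1.4.3 pattern."""
    exponent = (pattern >> 3) & 0b1111
    mantissa = pattern & 0b111
    if exponent == 0b1111:
        # Highest exponent: infinity if the mantissa is zero, NaN otherwise.
        return "infinity" if mantissa == 0 else "nan"
    if exponent == 0:
        return "zero" if mantissa == 0 else "subnormal"
    return "normal"


counts = Counter(classify(p) for p in range(256))
print(counts["nan"])        # 14
print(counts["infinity"])   # 2
print(counts["zero"])       # 2
print(counts["subnormal"])  # 14
print(counts["normal"])     # 224
```

The five categories add up to 256, so every bit pattern falls into exactly one of them.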

Table of values

This is a chart of all possible values for this example 8-bit float.

bits | … 000 | … 001 | … 010 | … 011 | … 100 | … 101 | … 110 | … 111
0 0000 … | 0 | 0.001953125 | 0.00390625 | 0.005859375 | 0.0078125 | 0.009765625 | 0.01171875 | 0.013671875
0 0001 … | 0.015625 | 0.017578125 | 0.01953125 | 0.021484375 | 0.0234375 | 0.025390625 | 0.02734375 | 0.029296875
0 0010 … | 0.03125 | 0.03515625 | 0.0390625 | 0.04296875 | 0.046875 | 0.05078125 | 0.0546875 | 0.05859375
0 0011 … | 0.0625 | 0.0703125 | 0.078125 | 0.0859375 | 0.09375 | 0.1015625 | 0.109375 | 0.1171875
0 0100 … | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375
0 0101 … | 0.25 | 0.28125 | 0.3125 | 0.34375 | 0.375 | 0.40625 | 0.4375 | 0.46875
0 0110 … | 0.5 | 0.5625 | 0.625 | 0.6875 | 0.75 | 0.8125 | 0.875 | 0.9375
0 0111 … | 1 | 1.125 | 1.25 | 1.375 | 1.5 | 1.625 | 1.75 | 1.875
0 1000 … | 2 | 2.25 | 2.5 | 2.75 | 3 | 3.25 | 3.5 | 3.75
0 1001 … | 4 | 4.5 | 5 | 5.5 | 6 | 6.5 | 7 | 7.5
0 1010 … | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
0 1011 … | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30
0 1100 … | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60
0 1101 … | 64 | 72 | 80 | 88 | 96 | 104 | 112 | 120
0 1110 … | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240
0 1111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 0000 … | −0 | −0.001953125 | −0.00390625 | −0.005859375 | −0.0078125 | −0.009765625 | −0.01171875 | −0.013671875
1 0001 … | −0.015625 | −0.017578125 | −0.01953125 | −0.021484375 | −0.0234375 | −0.025390625 | −0.02734375 | −0.029296875
1 0010 … | −0.03125 | −0.03515625 | −0.0390625 | −0.04296875 | −0.046875 | −0.05078125 | −0.0546875 | −0.05859375
1 0011 … | −0.0625 | −0.0703125 | −0.078125 | −0.0859375 | −0.09375 | −0.1015625 | −0.109375 | −0.1171875
1 0100 … | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375
1 0101 … | −0.25 | −0.28125 | −0.3125 | −0.34375 | −0.375 | −0.40625 | −0.4375 | −0.46875
1 0110 … | −0.5 | −0.5625 | −0.625 | −0.6875 | −0.75 | −0.8125 | −0.875 | −0.9375
1 0111 … | −1 | −1.125 | −1.25 | −1.375 | −1.5 | −1.625 | −1.75 | −1.875
1 1000 … | −2 | −2.25 | −2.5 | −2.75 | −3 | −3.25 | −3.5 | −3.75
1 1001 … | −4 | −4.5 | −5 | −5.5 | −6 | −6.5 | −7 | −7.5
1 1010 … | −8 | −9 | −10 | −11 | −12 | −13 | −14 | −15
1 1011 … | −16 | −18 | −20 | −22 | −24 | −26 | −28 | −30
1 1100 … | −32 | −36 | −40 | −44 | −48 | −52 | −56 | −60
1 1101 … | −64 | −72 | −80 | −88 | −96 | −104 | −112 | −120
1 1110 … | −128 | −144 | −160 | −176 | −192 | −208 | −224 | −240
1 1111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN

There are only 242 different non-NaN values (if +0 and −0 are regarded as different), because 14 of the bit patterns represent NaNs.

Alternative bias values

At these small sizes other bias values may be interesting. For instance, a bias of −2 makes the numbers 0 to 16 have the same bit representation as the integers 0 to 16, at the cost that no non-integer values can be represented.

0 0000 000 = 0.000_2 × 2^(1−(−2)) = 0.0 × 2^3 = 0 (subnormal number)
0 0000 001 = 0.001_2 × 2^(1−(−2)) = 0.125 × 2^3 = 1 (subnormal number)
0 0000 111 = 0.111_2 × 2^(1−(−2)) = 0.875 × 2^3 = 7 (subnormal number)
0 0001 000 = 1.000_2 × 2^(1−(−2)) = 1.000 × 2^3 = 8 (normalized number)
0 0001 111 = 1.111_2 × 2^(1−(−2)) = 1.875 × 2^3 = 15 (normalized number)
0 0010 000 = 1.000_2 × 2^(2−(−2)) = 1.000 × 2^4 = 16 (normalized number)
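The effect of the bias can be checked with a short sketch that decodes the same 1.4.3 layout with a caller-chosen bias. The function name `decode_143` is hypothetical, and the sketch only covers non-negative, finite patterns.

```python
def decode_143(pattern: int, bias: int) -> float:
    """Decode a non-negative, finite 1.4.3 pattern with a caller-chosen bias."""
    exponent = (pattern >> 3) & 0b1111
    mantissa = pattern & 0b111
    if exponent == 0:
        # Subnormal: "0." prefix, exponent treated as 1.
        return (mantissa / 8) * 2.0 ** (1 - bias)
    return (1 + mantissa / 8) * 2.0 ** (exponent - bias)


# With bias -2, bit patterns 0..16 decode to the integers 0..16.
print(all(decode_143(p, bias=-2) == p for p in range(17)))  # True
# Beyond 16 the spacing doubles, so pattern 17 decodes to 18.
print(decode_143(0b0_0010_001, bias=-2))  # 18.0
```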

Different bit allocations

The above describes an example 8-bit float with 1 sign bit, 4 exponent bits, and 3 significand bits, which is a nice balance. However, any bit allocation is possible. A format could give more of the bits to the exponent if more dynamic range is needed at the cost of precision, or give more of the bits to the significand if more precision is needed at the cost of dynamic range. At the extreme, it is possible to allocate all bits to the exponent, or all but one of the bits to the significand, leaving the exponent with only one bit. The exponent must be given at least one bit, or else the format no longer makes sense as a float; it just becomes a signed number.

Here is a chart of all possible values for a different 8-bit float with 1 sign bit, 3 exponent bits and 4 significand bits. Having 1 more significand bit than exponent bits ensures that the precision remains at least 0.5 throughout the entire range.[6]

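bits | … 0000 | … 0001 | … 0010 | … 0011 | … 0100 | … 0101 | … 0110 | … 0111 | … 1000 | … 1001 | … 1010 | … 1011 | … 1100 | … 1101 | … 1110 | … 1111
0 000 … | 0 | 0.015625 | 0.03125 | 0.046875 | 0.0625 | 0.078125 | 0.09375 | 0.109375 | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375
0 001 … | 0.25 | 0.265625 | 0.28125 | 0.296875 | 0.3125 | 0.328125 | 0.34375 | 0.359375 | 0.375 | 0.390625 | 0.40625 | 0.421875 | 0.4375 | 0.453125 | 0.46875 | 0.484375
0 010 … | 0.5 | 0.53125 | 0.5625 | 0.59375 | 0.625 | 0.65625 | 0.6875 | 0.71875 | 0.75 | 0.78125 | 0.8125 | 0.84375 | 0.875 | 0.90625 | 0.9375 | 0.96875
0 011 … | 1 | 1.0625 | 1.125 | 1.1875 | 1.25 | 1.3125 | 1.375 | 1.4375 | 1.5 | 1.5625 | 1.625 | 1.6875 | 1.75 | 1.8125 | 1.875 | 1.9375
0 100 … | 2 | 2.125 | 2.25 | 2.375 | 2.5 | 2.625 | 2.75 | 2.875 | 3 | 3.125 | 3.25 | 3.375 | 3.5 | 3.625 | 3.75 | 3.875
0 101 … | 4 | 4.25 | 4.5 | 4.75 | 5 | 5.25 | 5.5 | 5.75 | 6 | 6.25 | 6.5 | 6.75 | 7 | 7.25 | 7.5 | 7.75
0 110 … | 8 | 8.5 | 9 | 9.5 | 10 | 10.5 | 11 | 11.5 | 12 | 12.5 | 13 | 13.5 | 14 | 14.5 | 15 | 15.5
0 111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 000 … | −0 | −0.015625 | −0.03125 | −0.046875 | −0.0625 | −0.078125 | −0.09375 | −0.109375 | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375
1 001 … | −0.25 | −0.265625 | −0.28125 | −0.296875 | −0.3125 | −0.328125 | −0.34375 | −0.359375 | −0.375 | −0.390625 | −0.40625 | −0.421875 | −0.4375 | −0.453125 | −0.46875 | −0.484375
1 010 … | −0.5 | −0.53125 | −0.5625 | −0.59375 | −0.625 | −0.65625 | −0.6875 | −0.71875 | −0.75 | −0.78125 | −0.8125 | −0.84375 | −0.875 | −0.90625 | −0.9375 | −0.96875
1 011 … | −1 | −1.0625 | −1.125 | −1.1875 | −1.25 | −1.3125 | −1.375 | −1.4375 | −1.5 | −1.5625 | −1.625 | −1.6875 | −1.75 | −1.8125 | −1.875 | −1.9375
1 100 … | −2 | −2.125 | −2.25 | −2.375 | −2.5 | −2.625 | −2.75 | −2.875 | −3 | −3.125 | −3.25 | −3.375 | −3.5 | −3.625 | −3.75 | −3.875
1 101 … | −4 | −4.25 | −4.5 | −4.75 | −5 | −5.25 | −5.5 | −5.75 | −6 | −6.25 | −6.5 | −6.75 | −7 | −7.25 | −7.5 | −7.75
1 110 … | −8 | −8.5 | −9 | −9.5 | −10 | −10.5 | −11 | −11.5 | −12 | −12.5 | −13 | −13.5 | −14 | −14.5 | −15 | −15.5
1 111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN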
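The claim that this 1.3.4 layout never has a spacing coarser than 0.5 can be checked numerically. The sketch below assumes a bias of 3, which is inferred from the tabulated values rather than stated in the text, and the helper names `finite_values` and `largest_gap` are hypothetical.

```python
def finite_values(exp_bits: int, man_bits: int, bias: int) -> list:
    """Non-negative finite values of an IEEE-style minifloat, in ascending order."""
    values = []
    # The all-ones exponent is excluded: it encodes infinity and NaN.
    for exponent in range((1 << exp_bits) - 1):
        for mantissa in range(1 << man_bits):
            frac = mantissa / (1 << man_bits)
            if exponent == 0:
                values.append(frac * 2.0 ** (1 - bias))
            else:
                values.append((1 + frac) * 2.0 ** (exponent - bias))
    return values


def largest_gap(exp_bits: int, man_bits: int, bias: int) -> float:
    """Largest difference between two neighbouring finite values."""
    v = finite_values(exp_bits, man_bits, bias)
    return max(b - a for a, b in zip(v, v[1:]))


print(largest_gap(3, 4, 3))  # 0.5  (the 1.3.4 layout, assumed bias 3)
print(largest_gap(4, 3, 7))  # 16.0 (the earlier 1.4.3 layout with bias 7)
```

The comparison shows the trade-off described above: moving one bit from the exponent to the significand shrinks the range from 240 to 15.5 but keeps neighbouring values much closer together.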

Tables like the above can be generated for any combination of SEMB (sign, exponent, mantissa/significand, and bias) values using a script in Python or in GDScript.
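As a rough illustration of what such a generator might look like (this is a sketch written for this text, not the script referred to above), the following function builds the rows of a value table. The name `minifloat_table` is hypothetical, and the sketch assumes exactly one sign bit and IEEE-style infinity and NaN patterns.

```python
def minifloat_table(exp_bits: int, man_bits: int, bias: int) -> list:
    """Rows are indexed by (sign, exponent), columns by mantissa."""
    rows = []
    max_exp = (1 << exp_bits) - 1
    for sign in (0, 1):
        for exponent in range(max_exp + 1):
            row = []
            for mantissa in range(1 << man_bits):
                frac = mantissa / (1 << man_bits)
                if exponent == max_exp:
                    # All-ones exponent: infinity or NaN.
                    cell = "NaN" if mantissa else ("-Inf" if sign else "Inf")
                else:
                    if exponent == 0:
                        value = frac * 2.0 ** (1 - bias)   # subnormal
                    else:
                        value = (1 + frac) * 2.0 ** (exponent - bias)
                    cell = ("-" if sign else "") + f"{value:.12g}"
                row.append(cell)
            rows.append(row)
    return rows


# Print the 1.3.4 table with bias 3, one row per (sign, exponent) pair.
for row in minifloat_table(exp_bits=3, man_bits=4, bias=3):
    print(" | ".join(row))
```

Twelve significant digits are enough for the formats shown here; wider formats would need a different number formatting choice.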

Arithmetic

Addition

[Figure: Addition of (1.3.2.3)-minifloats]

The graphic demonstrates the addition of even smaller (1.3.2.3)-minifloats with 6 bits. This floating-point system follows the rules of IEEE 754 exactly. A NaN operand always produces a NaN result. Inf − Inf and (−Inf) + Inf also result in NaN (green area). Inf can be increased or decreased by finite values without change. Sums with finite operands can give an infinite result (e.g. 14.0 + 3.0 = +Inf; +Inf results form the cyan area and −Inf results the magenta area). The range of the finite operands is filled with the curves x + y = c, where c is always one of the representable float values (blue and red for positive and negative results respectively).
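The behaviour described here can be reproduced with a small model of (1.3.2.3) addition. This is a sketch under stated assumptions: it uses round-to-nearest with ties to the even mantissa (the IEEE 754 default), computes the exact sum with `fractions.Fraction`, expects operands that are already representable, and does not model the sign of zero results. The names `decode`, `round_to_format` and `add` are hypothetical.

```python
from fractions import Fraction
import math

E_BITS, M_BITS, BIAS = 3, 2, 3
MAX_EXP = (1 << E_BITS) - 1


def decode(pattern: int) -> Fraction:
    """Exact value of a non-negative finite pattern (sign bit omitted)."""
    exponent, mantissa = pattern >> M_BITS, pattern & ((1 << M_BITS) - 1)
    frac = Fraction(mantissa, 1 << M_BITS)
    if exponent == 0:
        return frac * Fraction(2) ** (1 - BIAS)
    return (1 + frac) * Fraction(2) ** (exponent - BIAS)


# Ascending list of finite magnitudes; the list index equals the bit pattern.
FINITE = [decode(p) for p in range(MAX_EXP << M_BITS)]


def round_to_format(x: Fraction) -> float:
    """Round to the nearest representable value, ties to even mantissa."""
    sign, mag = (-1.0 if x < 0 else 1.0), abs(x)
    top = FINITE[-1]
    # Halfway between the largest finite value and the next power of two.
    overflow_at = top + (top - FINITE[-2]) / 2
    if mag >= overflow_at:
        return sign * math.inf
    # On a tie, prefer the pattern whose lowest mantissa bit is 0.
    best = min(range(len(FINITE)), key=lambda p: (abs(FINITE[p] - mag), p & 1))
    return sign * float(FINITE[best])


def add(a: float, b: float) -> float:
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if math.isinf(a) or math.isinf(b):
        return a + b  # Python floats give Inf, or NaN for Inf - Inf
    return round_to_format(Fraction(a) + Fraction(b))


print(add(14.0, 3.0))           # inf
print(add(math.inf, -3.0))      # inf
print(add(math.inf, -math.inf)) # nan
print(add(1.0, 0.125))          # 1.0 (tie between 1.0 and 1.25, even mantissa wins)
```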

Subtraction, multiplication and division

The other arithmetic operations can be illustrated similarly:

  • Subtraction
  • Multiplication
  • Division

Other sizes

The Radeon R300 and R420 GPUs used an "fp24" floating-point format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa.[7] "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision" for vertex and pixel shader calculations performed by the graphics hardware.

Khronos defines 10-bit and 11-bit float formats for use with Vulkan. Both formats have no sign bit and a 5-bit exponent. The 10-bit format has a 5-bit mantissa, and the 11-bit format has a 6-bit mantissa.[8][9]
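A decoder for these two unsigned layouts might look like the sketch below. The bias of 15 and the treatment of the all-ones exponent as infinity or NaN are assumptions made here by analogy with IEEE half precision (which also has a 5-bit exponent); they are not stated in the text above, and the function name `decode_unsigned_float` is hypothetical.

```python
import math


def decode_unsigned_float(pattern: int, man_bits: int,
                          exp_bits: int = 5, bias: int = 15) -> float:
    """Decode an unsigned minifloat (no sign bit); bias 15 is an assumption."""
    mantissa = pattern & ((1 << man_bits) - 1)
    exponent = (pattern >> man_bits) & ((1 << exp_bits) - 1)
    frac = mantissa / (1 << man_bits)
    if exponent == (1 << exp_bits) - 1:
        return math.inf if mantissa == 0 else math.nan
    if exponent == 0:
        return frac * 2.0 ** (1 - bias)          # subnormal
    return (1 + frac) * 2.0 ** (exponent - bias)  # normalized


print(decode_unsigned_float(0b01111_000000, man_bits=6))  # 1.0 (11-bit format)
print(decode_unsigned_float(0b01111_00000, man_bits=5))   # 1.0 (10-bit format)
print(decode_unsigned_float(0b11110_111111, man_bits=6))  # 65024.0
```

Without a sign bit, every pattern is non-negative, so the whole code space is spent on magnitude.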

IEEE SA Working Group P3109 is currently working on a standard for 8-bit minifloats optimized for machine learning. The current draft defines not one format, but a family of 7 different formats, named "binary8pP", where "P" is a number from 1 to 7. These floats are designed to be compact and efficient, but do not follow the same semantics as other IEEE floats, and are missing features such as negative zero and multiple NaN values. Infinity is defined as both the exponent and significand having all ones, instead of other IEEE floats where the exponent is all ones and the significand is all zeroes.[10]

4 bits and fewer

The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.[11] In the table below, the columns have different values for the sign and mantissa bits, and the rows are different values for the exponent bits.

bits | 0 … 0 | 0 … 1 | 1 … 0 | 1 … 1
… 00 … | 0 | 0.5 | −0 | −0.5
… 01 … | 1 | 1.5 | −1 | −1.5
… 10 … | 2 | 3 | −2 | −3
… 11 … | Inf | NaN | −Inf | NaN

If normalized numbers are not required, the size can be reduced to 3 bits by reducing the exponent to 1 bit.

bits | 0 … 0 | 0 … 1 | 1 … 0 | 1 … 1
… 0 … | 0 | 1 | −0 | −1
… 1 … | Inf | NaN | −Inf | NaN

In situations where the sign bit can be excluded, each of the above examples can be reduced by 1 bit further, keeping only the left half of the above tables. A 2-bit float with 1-bit exponent and 1-bit mantissa would only have 0, 1, Inf, NaN values.

If the mantissa is allowed to be 0-bit, a 1-bit float format would have a 1-bit exponent, and the only two values would be 0 and Inf. The exponent must be at least 1 bit or else it no longer makes sense as a float (it would just be a signed number).

4-bit floating point numbers, without the four special IEEE values, have found use in accelerating large language models.[12][13]
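The shrinking formats in this section can be enumerated with one short function. The biases used below (1 for the 4-bit format, 0 for the smaller ones) are inferred from the tabulated values rather than stated in the text, and the name `tiny_values` is hypothetical.

```python
def tiny_values(sign_bits: int, exp_bits: int, man_bits: int, bias: int) -> list:
    """List every pattern's value for an IEEE-style format, in pattern order."""
    out = []
    max_exp = (1 << exp_bits) - 1
    for pattern in range(1 << (sign_bits + exp_bits + man_bits)):
        mantissa = pattern & ((1 << man_bits) - 1)
        exponent = (pattern >> man_bits) & max_exp
        prefix = "-" if pattern >> (man_bits + exp_bits) else ""
        if exponent == max_exp:
            # All-ones exponent: infinity or NaN.
            out.append("NaN" if mantissa else prefix + "Inf")
            continue
        frac = mantissa / (1 << man_bits)
        if exponent == 0:
            value = frac * 2.0 ** (1 - bias)
        else:
            value = (1 + frac) * 2.0 ** (exponent - bias)
        out.append(prefix + f"{value:g}")
    return out


print(tiny_values(1, 2, 1, bias=1))  # 4-bit: 0, 0.5, 1, 1.5, 2, 3, Inf, NaN and negatives
print(tiny_values(1, 1, 1, bias=0))  # 3-bit: 0, 1, Inf, NaN and negatives
print(tiny_values(0, 1, 1, bias=0))  # 2-bit: ['0', '1', 'Inf', 'NaN']
print(tiny_values(0, 1, 0, bias=0))  # 1-bit: ['0', 'Inf']
```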

bits | 0 … 0 | 0 … 1 | 1 … 0 | 1 … 1
… 00 … | 0 | 0.5 | −0 | −0.5
… 01 … | 1 | 1.5 | −1 | −1.5
… 10 … | 2 | 3 | −2 | −3
… 11 … | 4 | 6 | −4 | −6
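Dropping the special values simply means the top exponent is decoded like any other. A minimal sketch, assuming a bias of 1 (inferred from the table, not stated in the text) and using the hypothetical name `decode_fp4`:

```python
def decode_fp4(pattern: int, bias: int = 1) -> float:
    """Decode a 4-bit 1.2.1 pattern where exponent 11 is an ordinary exponent."""
    sign = -1.0 if pattern & 0b1000 else 1.0
    exponent = (pattern >> 1) & 0b11
    mantissa = pattern & 0b1
    if exponent == 0:
        # Subnormal: "0." prefix, exponent treated as 1.
        return sign * (mantissa / 2) * 2.0 ** (1 - bias)
    # No exponent value is reserved for Inf or NaN in this variant.
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - bias)


print([decode_fp4(p) for p in range(8)])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Reclaiming the top exponent doubles the largest magnitude from 3 to 6 at the cost of having no way to represent overflow or invalid results.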

In embedded devices

Minifloats are also commonly used in embedded devices,[citation needed] especially on microcontrollers where floating-point arithmetic must be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so the register boundary automatically separates the parts without shifting.

References

  1. ^ Mocerino, Luca; Calimera, Andrea (24 November 2021). "AxP: A HW-SW Co-Design Pipeline for Energy-Efficient Approximated ConvNets via Associative Matching". Applied Sciences. 11 (23): 11164. doi:10.3390/app112311164.
  2. ^ a b https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/ (joint announcement by Intel, NVIDIA, Arm); https://arxiv.org/abs/2209.05433 (preprint paper jointly written by researchers from the aforementioned three companies)
  3. ^ IEEE half-precision has 5 exponent bits with bias 15 (2^(5−1) − 1 = 15), IEEE single-precision has 8 exponent bits with bias 127 (2^(8−1) − 1 = 127), IEEE double-precision has 11 exponent bits with bias 1023 (2^(11−1) − 1 = 1023), and IEEE quadruple-precision has 15 exponent bits with bias 16383 (2^(15−1) − 1 = 16383). See the Exponent bias article for more detail.
  4. ^ O'Hallaron, David R.; Bryant, Randal E. (2010). Computer systems: a programmer's perspective (2 ed.). Boston, Massachusetts, USA: Prentice Hall. ISBN 978-0-13-610804-7.
  5. ^ Burch, Carl. "Floating-point representation". Hendrix College. Retrieved 29 August 2023.
  6. ^ https://people.cs.umass.edu/~verts/cmpsci145/8-Bit_Floating_Point.pdf
  7. ^ Buck, Ian (13 March 2005), "Chapter 32. Taking the Plunge into GPU Computing", in Pharr, Matt (ed.), GPU Gems, Addison-Wesley, ISBN 0-321-33559-7, retrieved 5 April 2018.
  8. ^ Garrard, Andrew. "10.3. Unsigned 10-bit floating-point numbers". Khronos Data Format Specification v1.2 rev 1. Khronos Group. Retrieved 10 August 2023.
  9. ^ Garrard, Andrew. "10.2. Unsigned 11-bit floating-point numbers". Khronos Data Format Specification v1.2 rev 1. Khronos Group. Retrieved 10 August 2023.
  10. ^ "IEEE Working Group P3109 Interim Report on 8-bit Binary Floating-point Formats" (PDF). GitHub. IEEE Working Group P3109. Archived from the original on 7 May 2024. Retrieved 7 May 2024.
  11. ^ Shaneyfelt, Dr. Ted. "Dr. Shaneyfelt's Floating Point Construction Gizmo". Dr. Ted Shaneyfelt. Retrieved 29 August 2023.
  12. ^ "Accelerate LLM Inference on Your Local PC".
  13. ^ "OCP Microscaling Formats (MX) Specification". Open Compute Project. Archived from the original on 24 February 2024. Retrieved 21 February 2025.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Minifloat&oldid=1277015228"