Minifloat

From Wikipedia, the free encyclopedia
Floating-point values encoded with very few bits

In computing, minifloats are floating-point values represented with very few bits. This reduced precision makes them ill-suited for general-purpose numerical calculations, but they are useful for special purposes such as:

  • Computer graphics, where human perception of color and light levels has low precision.[1] The 16-bit half-precision format is very popular.
  • Machine learning, which can be relatively insensitive to numeric precision. 16-bit, 8-bit, and even 4-bit floats are increasingly being used.[2]

Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Depending on context, minifloat may mean any size less than 32 bits, any size less than or equal to 16 bits, or any size less than 16 bits. The term microfloat may mean any size less than or equal to 8 bits.[3]

Notation


This page uses the notation (S.E.M) to describe a minifloat:

  • S is the length of the sign field (0 or 1).
  • E is the length of the exponent field.
  • M is the length of the mantissa (significand) field.

Minifloats can be designed following the principles of the IEEE 754 standard. Almost all use the smallest exponent for subnormal and normal numbers. Many use the largest exponent for infinity and NaN, indicated by (special exponent) S_E = 1. Some minifloats use this exponent value normally, in which case S_E = 0.

The exponent bias is B = 2^(E−1) − S_E. This value ensures that all representable numbers have a representable reciprocal.

The notation can be converted to a (B, P, L, U) format as (2, M + 1, S_E − 2^(E−1) + 1, 2^(E−1) − 1).

A common notation used in the field of machine learning is FPn EeMm, where the lowercase letters are replaced by numbers. For example, FP8 E4M3 is the same as (1.4.3).
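As a concrete illustration of the formulas above, here is a minimal Python sketch (illustrative, not from any cited source) that derives the bias, precision, and exponent range of an (S.E.M) format; S_E = 1 is assumed, as in the (1.4.3) example later in this article:

    def minifloat_params(E, M, SE=1):
        """Derive (B, P, L, U) for an (S.E.M) minifloat per the formulas above."""
        bias = 2 ** (E - 1) - SE       # B = 2^(E-1) - S_E
        precision = M + 1              # P: mantissa bits plus the implicit leading bit
        emin = SE - 2 ** (E - 1) + 1   # L: least exponent
        emax = 2 ** (E - 1) - 1        # U: greatest exponent
        return bias, precision, emin, emax

    print(minifloat_params(4, 3))      # FP8 E4M3 as (1.4.3): (7, 4, -6, 7)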

Usage


Many situations that call for floating-point numbers do not actually require much precision. This is typical for high-dynamic-range graphics and image processing. It is also typical for larger neural networks, a property that has been exploited since the 2020s to allow increasingly large language models to be trained and deployed. A more general-purpose example is fp16 (1.5.10) in IEEE 754-2008, called "half-precision" (as opposed to 32-bit single and 64-bit double precision).

The bfloat16 (1.8.7) format consists of the first 16 bits of a standard single-precision number and was often used in image processing and machine learning before hardware support was added for other formats.

Graphics


The Radeon R300 and R420 GPUs used an "fp24" floating-point format (1.7.16).[4] "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision", for vertex and pixel shader calculations performed by the graphics hardware.

In 2016, Khronos defined 10-bit (0.5.5) and 11-bit (0.5.6) unsigned formats for use with Vulkan.[5][6] These can be converted from positive half-precision values by truncating the sign and trailing digits, as sketched below.
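Since binary16 uses a 1-bit sign, 5-bit exponent, and 10-bit mantissa, this truncation amounts to a simple shift. A minimal sketch, assuming the inputs are non-negative IEEE binary16 bit patterns (function names are illustrative):

    def half_to_unsigned11(h16: int) -> int:
        # keep the 5 exponent bits and the top 6 of the 10 mantissa bits
        return (h16 & 0x7FFF) >> 4

    def half_to_unsigned10(h16: int) -> int:
        # keep the 5 exponent bits and the top 5 of the 10 mantissa bits
        return (h16 & 0x7FFF) >> 5

    assert half_to_unsigned11(0x3C00) == 0b01111_000000   # 1.0 survives truncation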

Microcontroller


Minifloats are also commonly used in embedded devices such as microcontrollers, where floating-point arithmetic needs to be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so the register boundary automatically separates the parts without shifting (e.g., (1.3.4) on 4-bit devices).[citation needed]

Machine learning


In 2022, Nvidia and others announced support for the "fp8" format E5M2 (1.5.2). Values in this format can be converted from half-precision by truncating the trailing digits, and the format supports special values such as NaN and infinity. They also announced a format without infinity and with only two representations (positive and negative) for NaN, FP8 E4M3 (1.4.3): after all, special values are unnecessary in the inference (forward-running, as opposed to training via backpropagation) of neural networks.[2] These formats have been made into an industrial standard called OCP-FP8.[7] Further compression, such as FP4 E2M1 (1.2.1), has also proven fruitful.[8]
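Because E5M2 matches the top bits of binary16, the truncating conversion is just a byte shift. A minimal sketch (note that practical converters usually round to nearest even rather than truncate):

    def half_to_fp8_e5m2(h16: int) -> int:
        # FP8 E5M2 is the top byte of an IEEE binary16 bit pattern:
        # sign, 5 exponent bits, and the top 2 of the 10 mantissa bits
        return (h16 >> 8) & 0xFF

    assert half_to_fp8_e5m2(0x3C00) == 0x3C   # 1.0 in binary16 -> 1.0 in E5M2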

The FP4-E2M1 format
     | .000 | .001 | .010 | .011 | .100 | .101 | .110 | .111
0... |  0   |  0.5 |  1   |  1.5 |  2   |  3   |  4   |  6
1... | −0   | −0.5 | −1   | −1.5 | −2   | −3   | −4   | −6

Since 2023, IEEE SA Working Group P3109 has been working on a standard for minifloats optimized for machine learning by systematizing current practice. Interim Report version 3.0 (August 2025) defines a family of formats under the systematic name "binaryKpP[s/u][e/f]", where K is the total bit length, P is the number of mantissa bits, s/u (signed/unsigned) refers to whether a sign bit is present, and e/f (extended/finite) refers to whether infinity is included. By convention, s and e may be omitted. To save space for more numbers, there is no "negative zero", and there is only one representation for NaN; for signed formats, the NaN can thus use the bit pattern of what would have been negative zero. For example, the FP4-E2M1 format can be approximated as the following in P3109:[9]

The binary4p2sf format
     | .000 | .001 | .010 | .011 | .100 | .101 | .110 | .111
0... |  0   |  0.5 |  1   |  1.5 |  2   |  3   |  4   |  6
1... | NaN  | −0.5 | −1   | −1.5 | −2   | −3   | −4   | −6

(For binary4p2se, ±6 are replaced by ±Infinity.)
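A small illustrative decoder for the binary4p2sf table above (a sketch written for this table, not taken from the P3109 report) shows how the single NaN occupies the slot that a signed format would otherwise give to −0:

    def decode_binary4p2sf(code: int) -> float:
        if code == 0b1000:                 # the would-be -0 bit pattern encodes NaN
            return float("nan")
        sign = -1.0 if code & 0b1000 else 1.0
        exp = (code >> 1) & 0b11
        man = code & 0b1
        if exp == 0:                       # subnormal: 0.m * 2^0
            return sign * (man / 2)
        return sign * (1 + man / 2) * 2.0 ** (exp - 1)   # normal: 1.m * 2^(e-1)

    assert [decode_binary4p2sf(c) for c in range(8)] == [0, 0.5, 1, 1.5, 2, 3, 4, 6]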

A downside of very small minifloats is that they have very little representable dynamic range. To address this problem, the machine learning industry has invented "microscaling formats" (MX), a kind of block floating-point. In an MX format, a group of 32 minifloats shares an additional scaling factor represented by an "E8M0" minifloat (which is able to represent powers of 2 between 2^−127 and 2^127). MX has been defined for FP8-E5M2, FP8-E4M3, FP6-E3M2 (1.3.2), FP6-E2M3 (1.2.3), and FP4-E2M1.[10]
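The following simplified sketch illustrates the MX idea: a block of 32 values shares one power-of-two scale, and each element is quantized to a minifloat after scaling. The scale selection here is a plausible simplification and quantize_elem is a hypothetical stand-in; the OCP specification defines both precisely:

    import math

    def mx_block_encode(block, quantize_elem):
        """Encode 32 values as one shared power-of-two scale plus 32 minifloats."""
        assert len(block) == 32
        amax = max(abs(x) for x in block)
        # shared scale: a power of two, storable as an E8M0 scale element
        scale_exp = math.floor(math.log2(amax)) if amax > 0 else 0
        return scale_exp, [quantize_elem(x / 2.0 ** scale_exp) for x in block]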

Examples


8-bit (1.4.3)


A minifloat in 1 byte (8 bits) with 1 sign bit, 4 exponent bits, and 3 significand bits (1.4.3) is demonstrated here. The exponent bias is defined as 7, centering the values around 1 to match other IEEE 754 floats,[11][12] so (for most values) the actual multiplier for exponent x is 2^(x−7). All IEEE 754 principles should be valid.[13] This form is quite common for instruction.[citation needed]

Sign | Exponent | Significand
  0  |   0000   |    000

Zero is represented as a zero exponent with a zero mantissa. The zero exponent means zero is a subnormal number with a leading "0." prefix, and with a zero mantissa all bits after the binary point are zero, so this value is interpreted as 0.000_2 × 2^−6 = 0. Floating-point numbers use a signed zero, so −0 is also available and is equal to positive 0.

0 0000 000 = 0
1 0000 000 = −0

For the lowest exponent (all zeros), the significand is extended with "0." and the stored exponent is treated as if it were 1, the same exponent as the least normalized numbers:

0 0000 001 = 0.001_2 × 2^(1−7) = 0.125 × 2^−6 = 0.001953125 (least subnormal number)
...
0 0000 111 = 0.111_2 × 2^(1−7) = 0.875 × 2^−6 = 0.013671875 (greatest subnormal number)

For all other exponents, the significand is extended with "1.":

0 0001 000 = 1.000_2 × 2^(1−7) = 1 × 2^−6 = 0.015625 (least normalized number)
0 0001 001 = 1.001_2 × 2^(1−7) = 1.125 × 2^−6 = 0.017578125
...
0 0111 000 = 1.000_2 × 2^(7−7) = 1 × 2^0 = 1
0 0111 001 = 1.001_2 × 2^(7−7) = 1.125 × 2^0 = 1.125 (least value above 1)
...
0 1110 000 = 1.000_2 × 2^(14−7) = 1.000 × 2^7 = 128
0 1110 001 = 1.001_2 × 2^(14−7) = 1.125 × 2^7 = 144
...
0 1110 110 = 1.110_2 × 2^(14−7) = 1.750 × 2^7 = 224
0 1110 111 = 1.111_2 × 2^(14−7) = 1.875 × 2^7 = 240 (greatest normalized number)

Infinity values have the highest exponent, with the mantissa set to zero. The sign bit can be either positive or negative.

0 1111 000 = +infinity
1 1111 000 = −infinity

NaN values have the highest exponent, with the mantissa non-zero.

s 1111 mmm = NaN (if mmm ≠ 000)

This is a chart of all possible values for this example 8-bit float:

         | …000 | …001 | …010 | …011 | …100 | …101 | …110 | …111
0 0000 … | 0 | 0.001953125 | 0.00390625 | 0.005859375 | 0.0078125 | 0.009765625 | 0.01171875 | 0.013671875
0 0001 … | 0.015625 | 0.017578125 | 0.01953125 | 0.021484375 | 0.0234375 | 0.025390625 | 0.02734375 | 0.029296875
0 0010 … | 0.03125 | 0.03515625 | 0.0390625 | 0.04296875 | 0.046875 | 0.05078125 | 0.0546875 | 0.05859375
0 0011 … | 0.0625 | 0.0703125 | 0.078125 | 0.0859375 | 0.09375 | 0.1015625 | 0.109375 | 0.1171875
0 0100 … | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375
0 0101 … | 0.25 | 0.28125 | 0.3125 | 0.34375 | 0.375 | 0.40625 | 0.4375 | 0.46875
0 0110 … | 0.5 | 0.5625 | 0.625 | 0.6875 | 0.75 | 0.8125 | 0.875 | 0.9375
0 0111 … | 1 | 1.125 | 1.25 | 1.375 | 1.5 | 1.625 | 1.75 | 1.875
0 1000 … | 2 | 2.25 | 2.5 | 2.75 | 3 | 3.25 | 3.5 | 3.75
0 1001 … | 4 | 4.5 | 5 | 5.5 | 6 | 6.5 | 7 | 7.5
0 1010 … | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
0 1011 … | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30
0 1100 … | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60
0 1101 … | 64 | 72 | 80 | 88 | 96 | 104 | 112 | 120
0 1110 … | 128 | 144 | 160 | 176 | 192 | 208 | 224 | 240
0 1111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 0000 … | −0 | −0.001953125 | −0.00390625 | −0.005859375 | −0.0078125 | −0.009765625 | −0.01171875 | −0.013671875
1 0001 … | −0.015625 | −0.017578125 | −0.01953125 | −0.021484375 | −0.0234375 | −0.025390625 | −0.02734375 | −0.029296875
1 0010 … | −0.03125 | −0.03515625 | −0.0390625 | −0.04296875 | −0.046875 | −0.05078125 | −0.0546875 | −0.05859375
1 0011 … | −0.0625 | −0.0703125 | −0.078125 | −0.0859375 | −0.09375 | −0.1015625 | −0.109375 | −0.1171875
1 0100 … | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375
1 0101 … | −0.25 | −0.28125 | −0.3125 | −0.34375 | −0.375 | −0.40625 | −0.4375 | −0.46875
1 0110 … | −0.5 | −0.5625 | −0.625 | −0.6875 | −0.75 | −0.8125 | −0.875 | −0.9375
1 0111 … | −1 | −1.125 | −1.25 | −1.375 | −1.5 | −1.625 | −1.75 | −1.875
1 1000 … | −2 | −2.25 | −2.5 | −2.75 | −3 | −3.25 | −3.5 | −3.75
1 1001 … | −4 | −4.5 | −5 | −5.5 | −6 | −6.5 | −7 | −7.5
1 1010 … | −8 | −9 | −10 | −11 | −12 | −13 | −14 | −15
1 1011 … | −16 | −18 | −20 | −22 | −24 | −26 | −28 | −30
1 1100 … | −32 | −36 | −40 | −44 | −48 | −52 | −56 | −60
1 1101 … | −64 | −72 | −80 | −88 | −96 | −104 | −112 | −120
1 1110 … | −128 | −144 | −160 | −176 | −192 | −208 | −224 | −240
1 1111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN

There are only 242 different non-NaN values (if +0 and −0 are regarded as different), because 14 of the bit patterns represent NaNs.

To convert to or from 8-bit floats in programming languages, libraries or functions are usually required, since this format is not standardized. For example, here is an example implementation in C++ (public domain).
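As a minimal sketch of such a conversion (written for this article's (1.4.3) layout, independent of the linked C++ code), a decoder needs only the three cases worked out above:

    def decode_143(byte: int) -> float:
        """Decode one (1.4.3) bit pattern with bias 7 to a Python float."""
        sign = -1.0 if byte & 0x80 else 1.0
        exp = (byte >> 3) & 0xF
        man = byte & 0x7
        if exp == 0:                  # subnormal: 0.mmm * 2^(1-7)
            return sign * (man / 8) * 2.0 ** -6
        if exp == 15:                 # special exponent: Inf or NaN
            return sign * float("inf") if man == 0 else float("nan")
        return sign * (1 + man / 8) * 2.0 ** (exp - 7)   # normal: 1.mmm * 2^(e-7)

    # Counting the non-NaN bit patterns reproduces the figure of 242 above.
    assert sum(1 for b in range(256) if decode_143(b) == decode_143(b)) == 242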

8-bit (1.4.3) with B = −2


At these small sizes, other bias values may be interesting; for instance, a bias of −2 makes the numbers 0–16 have the same bit representations as the integers 0–16, at the cost that no non-integer values can be represented.

0 0000 000 = 0.000_2 × 2^(1−(−2)) = 0.0 × 2^3 = 0 (subnormal number)
0 0000 001 = 0.001_2 × 2^(1−(−2)) = 0.125 × 2^3 = 1 (subnormal number)
0 0000 111 = 0.111_2 × 2^(1−(−2)) = 0.875 × 2^3 = 7 (subnormal number)
0 0001 000 = 1.000_2 × 2^(1−(−2)) = 1.000 × 2^3 = 8 (normalized number)
0 0001 111 = 1.111_2 × 2^(1−(−2)) = 1.875 × 2^3 = 15 (normalized number)
0 0010 000 = 1.000_2 × 2^(2−(−2)) = 1.000 × 2^4 = 16 (normalized number)
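A quick sketch verifying this property under the formulas above (the special top exponent is ignored, since it is not reached here):

    def decode_143_biased(byte: int, bias: int) -> float:
        exp = (byte >> 3) & 0xF
        man = byte & 0x7
        if exp == 0:                  # subnormal: 0.mmm * 2^(1-B)
            return (man / 8) * 2.0 ** (1 - bias)
        return (1 + man / 8) * 2.0 ** (exp - bias)   # normal: 1.mmm * 2^(e-B)

    # With B = -2, the bit patterns 0..16 decode to the integers 0..16.
    assert all(decode_143_biased(i, -2) == i for i in range(17))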

8-bit (1.3.4)


Any bit allocation is possible. A format could give more of the bits to the exponent for more dynamic range with less precision, or more of the bits to the significand for more precision with less dynamic range. At the extreme, it is possible to allocate all bits to the exponent (1.7.0), or all but one of the bits to the significand (1.1.6), leaving the exponent with only one bit. The exponent must be given at least one bit, or else it no longer makes sense as a float; it just becomes a signed number.

Here is a chart of all possible values for (1.3.4). M ≥ 2^(E−1) ensures that the precision remains at least 0.5 throughout the entire range.[14]

        | …0000 | …0001 | …0010 | …0011 | …0100 | …0101 | …0110 | …0111 | …1000 | …1001 | …1010 | …1011 | …1100 | …1101 | …1110 | …1111
0 000 … | 0 | 0.015625 | 0.03125 | 0.046875 | 0.0625 | 0.078125 | 0.09375 | 0.109375 | 0.125 | 0.140625 | 0.15625 | 0.171875 | 0.1875 | 0.203125 | 0.21875 | 0.234375
0 001 … | 0.25 | 0.265625 | 0.28125 | 0.296875 | 0.3125 | 0.328125 | 0.34375 | 0.359375 | 0.375 | 0.390625 | 0.40625 | 0.421875 | 0.4375 | 0.453125 | 0.46875 | 0.484375
0 010 … | 0.5 | 0.53125 | 0.5625 | 0.59375 | 0.625 | 0.65625 | 0.6875 | 0.71875 | 0.75 | 0.78125 | 0.8125 | 0.84375 | 0.875 | 0.90625 | 0.9375 | 0.96875
0 011 … | 1 | 1.0625 | 1.125 | 1.1875 | 1.25 | 1.3125 | 1.375 | 1.4375 | 1.5 | 1.5625 | 1.625 | 1.6875 | 1.75 | 1.8125 | 1.875 | 1.9375
0 100 … | 2 | 2.125 | 2.25 | 2.375 | 2.5 | 2.625 | 2.75 | 2.875 | 3 | 3.125 | 3.25 | 3.375 | 3.5 | 3.625 | 3.75 | 3.875
0 101 … | 4 | 4.25 | 4.5 | 4.75 | 5 | 5.25 | 5.5 | 5.75 | 6 | 6.25 | 6.5 | 6.75 | 7 | 7.25 | 7.5 | 7.75
0 110 … | 8 | 8.5 | 9 | 9.5 | 10 | 10.5 | 11 | 11.5 | 12 | 12.5 | 13 | 13.5 | 14 | 14.5 | 15 | 15.5
0 111 … | Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 000 … | −0 | −0.015625 | −0.03125 | −0.046875 | −0.0625 | −0.078125 | −0.09375 | −0.109375 | −0.125 | −0.140625 | −0.15625 | −0.171875 | −0.1875 | −0.203125 | −0.21875 | −0.234375
1 001 … | −0.25 | −0.265625 | −0.28125 | −0.296875 | −0.3125 | −0.328125 | −0.34375 | −0.359375 | −0.375 | −0.390625 | −0.40625 | −0.421875 | −0.4375 | −0.453125 | −0.46875 | −0.484375
1 010 … | −0.5 | −0.53125 | −0.5625 | −0.59375 | −0.625 | −0.65625 | −0.6875 | −0.71875 | −0.75 | −0.78125 | −0.8125 | −0.84375 | −0.875 | −0.90625 | −0.9375 | −0.96875
1 011 … | −1 | −1.0625 | −1.125 | −1.1875 | −1.25 | −1.3125 | −1.375 | −1.4375 | −1.5 | −1.5625 | −1.625 | −1.6875 | −1.75 | −1.8125 | −1.875 | −1.9375
1 100 … | −2 | −2.125 | −2.25 | −2.375 | −2.5 | −2.625 | −2.75 | −2.875 | −3 | −3.125 | −3.25 | −3.375 | −3.5 | −3.625 | −3.75 | −3.875
1 101 … | −4 | −4.25 | −4.5 | −4.75 | −5 | −5.25 | −5.5 | −5.75 | −6 | −6.25 | −6.5 | −6.75 | −7 | −7.25 | −7.5 | −7.75
1 110 … | −8 | −8.5 | −9 | −9.5 | −10 | −10.5 | −11 | −11.5 | −12 | −12.5 | −13 | −13.5 | −14 | −14.5 | −15 | −15.5
1 111 … | −Inf | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN

Tables like the above can be generated for any combination of S, E, M, and B (sign, exponent, mantissa/significand, and bias) values using a script in Python or in GDScript; a similar sketch is shown below.
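In the same spirit as those scripts (though this sketch is not one of them), a chart generator only needs the decoding rule; S_E = 1 is assumed:

    def value(sign, exp, man, E, M, bias):
        """Decode one field combination of a minifloat with S_E = 1."""
        s = -1.0 if sign else 1.0
        if exp == 0:                          # subnormal
            return s * (man / 2 ** M) * 2.0 ** (1 - bias)
        if exp == 2 ** E - 1:                 # reserved top exponent
            return s * float("inf") if man == 0 else float("nan")
        return s * (1 + man / 2 ** M) * 2.0 ** (exp - bias)

    def print_chart(S, E, M, bias):
        for sign in range(S + 1):
            for exp in range(2 ** E):
                row = [value(sign, exp, man, E, M, bias) for man in range(2 ** M)]
                print(f"{sign} {exp:0{E}b} ...", row)

    print_chart(1, 3, 4, 3)   # reproduces the (1.3.4) chart above (bias 3)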

6-bit (1.3.2)


With only 64 values, it is possible to plot all the values in a diagram, which can be instructive.

These graphics demonstrate arithmetic on two 6-bit (1.3.2) minifloats, following the rules of IEEE 754 exactly. Green X's mark NaN results, cyan X's +infinity results, and magenta X's −infinity results. The range of the finite results is filled with curves joining equal values, blue for positive and red for negative.

  • Addition
  • Subtraction
  • Multiplication
  • Division

4-bit (1.2.1)


The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.[15]

     | .000 | .001 | .010 | .011 | .100 | .101 | .110 | .111
0... |  0   |  0.5 |  1   |  1.5 |  2   |  3   |  Inf | NaN
1... | −0   | −0.5 | −1   | −1.5 | −2   | −3   | −Inf | NaN

This example 4-bit float is very similar to the FP4-E2M1 format used by Nvidia and the IEEE P3109 binary4p2s* formats, as described above, except for a different allocation of the special Infinity and NaN values. This example uses a traditional allocation following the same principles as other float sizes. Formats intended for application-specific use cases instead of general-purpose interoperability may prefer to use those bit patterns for finite numbers, depending on the needs of the application.

3-bit (1.1.1)


If normalized numbers are not required, the size can be reduced to 3 bits by shrinking the exponent field to 1 bit.

     | .00 | .01 | .10  | .11
0... |  0  |  1  | Inf  | NaN
1... | −0  | −1  | −Inf | NaN

2-bit (0.1.1) and 3-bit (0.2.1)


In situations where the sign bit can be excluded, each of the above examples can be reduced by 1 further bit, keeping only the first row of each table. A 2-bit float with a 1-bit exponent and a 1-bit mantissa would have only the values 0, 1, Inf, and NaN.

1-bit (0.1.0)


Removing the mantissa would allow only two values: 0 and Inf. Removing the exponent instead does not work: the formulae above would produce the values 0 and sqrt(2)/2 (with E = 0, the bias becomes B = 2^(0−1) − 0 = 0.5, so the lone nonzero mantissa pattern decodes to 0.1_2 × 2^(1−0.5) = 2^(−0.5) = sqrt(2)/2). The exponent must be at least 1 bit, or else it no longer makes sense as a float (it would just be a signed number).


References

  1. ^ Mocerino, Luca; Calimera, Andrea (24 November 2021). "AxP: A HW-SW Co-Design Pipeline for Energy-Efficient Approximated ConvNets via Associative Matching". Applied Sciences. 11 (23) 11164. doi:10.3390/app112311164.
  2. ^ a b https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/ (joint announcement by Intel, Nvidia, and Arm); https://arxiv.org/abs/2209.05433 (preprint paper jointly written by researchers from the aforementioned three companies)
  3. ^ https://www.mrob.com/pub/math/floatformats.html#microfloat Microfloats
  4. ^ Buck, Ian (13 March 2005), "Chapter 32. Taking the Plunge into GPU Computing", in Pharr, Matt (ed.), GPU Gems, Addison-Wesley, ISBN 0-321-33559-7, retrieved 5 April 2018.
  5. ^ Garrard, Andrew. "10.3. Unsigned 10-bit floating-point numbers". Khronos Data Format Specification v1.2 rev 1. Khronos Group. Retrieved 10 August 2023.
  6. ^ Garrard, Andrew. "10.2. Unsigned 11-bit floating-point numbers". Khronos Data Format Specification v1.2 rev 1. Khronos Group. Retrieved 10 August 2023.
  7. ^ "OCP 8-bit Floating Point Specification (OFP8) Revision 1.0". May 2023.
  8. ^ "Accelerate LLM Inference on Your Local PC".
  9. ^ "IEEE Working Group P3109 Interim Report on Binary Floating-point Formats for Machine Learning" (PDF). GitHub. IEEE Working Group P3109. 5 August 2025. (latest version)
  10. ^ "OCP Microscaling Formats (MX) Specification Version 1.0". Open Compute Project. Archived from the original on 24 February 2024. Retrieved 21 February 2025.
  11. ^ IEEE half-precision has 5 exponent bits with bias 15 (2^(5−1) − 1 = 15), IEEE single-precision has 8 exponent bits with bias 127 (2^(8−1) − 1 = 127), IEEE double-precision has 11 exponent bits with bias 1023 (2^(11−1) − 1 = 1023), and IEEE quadruple-precision has 15 exponent bits with bias 16383 (2^(15−1) − 1 = 16383). See the Exponent bias article for more detail.
  12. ^ O'Hallaron, David R.; Bryant, Randal E. (2010). Computer Systems: A Programmer's Perspective (2nd ed.). Boston, Massachusetts, USA: Prentice Hall. ISBN 978-0-13-610804-7.
  13. ^ Burch, Carl. "Floating-point representation". Hendrix College. Retrieved 29 August 2023.
  14. ^ "An 8-Bit Floating Point Representation" (PDF). Archived from the original (PDF) on 8 July 2024.
  15. ^ Shaneyfelt, Ted. "Dr. Shaneyfelt's Floating Point Construction Gizmo". Retrieved 29 August 2023.
