Quadruple-precision floating-point format

From Wikipedia, the free encyclopedia

128-bit computer number format


In computing, quadruple precision (or quad precision) is a binary floating-point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.

This 128-bit quadruple precision is designed for applications needing results in higher than double precision,^[1] and as a primary function, to allow computing double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."^[2]

In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.

IEEE 754 quadruple-precision binary floating-point format: binary128

The IEEE 754 standard specifies a binary128 as having:

Sign bit: 1 bit
Exponent width: 15 bits
Significand precision: 113 bits (112 explicitly stored)

The sign bit determines the sign of the number (including when this number is zero, which is signed). "1" stands for negative.

This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.^[3]
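
The 33-digit and 36-digit bounds can be reproduced from the precision p = 113 with the usual round-trip formulas, ⌊(p − 1) log₁₀ 2⌋ and ⌈1 + p log₁₀ 2⌉. The following is a minimal sketch in Python; the function name `decimal_digit_bounds` is illustrative and not taken from any library.

```python
from math import ceil, floor, log10


def decimal_digit_bounds(p: int) -> tuple[int, int]:
    """Round-trip digit counts for a binary format with p bits of precision.

    First value: decimal digits that survive decimal -> binary -> decimal.
    Second value: decimal digits needed so binary -> decimal -> binary is exact.
    """
    survives = floor((p - 1) * log10(2))
    needed = ceil(1 + p * log10(2))
    return survives, needed


print(decimal_digit_bounds(113))  # binary128: (33, 36)
print(decimal_digit_bounds(53))   # binary64:  (15, 17)
```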

The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros (used to encode subnormal numbers and zeros). Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log₁₀(2¹¹³) ≈ 34.016) for normal values; subnormals have gracefully degrading precision down to 1 bit for the smallest non-zero value. From most significant to least significant, the bits are laid out as the sign bit (bit 127), the 15-bit exponent (bits 126–112), and the 112 stored significand bits (bits 111–0).

Exponent encoding

The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.

E_min = 0001₁₆ − 3FFF₁₆ = −16382
E_max = 7FFE₁₆ − 3FFF₁₆ = 16383
Exponent bias = 3FFF₁₆ = 16383

Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.

The stored exponents 0000₁₆ and 7FFF₁₆ are interpreted specially.

Exponent	Significand zero	Significand non-zero	Equation
0000₁₆	0,−0	subnormal numbers	(−1)^signbit × 2⁻¹⁶³⁸² × 0.significandbits₂
0001₁₆, ..., 7FFE₁₆	normalized value		(−1)^signbit × 2^{exponentbits₂ − 16383} × 1.significandbits₂
7FFF₁₆	±∞	NaN (quiet, signaling)

The minimum strictly positive (subnormal) value is 2⁻¹⁶⁴⁹⁴ ≈ 10⁻⁴⁹⁶⁵ and has a precision of only one bit. The minimum positive normal value is 2⁻¹⁶³⁸² ≈ 3.3621 × 10⁻⁴⁹³² and has a precision of 113 bits, i.e. ±2⁻¹⁶⁴⁹⁴ as well. The maximum representable value is 2¹⁶³⁸⁴ − 2¹⁶²⁷¹ ≈ 1.1897 × 10⁴⁹³².
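
The encoding rules above can be checked with exact rational arithmetic. The sketch below decodes a 128-bit pattern, held as a Python integer, into a `Fraction`. The names `from_hex` and `decode_binary128` are hypothetical helpers, and returning strings for infinities and NaN is a simplification made for this example only.

```python
from fractions import Fraction

BIAS = 16383          # exponent bias (3FFF hex)
FRACTION_BITS = 112   # explicitly stored significand bits


def from_hex(text: str) -> int:
    """Turn a spaced hexadecimal string into a 128-bit integer pattern."""
    return int(text.replace(" ", ""), 16)


def decode_binary128(bits: int):
    """Return the exact value of a binary128 bit pattern.

    Finite values come back as Fraction (so -0 is returned as 0);
    infinities and NaN come back as the strings 'inf', '-inf', 'nan'.
    """
    sign = bits >> 127
    exponent = (bits >> FRACTION_BITS) & 0x7FFF
    fraction = bits & ((1 << FRACTION_BITS) - 1)

    if exponent == 0x7FFF:                      # all ones: infinity or NaN
        if fraction:
            return "nan"
        return "-inf" if sign else "inf"

    if exponent == 0:                           # zero or subnormal: 0.fraction
        significand = Fraction(fraction, 1 << FRACTION_BITS)
        true_exponent = 1 - BIAS                # fixed at -16382
    else:                                       # normal: 1.fraction
        significand = 1 + Fraction(fraction, 1 << FRACTION_BITS)
        true_exponent = exponent - BIAS

    value = significand * Fraction(2) ** true_exponent
    return -value if sign else value


smallest = decode_binary128(1)
largest = decode_binary128(from_hex("7ffe ffff ffff ffff ffff ffff ffff ffff"))
print(smallest == Fraction(1, 2 ** 16494))      # True
print(largest == 2 ** 16384 - 2 ** 16271)       # True
```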

Quadruple precision examples

These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.

0000 0000 0000 0000 0000 0000 0000 0001₁₆ = 2⁻¹⁶³⁸² × 2⁻¹¹² = 2⁻¹⁶⁴⁹⁴
≈ 6.4751751194380251109244389582276465525 × 10⁻⁴⁹⁶⁶
(smallest positive subnormal number)

0000 ffff ffff ffff ffff ffff ffff ffff₁₆ = 2⁻¹⁶³⁸² × (1 − 2⁻¹¹²)
≈ 3.3621031431120935062626778173217519551 × 10⁻⁴⁹³²
(largest subnormal number)

0001 0000 0000 0000 0000 0000 0000 0000₁₆ = 2⁻¹⁶³⁸²
≈ 3.3621031431120935062626778173217526026 × 10⁻⁴⁹³²
(smallest positive normal number)

7ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 2¹⁶³⁸³ × (2 − 2⁻¹¹²)
≈ 1.1897314953572317650857593266280070162 × 10⁴⁹³²
(largest normal number)

3ffe ffff ffff ffff ffff ffff ffff ffff₁₆ = 1 − 2⁻¹¹³
≈ 0.9999999999999999999999999999999999037
(largest number less than one)

3fff 0000 0000 0000 0000 0000 0000 0000₁₆ = 1 (one)

3fff 0000 0000 0000 0000 0000 0000 0001₁₆ = 1 + 2⁻¹¹²
≈ 1.0000000000000000000000000000000001926
(smallest number larger than one)

4000 0000 0000 0000 0000 0000 0000 0000₁₆ = 2
c000 0000 0000 0000 0000 0000 0000 0000₁₆ = −2

0000 0000 0000 0000 0000 0000 0000 0000₁₆ = 0
8000 0000 0000 0000 0000 0000 0000 0000₁₆ = −0

7fff 0000 0000 0000 0000 0000 0000 0000₁₆ = infinity
ffff 0000 0000 0000 0000 0000 0000 0000₁₆ = −infinity

3ffd 5555 5555 5555 5555 5555 5555 5555₁₆ ≈ 0.3333333333333333333333333333333333173
(closest approximation to 1/3)

4000 921f b544 42d1 8469 898c c517 01b8₁₆ ≈ 3.1415926535897932384626433832795027975
(closest approximation to π)

4008 74d9 9564 5aa0 0c11 d0cc 9770 5e5b₁₆ ≈ 745.69987158227021999999999999999997147
(closest approximation to the number of watts corresponding to 1 horsepower)

By default, 1/3 rounds down like double precision, because of the odd number of bits in the significand. Thus, the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.
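
The rounding of 1/3 can be reproduced with a small encoder that rounds an exact rational to the nearest binary128 value, ties to even. This is a sketch under stated assumptions: `encode_binary128` and `to_hex` are illustrative names, and the encoder only handles zero and results in the normal range (it raises `ValueError` for anything that would be subnormal or overflow).

```python
from fractions import Fraction

BIAS = 16383          # exponent bias
FRACTION_BITS = 112   # explicitly stored significand bits


def encode_binary128(x: Fraction) -> int:
    """Round x to the nearest binary128 value (ties to even), normal range only."""
    if x == 0:
        return 0
    sign = 1 if x < 0 else 0
    x = abs(x)

    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1

    # Scale so the integer part holds the 113-bit significand.
    scaled = x / Fraction(2) ** (e - FRACTION_BITS)
    q, r = divmod(scaled.numerator, scaled.denominator)
    twice_r = 2 * r
    if twice_r > scaled.denominator or (twice_r == scaled.denominator and q & 1):
        q += 1                                  # round up (or tie to even)
    if q == 1 << (FRACTION_BITS + 1):           # rounding carried into a new bit
        q >>= 1
        e += 1

    if not -16382 <= e <= 16383:
        raise ValueError("result is outside the normal range")
    return (sign << 127) | ((e + BIAS) << FRACTION_BITS) | (q - (1 << FRACTION_BITS))


def to_hex(bits: int) -> str:
    """Format a 128-bit pattern as eight groups of four hex digits."""
    digits = format(bits, "032x")
    return " ".join(digits[i:i + 4] for i in range(0, 32, 4))


print(to_hex(encode_binary128(Fraction(1, 3))))
# 3ffd 5555 5555 5555 5555 5555 5555 5555
```

The discarded remainder for 1/3 is exactly one third of a unit in the last place, so the encoder rounds down and the stored fraction ends in ...5555, matching the example above.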

Double-double arithmetic

A common software technique to implement nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic.^[4]^[5]^[6] Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least^[4] 2 × 53 = 106 bits (actually 107 bits^[7] except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits,^[4] significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of 1.8 × 10³⁰⁸ for double-double versus 1.2 × 10⁴⁹³² for binary128).

In particular, a double-double/quadruple-precision value q in the double-double technique is represented implicitly as a sum q = x + y of two double-precision values x and y, each of which supplies half of q's significand.^[5] That is, the pair (x, y) is stored in place of q, and operations on q values (+, −, ×, ...) are transformed into equivalent (but more complicated) operations on the x and y values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques.^[4]^[5]
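
As an illustration of how an operation on q reduces to double-precision operations on the pair, the sketch below adds two double-double values using the classic error-free TwoSum transformation. It relies on Python's `float` being an IEEE double. The names `two_sum` and `dd_add` are illustrative, and this simple variant is not claimed to be the algorithm used by any particular library.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Return (s, err) with s = a + b rounded, and a + b == s + err exactly."""
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err


def dd_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two double-double values stored as (high, low) pairs."""
    s, err = two_sum(x[0], y[0])   # add the high parts, keep the rounding error
    err += x[1] + y[1]             # fold in the low parts
    return two_sum(s, err)         # renormalise so the low part stays small


one = (1.0, 0.0)
tiny = (2.0 ** -80, 0.0)

print(1.0 + 2.0 ** -80 == 1.0)     # True: a single double loses the small term
hi, lo = dd_add(one, tiny)
print(Fraction(hi) + Fraction(lo) == 1 + Fraction(1, 2 ** 80))  # True
```

The pair keeps the small term in its low part, which is exactly the extra precision the technique is meant to provide.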

Note that double-double arithmetic has the following special characteristics:^[8]

As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the smallest number in the normalized range is narrower than double precision. The smallest number with full precision is 1000...0₂ (106 zeros) × 2⁻¹⁰⁷⁴, or 1.000...0₂ (106 zeros) × 2⁻⁹⁶⁸. Numbers whose magnitude is smaller than 2⁻¹⁰²¹ will not have additional precision compared with double precision.
The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than a half ULP of the high-order part. If the low-order part is less than half ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the significand of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
For the reason above, it is possible to represent values like 1 + 2⁻¹⁰⁷⁴, which is the smallest representable number greater than 1.
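
The last point can be made concrete with exact arithmetic: the pair (1, 2⁻¹⁰⁷⁴) is a valid double-double, yet writing its value with a single fixed-width significand would take far more than 107 bits. A minimal sketch, in which the helper name `pair_value` is hypothetical:

```python
import math
from fractions import Fraction


def pair_value(hi: float, lo: float) -> Fraction:
    """Exact value represented by a double-double pair (hi, lo)."""
    return Fraction(hi) + Fraction(lo)


hi = 1.0
lo = math.ldexp(1.0, -1074)        # smallest positive subnormal double

print(hi + lo == 1.0)              # True: one double cannot hold the sum
value = pair_value(hi, lo)
print(value == 1 + Fraction(1, 2 ** 1074))   # True
print(value.numerator.bit_length())          # 1075 significant bits
```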

In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively. A natural extension to an arbitrary number of terms (though limited by the exponent range) is called floating-point expansions.

A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.^[9]

Implementations

Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, as of 2016, less common (see "Hardware support" below). One can use general arbitrary-precision arithmetic libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.

Computer-language support

A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming languages.

Quadruple precision is specified in Fortran as real(real128) (the module iso_fortran_env from Fortran 2008 must be used; the constant real128 is equal to 16 on most processors), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16. (Quadruple-precision REAL*16 is supported by the Intel Fortran Compiler^[10] and by the GNU Fortran compiler^[11] on x86, x86-64, and Itanium architectures, for example.)

For the C programming language, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format (binary128).^[12] Alternatively, in C/C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language (which only requires long double to be at least as precise as double), nor is it common.

As of C++23, the C++ language defines a <stdfloat> header that contains fixed-width floating-point types. Implementations of these are optional, but if supported, std::float128_t corresponds to quadruple precision.

On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision (e.g. the GNU C Compiler gcc^[13] and the Intel C++ Compiler with a /Qlong-double switch^[14]) or simply as being synonymous with double precision (e.g. Microsoft Visual C++^[15]), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double corresponds to the IEEE 754 quadruple-precision format.^[16] On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on PowerPC (as double-double^[17]^[18]^[19]) and SPARC,^[20] or the Sun Studio compilers on SPARC.^[21] Even if long double is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs,^[22] and on PowerPC as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options;^[23] and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad.^[24]

Zig provides support for it with its f128 type.^[25]

Google's work-in-progress language Carbon provides support for it with the type called f128.^[26]

As of 2024, Rust is working on adding a new f128 type for IEEE quadruple-precision 128-bit floats.^[27]

Libraries and toolboxes

The GCC quad-precision math library, libquadmath, provides __float128 and __complex128 operations.
The Boost multiprecision library Boost.Multiprecision provides a unified cross-platform C++ interface for __float128 and _Quad types, and includes a custom implementation of the standard math library.^[28]
The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra.^[29]
The DoubleFloats^[30] package provides support for double-double computations for the Julia programming language.
The doubledouble.py^[31] library enables double-double computations in Python.^[citation needed]

Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values (Real128), and 256-bit complex values (Complex256).^[citation needed]

Hardware support

IEEE quadruple precision was added to the IBM System/390 G5 in 1998,^[32] and is supported in hardware in subsequent z/Architecture processors.^[33]^[34] The IBM POWER9 CPU (Power ISA 3.0) has native 128-bit hardware support.^[23]

Native support of IEEE 128-bit floats is defined in PA-RISC 1.0,^[35] and in SPARC V8^[36] and V9^[37] architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware as of 2004.^[38]

Non-IEEE extended-precision (128 bits of storage, 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the IBM System/370 series (1970s–1980s) and was available on some System/360 models in the 1960s (System/360-85,^[39] -195, and others by special request or simulated by OS software).

The Siemens 7.700 and 7.500 series mainframes and their successors support the same floating-point formats and instructions as the IBM System/360 and System/370.

The VAX processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112 fraction bits; however, the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware; all the others emulated H Floating-point in software.

The NEC Vector Engine architecture supports adding, subtracting, multiplying and comparing 128-bit binary IEEE 754 quadruple-precision numbers.^[40] Two neighboring 64-bit registers are used. Quadruple-precision arithmetic is not supported in the vector register.^[41]

The RISC-V architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating-point arithmetic.^[42] The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.^[43]

Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement SIMD instructions, such as Streaming SIMD Extensions or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.

References

[1] Bailey, David H.; Borwein, Jonathan M. (July 6, 2009). "High-Precision Computation and Mathematical Physics" (PDF).
[2] Higham, Nicholas (2002). "Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. p. 43.
[3] Kahan, William (1 October 1987). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF).
[4] Yozo Hida, X. Li, and D. H. Bailey, Quad-Double Arithmetic: Algorithms, Implementation, and Application, Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., Library for double-double and quad-double arithmetic (2007).
[5] J. R. Shewchuk, Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18: 305–363, 1997.
[6] Knuth, D. E. The Art of Computer Programming (2nd ed.). chapter 4.2.3. problem 9.
[7] Robert Munafo. F107 and F161 High-Precision Floating-Point Data Types (2011).
[8] 128-Bit Long Double Floating-Point Data Type.
[9] sourceware.org. Re: The state of glibc libm.
[10] "Intel Fortran Compiler Product Brief (archived copy on web.archive.org)" (PDF). Su. Archived from the original on October 25, 2008. Retrieved 2010-01-23.
[11] "GCC 4.6 Release Series - Changes, New Features, and Fixes". Retrieved 2010-02-06.
[12] "ISO/IEC TS 18661-3" (PDF). 2015-06-10. Retrieved 2019-09-22.
[13] i386 and x86-64 Options (archived copy on web.archive.org), Using the GNU Compiler Collection.
[14] Intel Developer Site.
[15] MSDN homepage, about Visual C++ compiler.
[16] "Procedure Call Standard for the ARM 64-bit Architecture (AArch64)" (PDF). 2013-05-22. Archived from the original (PDF) on 2019-10-16. Retrieved 2019-09-22.
[17] RS/6000 and PowerPC Options, Using the GNU Compiler Collection.
[18] Inside Macintosh – PowerPC Numerics. Archived October 9, 2012, at the Wayback Machine.
[19] 128-bit long double support routines for Darwin. Archived 2017-11-07 at the Wayback Machine.
[20] SPARC Options, Using the GNU Compiler Collection.
[21] The Math Libraries, Sun Studio 11 Numerical Computation Guide (2005).
[22] Additional Floating Types, Using the GNU Compiler Collection.
[23] "GCC 6 Release Series - Changes, New Features, and Fixes". Retrieved 2016-09-13.
[24] Intel C++ Forums (2007).
[25] "Floats". ziglang.org. Retrieved 7 January 2024.
[26] "Carbon Language's main repository - Language design". GitHub. 2022-08-09. Retrieved 2022-09-22.
[27] Cross, Travis. "Tracking Issue for f16 and f128 float types". GitHub. Retrieved 2024-07-05.
[28] "Boost.Multiprecision – float128". Retrieved 2015-06-22.
[29] Holoborodko, Pavel (2013-01-20). "Fast Quadruple Precision Computations in MATLAB". Retrieved 2015-06-22.
[30] "DoubleFloats.jl". GitHub.
[31] "doubledouble.py". GitHub.
[32] Schwarz, E. M.; Krygowski, C. A. (September 1999). "The S/390 G5 floating-point unit". IBM Journal of Research and Development. 43 (5/6): 707–721. CiteSeerX 10.1.1.117.6711. doi:10.1147/rd.435.0707.
[33] Gerwig, G.; Wetter, H.; Schwarz, E. M.; Haess, J.; Krygowski, C. A.; Fleischer, B. M.; Kroener, M. (May 2004). "The IBM eServer z990 floating-point unit. IBM J. Res. Dev. 48". pp. 311–322.
[34] Schwarz, Eric (June 22, 2015). "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point" (PDF). Archived from the original (PDF) on July 13, 2015. Retrieved July 13, 2015.
[35] "Implementor support for the binary interchange formats". IEEE. Archived from the original on 2017-10-27. Retrieved 2021-07-15.
[36] The SPARC Architecture Manual: Version 8 (archived copy on web.archive.org) (PDF). SPARC International, Inc. 1992. Archived from the original (PDF) on 2005-02-04. Retrieved 2011-09-24. SPARC is an instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit IEEE Standard 754 floating-point as its principal data types.
[37] Weaver, David L.; Germond, Tom, eds. (1994). The SPARC Architecture Manual: Version 9 (archived copy on web.archive.org) (PDF). SPARC International, Inc. Archived from the original (PDF) on 2012-01-18. Retrieved 2011-09-24. Floating-point: The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), 16 quad-precision (128-bit) registers, or a mixture thereof.
[38] "SPARC Behavior and Implementation". Numerical Computation Guide — Sun Studio 10. Sun Microsystems, Inc. 2004. Retrieved 2011-09-24. There are four situations, however, when the hardware will not successfully complete a floating-point instruction: ... The instruction is not implemented by the hardware (such as ... quad-precision instructions on any SPARC FPU).
[39] Padegs, A. (1968). "Structural aspects of the System/360 Model 85, III: Extensions to floating-point architecture". IBM Systems Journal. 7: 22–29. doi:10.1147/sj.71.0022.
[40] Vector Engine Assembly Language Reference Manual, Chapter 4 Assembler Syntax, page 23.
[41] SX-Aurora TSUBASA Architecture Guide Revision 1.1, pp. 38, 60.
[42] RISC-V ISA Specification v. 20191213, Chapter 13, "Q" Standard Extension for Quad-Precision Floating-Point, page 79.
[43] [1] Chapter 15, p. 95.

External links

High-Precision Software Directory
QPFloat, a free software (GPL) software library for quadruple-precision arithmetic
HPAlib, a free software (LGPL) software library for quad-precision arithmetic
libquadmath, the GCC quad-precision math library
IEEE-754 Analysis, interactive web page for examining binary32, binary64, and binary128 floating-point values
