
Round-off error

For the acrobatic movement, roundoff, see Roundoff.

In computing, a roundoff error,[1] also called rounding error,[2] is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic.[3] Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error.[4] When using approximation equations or algorithms, especially when using finitely many digits to represent real numbers (which in theory have infinitely many digits), one of the goals of numerical analysis is to estimate computation errors.[5] Computation errors, also called numerical errors, include both truncation errors and roundoff errors.

When a sequence of calculations is made with an input involving roundoff error, errors may accumulate, sometimes dominating the calculation. In ill-conditioned problems, significant error may accumulate.[6]

In short, there are two major facets of roundoff errors involved in numerical calculations:[7]

  1. The ability of computers to represent both magnitude and precision of numbers is inherently limited.
  2. Certain numerical manipulations are highly sensitive to roundoff errors. This can result both from mathematical considerations and from the way in which computers perform arithmetic operations.

Representation error

The error introduced by attempting to represent a number using a finite string of digits is a form of roundoff error called representation error.[8] Here are some examples of representation error in decimal representations:

Notation | Representation | Approximation | Error
1/7 | 0.142 857 142 857... | 0.142 857 | 0.000 000 142 857...
ln 2 | 0.693 147 180 559 945 309 41... | 0.693 147 | 0.000 000 180 559 945 309 41...
log10 2 | 0.301 029 995 663 981 195 21... | 0.3010 | 0.000 029 995 663 981 195 21...
∛2 | 1.259 921 049 894 873 164 76... | 1.259 92 | 0.000 001 049 894 873 164 76...
√2 | 1.414 213 562 373 095 048 80... | 1.414 21 | 0.000 003 562 373 095 048 80...
e | 2.718 281 828 459 045 235 36... | 2.718 281 828 459 045 | 0.000 000 000 000 000 235 36...
π | 3.141 592 653 589 793 238 46... | 3.141 592 653 589 793 | 0.000 000 000 000 000 238 46...

Increasing the number of digits allowed in a representation reduces the magnitude of possible roundoff errors, but any representation limited to finitely many digits will still cause some degree of roundoff error for uncountably many real numbers. Additional digits used for intermediary steps of a calculation are known as guard digits.[9]

Rounding multiple times can cause error to accumulate.[10] For example, if 9.945309 is rounded to two decimal places (9.95), then rounded again to one decimal place (10.0), the total error is 0.054691. Rounding 9.945309 to one decimal place (9.9) in a single step introduces less error (0.045309). This can occur, for example, when software performs arithmetic in x86 80-bit floating-point and then rounds the result to IEEE 754 binary64 floating-point.
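For illustration, this double-rounding effect can be reproduced with Python's decimal module (a minimal sketch, not part of the article; the round-half-even rounding mode of IEEE 754 is assumed):

```python
from decimal import Decimal, ROUND_HALF_EVEN

x = Decimal("9.945309")

# Round twice: first to two decimal places, then to one.
r2 = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)   # 9.95
r2 = r2.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)   # 10.0
print(r2, r2 - x)    # 10.0, total error 0.054691

# Round once, directly to one decimal place.
r1 = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)    # 9.9
print(r1, r1 - x)    # 9.9, error -0.045309
```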

Floating-point number system

Compared with the fixed-point number system, the floating-point number system is more efficient in representing real numbers, so it is widely used in modern computers. While the real numbers $\mathbb{R}$ are infinite and continuous, a floating-point number system $F$ is finite and discrete. Thus, representation error, which leads to roundoff error, occurs under the floating-point number system.

Notation of floating-point number system

A floating-point number system $F$ is characterized by $4$ integers: the base (or radix) $\beta$, the precision $p$ (the number of digits in the significand), and the exponent range $[L, U]$, where $L$ is the smallest and $U$ the largest allowed exponent.

Any $x \in F$ has the following form:
$$x = \pm(\underbrace{d_0 . d_1 d_2 \ldots d_{p-1}}_{\text{significand}})_{\beta} \times \beta^{\overbrace{E}^{\text{exponent}}} = \pm d_0 \times \beta^{E} + d_1 \times \beta^{E-1} + \ldots + d_{p-1} \times \beta^{E-(p-1)}$$
where $d_i$ is an integer such that $0 \le d_i \le \beta - 1$ for $i = 0, 1, \ldots, p-1$, and $E$ is an integer such that $L \le E \le U$.
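As an illustration of this notation, the following sketch (not from the article; the helper name fl and the default parameters are assumptions) rounds a positive real number into a toy decimal system with base β and precision p:

```python
from fractions import Fraction

def fl(x, beta=10, p=4, chop=False):
    """Map a positive real x to the toy floating-point system F(beta, p):
    a p-digit significand d0.d1...d_{p-1} times beta**E, with d0 != 0.
    The exponent bounds L and U are ignored in this sketch."""
    x, B = Fraction(x), Fraction(beta)
    E = 0
    while x >= B ** (E + 1):          # normalize so that beta**E <= x < beta**(E+1)
        E += 1
    while x < B ** E:
        E -= 1
    scaled = x / B ** (E - (p - 1))   # the digit string d0 d1 ... d_{p-1} . d_p ...
    digits = int(scaled) if chop else round(scaled)   # truncate, or round half to even
    return digits * B ** (E - (p - 1))

print(fl(Fraction(1, 3)))        # 3333/10000, i.e. 0.3333
print(float(fl(9.945309, p=3)))  # 9.95
```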

Normalized floating-point number system

A floating-point number system is normalized if the leading digit $d_0$ is always nonzero unless the number represented is zero, so that each nonzero number in $F$ has a unique representation.

IEEE standard

In the IEEE standard the base is binary, i.e. $\beta = 2$, and normalization is used. The IEEE standard stores the sign, exponent, and significand in separate fields of a floating-point word, each of which has a fixed width (number of bits). The two most commonly used levels of precision for floating-point numbers are single precision and double precision.

Precision | Sign (bits) | Exponent (bits) | Trailing significand field (bits)
Single | 1 | 8 | 23
Double | 1 | 11 | 52
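These field widths can be inspected directly from the bit pattern of a binary64 value; the following sketch (an illustration, not part of the article) unpacks the three fields of a Python float, which is an IEEE 754 double:

```python
import struct

def binary64_fields(x):
    """Return the sign (1 bit), biased exponent (11 bits) and trailing
    significand (52 bits) of an IEEE 754 binary64 value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    significand = bits & ((1 << 52) - 1)
    return sign, exponent, significand

s, e, m = binary64_fields(9.4)
# Prints: 0 3 0010110011001100110011001100110011001100110011001101
print(s, e - 1023, format(m, "052b"))
```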

Machine epsilon

Machine epsilon can be used to measure the level of roundoff error in the floating-point number system. Here are two different definitions.[3] In the first, the machine epsilon $\epsilon_{\text{mach}}$ is the maximum possible relative error incurred when a nonzero real number is represented in the floating-point system, i.e. $\epsilon_{\text{mach}} = \max_{x} |x - fl(x)|/|x|$. In the second, $\epsilon_{\text{mach}}$ is the distance between $1$ and the next larger floating-point number, which equals $\beta^{1-p}$.
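As a quick illustration of the second definition for IEEE double precision ($\beta = 2$, $p = 53$), the following sketch (not part of the article) recovers the gap between 1 and the next larger floating-point number:

```python
import sys

# The gap between 1.0 and the next larger double is 2**-52 = beta**(1-p),
# which is what sys.float_info.epsilon reports.
print(sys.float_info.epsilon == 2.0 ** -52)   # True

# The same value found by repeated halving: the last eps whose half still
# changes 1.0 when added to it (under round-to-nearest).
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)   # 2.220446049250313e-16
```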

Roundoff error under different rounding rules

There are two common rounding rules, round-by-chop and round-to-nearest. The IEEE standard uses round-to-nearest.

  • Round-by-chop: The base-$\beta$ expansion of $x$ is truncated after the $(p-1)$-th digit.
    • This rounding rule is biased because it always moves the result toward zero.
  • Round-to-nearest: $fl(x)$ is set to the nearest floating-point number to $x$. When there is a tie, the floating-point number whose last stored digit is even (that is, whose last digit in binary form is equal to 0) is used.
    • For the IEEE standard, where the base $\beta$ is $2$, this means that when there is a tie the number is rounded so that the last digit is equal to $0$.
    • This rounding rule is more accurate but more computationally expensive.
    • Rounding so that the last stored digit is even when there is a tie ensures that the result is not rounded up or down systematically. This avoids an unwanted slow drift in long calculations due simply to biased rounding.
  • The following example illustrates the level of roundoff error under the two rounding rules.[3] The rounding rule round-to-nearest leads to less roundoff error in general. A short sketch reproducing this table with Python's decimal module follows the table.
x | Round-by-chop | Roundoff error | Round-to-nearest | Roundoff error
1.649 | 1.6 | 0.049 | 1.6 | 0.049
1.650 | 1.6 | 0.050 | 1.6 | 0.050
1.651 | 1.6 | 0.051 | 1.7 | −0.049
1.699 | 1.6 | 0.099 | 1.7 | −0.001
1.749 | 1.7 | 0.049 | 1.7 | 0.049
1.750 | 1.7 | 0.050 | 1.8 | −0.050
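The table above can be reproduced with Python's decimal module (an illustrative sketch, not part of the article; ROUND_DOWN stands in for round-by-chop and ROUND_HALF_EVEN for round-to-nearest with ties to even):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def fl(x, rounding):
    """Keep a two-digit significand (beta = 10, p = 2), e.g. 1.650 -> 1.6."""
    return Decimal(x).quantize(Decimal("0.1"), rounding=rounding)

for x in ["1.649", "1.650", "1.651", "1.699", "1.749", "1.750"]:
    chop = fl(x, ROUND_DOWN)        # round-by-chop: truncate toward zero
    near = fl(x, ROUND_HALF_EVEN)   # round-to-nearest, ties to even digit
    print(x, chop, Decimal(x) - chop, near, Decimal(x) - near)
```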

Calculating roundoff error in IEEE standard

Suppose round-to-nearest and IEEE double precision are used, and consider the decimal number 9.4. Its binary expansion is infinite and repeating: $9.4 = (1001.\overline{0110})_2 = (1.001\overline{0110})_2 \times 2^{3}$, so it cannot be stored exactly in the 53 available significand bits.

Since the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits, the round-to-nearest rule requires rounding up, that is, adding 1 to the 52nd bit. Thus, the normalized floating-point representation in the IEEE standard of 9.4 is
$$fl(9.4) = 1.0010110011001100110011001100110011001100110011001101 \times 2^{3}.$$

This representation is derived by discarding the infinite tail $0.\overline{1100} \times 2^{-52} \times 2^{3} = 0.\overline{0110} \times 2^{-51} \times 2^{3} = 0.4 \times 2^{-48}$ and then adding $1 \times 2^{-52} \times 2^{3} = 2^{-49}$ in the rounding step.

Then $fl(9.4) = 9.4 - 0.4 \times 2^{-48} + 2^{-49} = 9.4 + (0.2)_{10} \times 2^{-49}$.
Thus, the roundoff error is $(0.2 \times 2^{-49})_{10}$.
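This error can be confirmed exactly with Python's fractions module (an illustrative check, not from the article): Fraction(9.4) is the exact rational value of the stored binary64 number, and subtracting the exact value 47/5 recovers the roundoff error computed above.

```python
from fractions import Fraction

error = Fraction(9.4) - Fraction(47, 5)     # fl(9.4) - 9.4, exactly
print(error == Fraction(1, 5 * 2 ** 49))    # True: the error is 0.2 * 2**-49
print(float(error))                         # roughly 3.55e-16
```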


Measuring roundoff error by using machine epsilon

The machine epsilon $\epsilon_{\text{mach}}$ can be used to measure the level of roundoff error when using the two rounding rules above. Below are the formulas and the corresponding proof.[3] The first definition of machine epsilon is used here.

Theorem

  1. Round-by-chop: $\epsilon_{\text{mach}} = \beta^{1-p}$
  2. Round-to-nearest: $\epsilon_{\text{mach}} = \frac{1}{2}\beta^{1-p}$

Proof

Let $x = d_0.d_1 d_2 \ldots d_{p-1} d_p \ldots \times \beta^{n} \in \mathbb{R}$ where $n \in [L, U]$, and let $fl(x)$ be the floating-point representation of $x$. Since round-by-chop is being used,
$$\begin{aligned}
\frac{|x - fl(x)|}{|x|} &= \frac{|d_0.d_1 d_2 \ldots d_{p-1} d_p d_{p+1} \ldots \times \beta^{n} - d_0.d_1 d_2 \ldots d_{p-1} \times \beta^{n}|}{|d_0.d_1 d_2 \ldots \times \beta^{n}|} \\
&= \frac{|d_p.d_{p+1} \ldots \times \beta^{n-p}|}{|d_0.d_1 d_2 \ldots \times \beta^{n}|} \\
&= \frac{|d_p.d_{p+1} d_{p+2} \ldots|}{|d_0.d_1 d_2 \ldots|} \times \beta^{-p}
\end{aligned}$$
In order to determine the maximum of this quantity, one needs the maximum of the numerator and the minimum of the denominator. Since $d_0 \neq 0$ (normalized system), the minimum value of the denominator is $1$. The numerator is bounded above by $(\beta - 1).(\beta - 1)\overline{(\beta - 1)} = \beta$. Thus,
$$\frac{|x - fl(x)|}{|x|} \le \frac{\beta}{1} \times \beta^{-p} = \beta^{1-p}.$$
Therefore, $\epsilon_{\text{mach}} = \beta^{1-p}$ for round-by-chop. The proof for round-to-nearest is similar.

  • Note that the first definition of machine epsilon is not quite equivalent to the second definition when using the round-to-nearest rule but it is equivalent for round-by-chop.

Roundoff error caused by floating-point arithmetic

Even though some numbers can be represented exactly by floating-point numbers (such numbers are called machine numbers), performing floating-point arithmetic on them may still lead to roundoff error in the final result.

Addition

Machine addition consists of lining up the decimal points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision but the result must be rounded back to the specified precision, which may lead to roundoff error.[3]

Roundoff error can be introduced when adding a large number and a small number: the shifting of the decimal points in the significands to make the exponents match causes the loss of some of the less significant digits of the smaller operand. This loss of precision may be described as absorption.[11]

Note that the addition of two floating-point numbers can also produce roundoff error when their sum is an order of magnitude greater than the larger of the two: for example, with three significant decimal digits, 7.77 + 7.77 = 15.54 must be rounded to 15.5.

This kind of error can occur alongside an absorption error in a single operation.
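A minimal sketch of absorption in IEEE double precision follows (not from the article; the specific operands are chosen only for illustration):

```python
# With a 53-bit significand, adding 1.0 to 2**53 has no effect: the unit bit
# of the exact sum cannot be stored, so the small addend is absorbed.
big = 2.0 ** 53
print(big + 1.0 == big)          # True
print((big + 1.0) - big)         # 0.0, although the exact answer is 1.0

# A small addend may also be only partially absorbed:
print((1.0e16 + 2.5) - 1.0e16)   # 2.0, not 2.5 (the spacing of doubles near 1e16 is 2.0)
```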

Multiplication

In general, the product of two p-digit significands contains up to 2p digits, so the result might not fit in the significand.[3] Thus roundoff error will be involved in the result.
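For example, in a toy decimal system with p = 3 significant digits (a sketch using Python's decimal module, not part of the article), the six-digit exact product must be rounded back to three digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 3                     # p = 3 significant decimal digits
print(Decimal("1.23") * Decimal("4.56"))  # 5.61; the exact product is 5.6088, roundoff error 0.0012
```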

Division

In general, the quotient of two p-digit significands may contain more than p digits. Thus roundoff error will be involved in the result.
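Similarly, in the same toy three-digit decimal system (an illustrative sketch, not part of the article), a quotient with infinitely many digits is rounded to p digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 3           # p = 3 significant decimal digits
print(Decimal(1) / Decimal(3))  # 0.333; the exact quotient is 0.333..., roundoff error about 3.3e-4
```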

Subtraction

Absorption also applies to subtraction.

Subtracting two nearly equal numbers is called subtractive cancellation.[3] When the leading digits are cancelled, the result may be too small to be represented exactly and will just be represented as $0$.

Even with a somewhat larger $\epsilon$, the result is still significantly unreliable in typical cases. There is not much confidence in the accuracy of the value because the greatest uncertainty in any floating-point number is in the digits on the far right.

This is closely related to the phenomenon of catastrophic cancellation, in which the two numbers are known to be approximations.
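A minimal sketch of subtractive cancellation in double precision follows (not from the article; the operands are chosen only for illustration):

```python
# The operands agree in all but the last few bits, so the leading digits
# cancel and only the already-rounded trailing bits remain.
x = 1.0 + 1e-15      # stored as 1 + 5 * 2**-52, not exactly 1 + 1e-15
y = 1.0
print(x - y)         # 1.1102230246251565e-15: about 11% relative error versus 1e-15
```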

Accumulation of roundoff error

Errors can be magnified or accumulated when a sequence of calculations is applied to an initial input that carries roundoff error due to inexact representation.
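As a simple illustration (not from the article; the loop count is arbitrary), repeatedly adding the machine approximation of 0.1 lets its representation error, together with the rounding of each addition, accumulate:

```python
# 0.1 is not a machine number; its representation error and the rounding of
# each addition accumulate over many operations.
total = 0.0
for _ in range(10_000):
    total += 0.1
print(total == 1000.0)        # False
print(abs(total - 1000.0))    # nonzero; far larger than the error of any single step
```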

Unstable algorithms

An algorithm or numerical process is called stable if small changes in the input only produce small changes in the output, and unstable if large changes in the output are produced.[12] For example, the computation of $f(x) = \sqrt{1+x} - 1$ using the "obvious" method is unstable near $x = 0$ due to the large error introduced in subtracting two similar quantities, whereas the equivalent expression $f(x) = \frac{x}{\sqrt{1+x}+1}$ is stable.[12]
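A sketch comparing the two forms in double precision (not from the article; the test value of x is arbitrary):

```python
import math

def f_naive(x):
    """sqrt(1+x) - 1 computed directly; cancels badly for small x."""
    return math.sqrt(1.0 + x) - 1.0

def f_stable(x):
    """Equivalent form x / (sqrt(1+x) + 1); avoids the cancellation."""
    return x / (math.sqrt(1.0 + x) + 1.0)

x = 1e-16
print(f_naive(x))    # 0.0: 1.0 + x already rounds to 1.0, so all information is lost
print(f_stable(x))   # 5e-17, essentially the true value x/2
```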

Ill-conditioned problems

Even if a stable algorithm is used, the solution to a problem may still be inaccurate due to the accumulation of roundoff error when the problem itself is ill-conditioned.

The condition number of a problem is the ratio of the relative change in the solution to the relative change in the input.[3] A problem is well-conditioned if small relative changes in input result in small relative changes in the solution. Otherwise, the problem is ill-conditioned.[3] In other words, a problem is ill-conditioned if its condition number is "much larger" than 1.

The condition number is introduced as a measure of the roundoff errors that can result when solving ill-conditioned problems.[7]
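As an illustration of an ill-conditioned problem (a sketch, not from the article; solve2x2 is a hypothetical helper), consider two nearly parallel lines: a relative change of about 0.005% in one input moves the solution by roughly 100%.

```python
def solve2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e and c*x + d*y = f by Cramer's rule."""
    det = a * d - b * c
    return (e * d - b * f) / det, (a * f - e * c) / det

print(solve2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0001))  # approximately (1, 1)
print(solve2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0002))  # approximately (0, 2)
```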

References

  1. ^ Butt, Rizwan (2009), Introduction to Numerical Analysis Using MATLAB, Jones & Bartlett Learning, pp. 11–18, ISBN 978-0-76377376-2
  2. ^ Ueberhuber, Christoph W. (1997), Numerical Computation 1: Methods, Software, and Analysis, Springer, pp. 139–146, ISBN 978-3-54062058-7
  3. ^ a b c d e f g h i j Forrester, Dick (2018). Math/Comp241 Numerical Methods (lecture notes). Dickinson College.
  4. ^ Aksoy, Pelin; DeNardis, Laura (2007), Information Technology in Theory, Cengage Learning, p. 134, ISBN 978-1-42390140-2
  5. ^ Ralston, Anthony; Rabinowitz, Philip (2012), A First Course in Numerical Analysis, Dover Books on Mathematics (2nd ed.), Courier Dover Publications, pp. 2–4, ISBN 978-0-48614029-2
  6. ^ Chapman, Stephen (2012), MATLAB Programming with Applications for Engineers, Cengage Learning, p. 454, ISBN 978-1-28540279-6
  7. ^ a b Chapra, Steven (2012). Applied Numerical Methods with MATLAB for Engineers and Scientists (3rd ed.). McGraw-Hill. ISBN 9780073401102.
  8. ^ Laplante, Philip A. (2000). Dictionary of Computer Science, Engineering and Technology. CRC Press. p. 420. ISBN 978-0-84932691-2.
  9. ^ Higham, Nicholas John (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics (SIAM). pp. 43–44. ISBN 978-0-89871521-7.
  10. ^ Volkov, E. A. (1990). Numerical Methods. Taylor & Francis. p. 24. ISBN 978-1-56032011-1.
  11. ^ Biran, Adrian B.; Breiner, Moshe (2010). "5". What Every Engineer Should Know About MATLAB and Simulink. Boca Raton, Florida: CRC Press. pp. 193–194. ISBN 978-1-4398-1023-1.
  12. ^ a b Collins, Charles (2005). "Condition and Stability" (PDF). Department of Mathematics, University of Tennessee. Retrieved 2018-10-28.

Further reading

  • Matt Parker (2021). Humble Pi: When Math Goes Wrong in the Real World. Riverhead Books. ISBN 978-0593084694.
