Valige vasakul üks märksõnadest ...

Numerical ComputationError

Lugemise aeg: ~55 min

Error is the discrepancy between a quantity and the value used to represent it in the program. A result is accurate if its error is small. If \widehat{A} is an approximation for A, then

  • the absolute error is \widehat{A} - A, and
  • the relative error is \frac{\widehat{A} - A}{A}.

We are usually more interested in relative error, since the relevance of an error is usually in proportion to the quantity being represented. For example, misreporting the weight of an animal by one kilogram would be much more significant if the animal were a squirrel than if it were a blue whale.

The expression sqrt(200.0), which returns the Float64-square-root of 200, yields

\begin{align*} 14.1421356237309510106570087373256683349609375\overline{0}. \end{align*}

The actual decimal representation of \sqrt{200} is

\begin{align*} 14.1421356237309504880168872420969807856967187\ldots \end{align*}

The difference between these values, 5.23 \times 10^{-16}, is the absolute error, and \frac{5.23 \times 10^{-16}}{\sqrt{200}} = 3.7\times 10^{-17} is the relative error.

Sources of Numerical Error

There are a few categories of numerical error.

Roundoff error comes from rounding numbers to fit them into a floating point representation.

0.2 + 0.1is equal to 0.300000000000000444089209850062616169452667236328125\overline{0} in Float64 arithmetic. The discrepancy between 0.3 and this value is roundoff error.

Truncation error comes from using approximate mathematical formulas or algorithms.

The Maclaurin series of \sin x is x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots, so approximating \sin(0.1) as 0.1 - \frac{0.1^3}{6} yields a truncation error equal to \frac{0.1^5}{5!} - \frac{0.1^7}{7!} + \cdots.

Newton's method approximates a zero of a function f by starting with a value x_0 near the desired zero and defining x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} for all n \geq 0.

Under certain conditions, x_n converges to a zero of f as n\to\infty. The discrepancy between x_n and \lim_{n\to\infty}x_n is the truncation error associated with stopping Newton's method at the $n$th iteration.

We may approximate \int_0^1 \sin(x^2) \mathrm{d}x using the sum \displaystyle{\sum_{k=1}^{100} \sin\left(\left(\frac{k}{100}\right)^2\right)\frac{1}{100}}. The error associated this approximation is a type of truncation error.

Statistical error arises from using randomness in an approximation.

We can approximate the average height of a population of 100,000 people by selecting 100 people uniformly at random and averaging their measured heights. The error associated with this approximation is an example of statistical error.

Discuss the error in each of the following scenarios using the terms roundoff error, truncation error, or statistical error.

  • We use the trapezoid rule with 1000 trapezoids to approximate \displaystyle \int_0^{10} \frac{1}{4+x^4} \mathrm{d}x.
  • We are trying to approximate f'(5) for some function f that we can compute, and we attempt to do so by running (f(5 + 0.5^100) - f(5))/0.5^100. We fail to get a reasonable answer.
  • To approximate the minimum of a function f:[0,1] \to \mathbb{R}, we evaluate f at 100 randomly selected points in [0,1] and return the smallest value obtained.


  • The more trapezoids we use, the more accurate our answer will be. The difference between the exact answer and the value we get when we stop at 1000 trapezoids is truncation error.
  • The real problem here is roundoff error. 5 + 0.5^100 gets rounded off to 5.0, so the numerator will always evaluate to 0. However, even if we used a BigFloat version of each of these values, there would still be truncation error in this approximation, since the expression we used was obtained by cutting off the limit in the definition of the derivative at a small but positive increment size.
  • This is an example of statistical error, since the output of the algorithm depends on the randomness we use to select the points.

Condition Number

The derivative of a single-variable function may be thought of as a measure of how the function stretches or compresses absolute error:

\begin{align*}f'(x) \approx \frac{\overbrace{f(x+h)-f(x)}^{\text{absolute error of output}}}{\underbrace{h}_\text{absolute error of input}}\end{align*}

The condition number of a function measures how it stretches or compresses relative error. Just as the derivative helps us understand how small changes in input transform to small changes in output, the condition number tells us how a small relative error in the initial data of a problem affects the relative error of the solution. We will use the variable a to denote a problem's initial data and S(a) to denote the solution of the problem with initial data a.

The condition number of a function is defined to be the absolute value of the ratio of the relative change in output of the function to a very small relative change in the input. The condition number of a problem is the condition number of the function which maps the problem's initial data to its solution.

If S is the map from the initial data a\in \mathbb{R} of a problem to its solution \mathbf{S}(a)\in \mathbb{R}^n, then the condition number \kappa of the problem is

\begin{align*}\kappa(a) = \frac{|a||\frac{d}{da}\mathbf{S}(a)|}{|\mathbf{S}(a)|}.\end{align*}

Show that the condition number of a\mapsto a^n is constant, for any n \in \mathbb{R}.

Solution. We have

\begin{align*}\kappa(a) = \frac{a na^{n-1}}{a^n} = n,\end{align*}

for all a\in \mathbb{R}.

Show that the condition number of the function a\mapsto a - 1 is very large for values of a near 1.

Solution. We substitute into the formula for condition number and get

\begin{align*}\kappa(a) = \frac{a}{|a-1|}\end{align*}

for values of a near 1. This expression goes to infinity as a \to 1, so the condition number is very large.

Subtracting 1 from two numbers near 1 preserves their difference, but the size of this difference is increased because the numbers themselves are much smaller.

If a \neq 0, then the solution of the equation ax + 1 = 0 is x = -1/a. If we change the initial data a to a(1+r), then the solution changes to -\frac{1}{a(1+r)}, which represents a relative change of

\begin{align*}\frac{-\frac{1}{a(1+r)} - \left(-\frac{1}{a}\right)}{-1/a} = -\frac{r}{1+r}\end{align*}

in the solution. The relative change in input is (a(1+r) - a)/a) = r, so taking the absolute value of the ratio of -\frac{1}{1+r} to r and sending r \to 0, we see that condition number of this problem is \boxed{1}.

Consider a function S: \mathbb{R} \to \mathbb{R}. If the input changes from a to a + \Delta a for some small value \Delta a, then the output changes to approximately S(a) + \frac{d}{da}S(a) \Delta a. Calculate the ratio of the relative change in the output to the relative change in the input, and show that you get


Solution. The relative change in output is

\begin{align*}\frac{\frac{d}{da}S(a) \Delta a}{S(a)},\end{align*}

and the relative change in input is \Delta a / a. Dividing these two quantities gives


as desired.

More generally, if the initial data is in \mathbb{R}^n and the solution is in \mathbb{R}^m, then the condition number is defined to be

\begin{align*}\kappa(\mathbf{a}) = \frac{|\mathbf{a}|\|\frac{\partial S}{\partial \mathbf{a}}(\mathbf{a})\|}{|\mathbf{S}(\mathbf{a})|},\end{align*}

where | \cdot | denotes the operator norm. The operator norm of the derivative is the appropriate generalization of the norm of the derivative of \mathbf{S} since it measures the maximum stretching factor of \mathbf{S} near \mathbf{a}.

Well conditioned and ill-conditioned problems

If the condition number of a problem is very large, then small errors in the problem data lead to large changes in the result. A problem with large condition number is said to be ill-conditioned. Unless the initial data can be specified with correspondingly high precision, it will not be possible to solve the problem meaningfully.

Example Consider the following matrix equation for x and y.

\begin{align*}\begin{bmatrix} a & 3 \\ 6 & 9 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}\end{align*}

Find the values of a for which solving this equation for [x,y] is ill-conditioned.

Solution. If a \neq 2, then the solution of this equation is

\begin{align*} \renewcommand{\arraystretch}{1.5} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \frac{7}{3 \left(a - 2\right)} \\ \frac{5 a - 24}{9 \left(a - 2\right)} \end{bmatrix}\end{align*}

Using the formula for \kappa above, we can work out (after several steps) that

\begin{align*}\kappa(a) = \frac{7|a|\sqrt{13}}{|a-2|\sqrt{(5a-24)^2+441}}.\end{align*}

If a is very close to 2, then \kappa(a) is very large, and the matrix is ill-conditioned:

[2.01 3; 6 9] \ [4; 5]
[2.02 3; 6 9] \ [4; 5]

Machine epsilon

Machine epsilon, denoted \epsilon_{\text{mach}}, is the maximum relative error associated with rounding a real number to the nearest value representable as a given floating point type. For Float64, this value is \epsilon_{\text{mach}} = 2^{-53} \approx 1.11 \times 10^{-16}.

A competing convention—more widely used outside academia—defines \epsilon_{\text{mach}} to be the difference between 1 and the next representable number, which for Float64is 2^{-52}. This is the value returned by eps() in Julia. Since we typically introduce a relative error on the order of \epsilon_{\text{mach}} to encode the initial data of a problem, the relative error of the computed solution should be expected to be no smaller than \kappa \epsilon_{\text{mach}}, regardless of the algorithm used.

Algorithm stability

An algorithm used to solve a problem is stable if it is approximately as accurate as the condition number of the problem allows. In other words, an algorithm is unstable if the answers it produces have relative error many times larger than \kappa \epsilon_{\text{mach}}.

Consider the problem of evaluating f(x) = \sqrt{1+x} - 1 near for values of x near 0. Show that the problem is well-conditioned, but algorithm of evaluating the expression \sqrt{1+x} - 1 following the order of operations is unstable.

Comment on whether there are stable algorithms for evaluating f(x) near x = 0.

Solution. Substituting this function into the condition number formula, we find that

\begin{align*}\kappa(x) = \frac{\sqrt{1+x}+1}{2\sqrt{1+x}}.\end{align*}

Therefore, \kappa(0) = 1, which means that this problem is well-conditioned at 0. However, the algorithm of substituting directly includes an ill-conditioned step: subtracting 1.

What's happening is that a roundoff error of approximately \epsilon_{\mathrm{mach}} is introduced when 1+x is rounded to the nearest Float64. When 1 is subtracted, we still have an error of around \epsilon_{\mathrm{mach}}. Since \sqrt{1+x} \approx 1 + x/2, we will have f(x) \approx x/2, and that means that the relative error in the value we find for f(x) will be approximately 2\epsilon_{\mathrm{mach}}/x. If x is small, this will be many times larger than \epsilon_{\mathrm{mach}}.

There are stable algorithms for approximating f(x) near x=0. For example, we could use the Taylor series

\begin{align*}\sqrt{1+x} = 1 + \frac{x}{2} - \frac{x^{2}}{8} + \frac{x^{3}}{16} - \frac{5 x^{4}}{128} + O\left(x^{5}\right)\end{align*}

and approximate f(x) as a sum of the first several terms on the right-hand side. Since power functions are well-conditioned (and performing the subtractions is also well-conditioned as long as x is small enough that each term is much smaller than the preceding one), this algorithm is stable. Alternatively, we can use the identity

\begin{align*}\sqrt{1+x} - 1 = \frac{x}{\sqrt{1+x}+1},\end{align*}

which can be obtained by multiplying by \frac{\sqrt{1+x} + 1}{\sqrt{1+x} + 1} and simplifying the numerator. Substituting into this expression is stable, because adding 1, square rooting, and reciprocating are well-conditioned.

Matrix condition number

The condition number of an m\times n matrix A is defined to be the maximum condition number of the function \mathbf{x} \mapsto A \mathbf{x} as \mathbf{x} ranges over \mathbb{R}^n. The condition number of A can be computed using its singular value decomposition:

Show that the condition number of a matrix A is equal to the ratio of its largest and smallest singular values.

Interpret your results by explaining how to choose two vectors with small relative difference which are mapped to two vectors with large relative difference by A, assuming that A has a singular value which is many times larger than another. Use the figure below to help with the intuition.

Solution. The derivative of the transformation \mathbf{x} \mapsto A \mathbf{x} is the matrix A itself, and the operator norm of A is equal to its largest singular value. Therefore, to maximize \kappa, we minimize the ratio |S(\mathbf{a})|/|\mathbf{a}|. This ratio is minimized when \mathbf{a} is the right singular vector with the least singular value. Therefore, the maximum possible value of \kappa is the ratio of the largest singular value of A to the smallest singular value of A.

Find the condition number of the function \mathbf{x}\mapsto A\mathbf{x}, where A = [1 2; 3 4] and show that there is a vector \mathbf{v} and an error \mathbf{e} for which the relative error is indeed magnified by approximately the condition number of A.

using LinearAlgebra
A = [1 2; 3 4]

Solution. We choose v and e to be the columns of V in the singular value decomposition of A:

using LinearAlgebra
A = [1 2; 3 4]
U, S, V = svd(A)
σmax, σmin = S
κ = σmax/σmin
v = V[:,2]
e = V[:,1]
rel_error_output = norm(A*(v+e) - A*v)/norm(A*v)
rel_error_input = norm(v + e - v) / norm(v)
rel_error_output / rel_error_input, κ


Integer or floating point arithmetic can overflow, and may do so without warning.


In September 2013, NASA lost touch with the Deep Impact space probe because systems on board tracked time as a 32-bit-signed-integer number of tenth-second increments from January 1, 2000. The number of such increments reached the maximum size of a 32-bit signed integer in August of 2013.

Errors resulting from performing ill-conditioned subtractions are called catastrophic cancellation.

Approximating \sqrt{10^6 + 1} - \sqrt{10^6} with the result of sqrt(10^6 + 1) - sqrt(10^6), we get a relative error of approximately 10^{-13}, while using 1/(sqrt(10^6 + 1) + sqrt(10^6)) gives a relative error of 5 \times 10^{-17}(more than a thousand times smaller).

Use your knowledge of floating point arithmetic to explain why computing \sqrt{10^6 + 1} - \sqrt{10^6} directly is much less precise than computing \frac{1}{\sqrt{10^6 + 1} + \sqrt{10^6}}.

Solution. The gaps between successive representable positive values get wider as we move on the right on the number line. Therefore, the error of the first calculation is the roundoff error associated with calculating sqrt(10^6+1), which is roughly 10^3 \epsilon_{\mathrm{mach}}.

The relative error in 1/(sqrt(10^6 + 1) + sqrt(10^6)), meanwhile, is approximately the same as the relative error in the calculation of sqrt(10^6 + 1) + sqrt(10^6) (since the condition number of the reciprocal function is approximately 1). This relative error is only about \epsilon_{\mathrm{mach}}.

If you rely on exact comparisons for floating point numbers, be alert to the differences between Float64 arithmetic and real number arithmetic:

function increment(n)
  a = 1.0
    for i = 1:n
      a = a + 0.01
increment(100) > 2
(increment(100) - 2) / eps(2.0)

Each time we add 0.01, we have to round off the result to represent it as a Float64. These roundoff errors accumulate and lead to a result which is two ticks to the right of 2.0.

It is often more appropriate to compare real numbers using ( \approx«tab»), which checks that two numbers x and y differ by at most \sqrt{\epsilon_{\text{mach}}}\mathrm{max}(x,y).

Guess what value the following code block returns. Run it and see what happens. Discuss why your initial guess was correct or incorrect, and suggest a value near 0.1 that you could use in place of 0.1 to get the expected behavior.

function increment_till(t, step=0.1)
  x = 0.5
    while x < t
      x += step

Solution. It's reasonable to guess that the returned value will be 1.0. However, it's actually approximately 1.1. The reason is that adding the Float64 representation of 0.1 ten times starting from 0.0 results in a number slightly smaller than 1.0. It turns out that 0.6 (the real number) is 20% of the way from the Float64 tick before it to the Float64 tick after it:

(1//10 - floor(2^53 * 1//10) // 2^53) * 2^53

This means that the Float64 sum is \frac{1}{5}2^{-53} less than the mathematical sum after adding 0.1 once, then \frac{2}{5}2^{-53} less after adding 0.1 again, and so on. By the time we we get to 1, we've lost a full tick spacing so after 5 iterations, x is equal to 1-2^{-53}.

We could use 1/8 = 0.125 instead of 0.1 to get the expected behavior, since small inverse powers of 2 and their sums with small integers can be represented exactly as 64-bit floating points numbers.

Bruno Bruno