Valige vasakul üks märksõnadest ...

Numerical ComputationMachine Arithmetic

Lugemise aeg: ~60 min

Computers store all information, including numerical values, as sequences of . The type of a numeric value specifies how the underlying sequence of bits should be interpreted as a number.

You can access the bit representation of a numerical value in Julia using the function bitstring. For example, we can inspect how boolean values are represented in Julia:


You might think that boolean values would be stored using a single bit. However, as you can see from the output above, in fact they use eight bits.

In this section, we will introduce several of the most important numeric types.

Humans typically interpret a string of digits as an integer using place value: the units digit is worth 10^0, the next digit to the left is worth 10^1, and so on. Then 709 = 7 \cdot 10^2 + 0 \cdot 10^1 + 9\cdot 10^0, for example. This is called the decimal representation of a number.

We can do the same thing with 2 in place of 10: the rightmost digit is worth 2^0, the next digit is worth 2^1, and so on. Instead of 10 digits we only have two bits: 0 and 1. This is called the binary representation of a number. The binary representation of 13, for example, is 1101, since 13 = 1\cdot 2^3 + 1\cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0.

Find the binary representations of each of the following numbers: 2, 16, 20, and 100.

Solution. The binary representation of 2 is 10, of 16 is 10000, of 20 is 10100, and of 100 is 1100100.

64-bit Integers

The most common numeric types used in modern computers use 64 bits to represent each number. There are length-64 strings of 0's and/or 1's.

Therefore, with 64 bits we can represent 2^{64} numbers. For example, we could represent the integers from 0 to 2^{64}-1 by interpreting each string of 0's and 1's as a binary number. This type exists (in Julia it's called UInt64, for unsigned 64-bit integer), but most of the time it's important to be able to represent negative integers as well.

In order to make room for negatives, we will only use of these bitstrings to represent nonnegative numbers (from 0 to 2^{63}-1), and we will allocate the other half to negative integers (from -2^{63} to -1).

More precisely, for 0 \leq n \leq 2^{63}-1, we represent n using its binary representation, with leading zeros as necessary to get 64 total bits. For 1 \leq n \leq 2^{63}, we represent -n using the binary representation of 2^{64}-n. One good reason to use this convention rather than dedicating one bit to the number's sign and using the other 63 bits to represent the number's absolute value in binary is that .

For the 64-bit signed integer representation, we represent the first 2^{63}-1 values using their usual binary representation, and then we skip down to -2^{63} and count back up to -1.

The expression bitstring(+34) evaluates to


This checks out: 34 = 1\cdot 2^5 + 0 \cdot 2^4 + 0\cdot 2^3 + 0 \cdot 2^2 + 1\cdot 2^1 + 0 \cdot 2^0.

The expression bitstring(-34) evaluates to


We could check that this is the binary representation of 2^{64} - 34, but in the following exercise we will learn a trick for doing that without having to deal with all those 1's.

Show that if 1 \leq n \leq 2^{63}-1, then you can find bitstring(-n) from bitstring(n) by (i) flipping every bit, and (ii) adding 1 to the resulting number (interpreted as an integer represented in binary).

Solution. Recall that for 1 \leq n \leq 2^{63}, $-n$ is represented by the binary representation of 2^{64} - n. Representing 2^{64} - n in binary is not straightforward but 2^{64} - 1 - n is easier because 2^{64} - 1 in binary is just 64 ones. This means that for any 1 \leq n \leq 2^{63} - 1, we can get 2^{64} - 1 - n in binary by changing a 1 to a 0 in the binary representation of 2^{64} - 1 at every position where n in binary has a 1. In other words, to get 2^{64} - 1 - n in binary, we just flip the binary representation of n. Since

\begin{align*} 2^{64} - n = 2^{64} - 1 - n + 1, \end{align*}

we need to add 1 after flipping the bits.

64-bit Floating Point Numbers

Integer types are appropriate for calculations which only involve integer operations (multiplication, addition, and negation), but the only integers with integer reciprocals are . So performing calculations involving division requires a new number type.

Let's visualize our number system by placing a tick mark on the number line for each number we're representing. One simple idea is to choose some small value \epsilon and represent integer multiples of \epsilon:

We can represent non-integer values by interpreting each integer n as XEQUATIONX4476XEQUATIONX.

These are called fixed point numbers, and they have some uses (for example, in financial applications where all monetary values can safely be assumed to be a multiple of one cent).

The problem with fixed point numbers for most scientific computing applications is that when we get very small numbers (for example, when taking the reciprocal of very large numbers), they have to be rounded off a lot relative to the size of the number. For example, if \epsilon = 0.001, then the reciprocal of 1999 would have to be rounded from approximately 0.0005 to 0.001. If that value is then multiplied by, say, 8000, then the round-off would result in a product of 8 instead of approximately 4.

We could address that problem by making \epsilon smaller, but we encounter a tradeoff between representing small numbers accurately and being able to represent large numbers at all. Furthermore, if \epsilon is very small, then large numbers are being represented with unnecessarily high precision (relative to the size of the number). The way out of this tradeoff is to relax the fixed-width increment and using a variable-sized gap between representable numbers. Such number systems are called floating point systems.

One way to represent numbers more densely near zero is to put equally spaced tick marks between 1 and 2, and then scale that interval up repeatedly into [2,4), then [4,8), then [8,16), and so on, and also scale it down to [1/2,1), [1/4,1/2), and so on. Here's an example of such a scheme: we place 8 tick marks between 1 and 2, and then we scale that interval's worth of tick marks four times by a factor of 2, and also 3 times by a factor of \frac{1}{2}.

We can see on the left edge of the (top) picture that we didn't cover zero! There is also quite a large gap between 0 and the smallest representable number in this system. So we replace the leftmost interval between successive powers of 2 with ticks twice as far apart so that they reach down to 0. The locations of these new ticks are called subnormal numbers.

If we have 64 bits, we can do the same thing but on a grander scale. Rather than subdividing the interval [1,2) into 2^3 = 8 equal-length intervals, we use 2^{52} = 4503599627370496 intervals. Rather than scaling than interval up and down just a few times in each direction, we scale up 1023 times—covering every binary interval up to [2^{1023}, 2^{1024})—and down 1022 times—covering every binary interval down to [2^{-1022}, 2^{-1021}). Finally, the subnormal numbers will be equally spaced between 0 and 2^{-1022}.

We can accomplish all of this if we dedicate 11 bits to indicating the index e of the binary interval we're in—starting from e=0 for numbers in the subnormal range and going up to e=2046 = 2^{11}-2 for the last interval [2^{1023}, 2^{1024})—and 52 bits for indicating the index f of the tick within that interval. For example, the number 4 + 18\cdot 2^{-50} would correspond to and .

The nonnegative representable numbers are laid out as shown (figure not drawn to scale!):

The tick marks indicate the positive values which are representable as 64-bit floating point numbers. There are 2^{52} of them between any two successive powers of 2 from XEQUATIONX4477XEQUATIONX up to XEQUATIONX4478XEQUATIONX, and the interval from 0 to XEQUATIONX4479XEQUATIONX also contains 2^{52} representable values (these are the subnormal numbers, indicated in blue).

That leaves us with one bit, which we call \sigma, to indicate the sign of the value we're representing. Also, note that we're leaving out one possible e value (the last one, e = 2047); more on that later.

We can use this description to write down formulas for the real number values represented by each string of 64 bits. We define \sigma to be the first bit of the string, e to be the next 11 bits interpreted as a binary integer, and f to be the remaining 52 bits interpreted as a binary integer. If 0 < e < 2047, then the string represents the number


Note that the exponent e-1023 ranges from -1022 up to 1023 as e ranges from 1 to 2046. If e = 0, then the string represents

\begin{align*} (-1)^\sigma \left(0+\left(\frac{1}{2}\right)^{52}f\right) \cdot 2^{-102\color{red}{2}}, \end{align*}

These are the .

Another way to visualize the floating point number system is to graph the floating point value associated with each binary string against its value as an unsigned integer. Here's an example for an 8-bit floating point scheme, with one sign bit, three bits for the exponent, and four bits for the mantissa. Subnormal numbers are shown in red, and powers of 2 are shown in gold.

We appropriate the last value of e for a special meaning: if e = 2047, then the string represents one of the special values Inf or NaN, depending on the value of f.

bitstring(-0.75) returns


The leading 1 indicates that the number is negative, the next eleven digits 01111111110 give e = 1022, and interpreting the remaining 52 digits as a base-2 fraction (0.1000\ldots)_2 gives f = \frac{1}{2}. So the value represented is

\begin{align*}(-1)\left(1+\frac{1}{2}\right)2^{1022-1023} = -\frac{3}{4}.\end{align*}

Thus -0.75 can be represented exactly as a Float64.

Show that 0.1 cannot be represented exactly as a Float64.

Solution. By construction, every representable Float64 is a rational number whose denominator is a power of 2. Therefore, when a representable Float64 is written as a fraction and simplified, its denominator is a power of 2. Since \frac{1}{10} does not fit this description, it is not representable as a Float64.

The Julia function nextfloat returns the smallest representable value which is larger than its argument. What value will be returned by the code below?

log2(nextfloat(11.0) - 11.0)

Solution. The difference between 11 and the next Float64 is 2^{-52 + 3} = 2^{-49}, since 11 is in the binary interval [2^{3}, 2^4), which is 3 binary intervals to the right of [1,2]. So the value returned will be -49.

32-bit Floating Point Numbers

Each 32-bit string represents the value

\begin{align*} (-1)^\sigma\left(1+\left(\frac{1}{2}\right)^{23}f\right)\cdot2^{e-127}, \end{align*}

where \sigma is the first bit of the string, e is the next 8 bits interpreted as a binary integer, and f is the remaining 23 bits are interpreted as a binary integer. In other words, there are equally spaced values represented between 1 and 2, and the same number of values in the interval [2,4), and so on, as well as [1/2,1), [1/4,1/2), and so on.

bitstring(Float32(-0.75)) returns 10111111010000000000000000000000

Find the positive difference between 1 and the first number greater than 1 which is representable as a Float32.

Solution. We can represent 1 by choosing f = 0 and e = 127. To represent the largest possible number less than 1, we let the 23-bit string representing f be 22 zeros followed by a 1. So the answer is 2^{-23}.

We can check this in Julia using nextfloat(Float32(1.0)) - 1, and indeed it returns a value which is equal to 2^{-23}: Float32(2)^(-23) == nextfloat(Float32(1.0)) - 1 returns true.

Arbitrary-Precision Numbers

Sometimes you might want store a number without being constrained to 64 bits, or even 128 or 256 bits. It is possible to define a type which uses an extensible number of bits (depending on the size of the integer or precision of the real number being stored). These types are called bignums, and in Julia they're called BigInt and BigFloat.

(i) Arbitrary-precision arithmetic is helpful for inspecting the behavior of lower precision formats. Find the exact value of the difference between 1.1 and the Float64 value nearest 1.1. (Hint: big(1.1) interprets 1.1 as a Float64 value—look at that value and mentally subtract 1.1).

(ii) Confirm that calculating the decimal representation of 2^{100,000} is no problem with big number arithmetic. Convert the resulting sequence of decimal digits to a string and find its length.

Solution. big(1.1) returns a number which is 8.8817841970012523233890533447265624 \times 10^{-17} more than 1.1.

length(string(big(2)^100000)) returns 30{,}103, so that's how many digits 2^{100,000} has.

To obtain numerical values of other types in Julia, use parse or big.


Float64 and Int64 operations are performed in hardware, meaning that they use instructions programmed directly into the computer's microprocessor. They are much faster and more memory efficient than arbitrary precision arithmetic, which has to be done in software.

Exercise Run the cell below to figure out about how many times slower BigInt addition is relative to Int64 addition.

a = big.(collect(1:100_000))
b = collect(1:100_000)
s = @elapsed(sum(a))
t = @elapsed(sum(b))

Solution. Outcomes can vary quite a bit, but you should find that the bignum operation typically takes 10-20 times as long.

General Comments

Choice of numerical representation depends on the application at hand, since each has its advantages and disadvantages. Int64arithmetic is actually modular arithmetic with a modulus of 2^{64}. This means that Int64 arithmetic is exact unless our calculation takes us outside the window of representable values. Basic Float64operations return the same number as computing the mathematical result of the operation and rounding to the nearest Float64-representable value (or one of the two nearest ones if it's halfway between).

Without using a computer, perform the operations in the expression (1.0 + 0.4/2^{52}) + 0.4/2^{52} using Float64arithmetic. Repeat with the expression 1.0 + (0.4/2^{52} + 0.4/2^{52}). Then check your findings by evaluating these expressions in Julia.

Solution. The first expression evaluates to 1.0, since adding 0.4/2^{52} only gets you 40% of the way to the next representable value. The second expression evaluates to 1 + 2^{-52}. And indeed, (1.0 + 0.4/2^52) + 0.4/2^52 == 1.0 + (0.4/2^52 + 0.4/2^52) returns false.

A numeric literal is a sequence of characters to be parsed and interpreted as a numerical value.

  • Numeric literals with a decimal point are real literals.
  • Numeric literals without a decimal point are integer literals.

In the expression 2.718^50+1, 2.718 is a real literal, and 50 and 1 are both integer literals.

Integer literals are interpreted as values, and real literals are interpreted as values.

2^100 returns 0, since 2 and 100 are interpreted as Int64values, and 2^{100} is equivalent to 0 modulo 2^{64}.

Explain why it is never necessary to use a BigInt for a loop counter (that is, a variable which starts at 0 or 1 and is incremented by 1 each time the body of the loop runs).

Solution. An Int64 can store values as large as 2^{63}-1. Assuming optimistically that the body of the loop can execute in a nanosecond, the loop could run for nearly 300 years before exhausting the positive Int64 values. Therefore, we do not need BigInt values for a loop counter.

Bruno Bruno