Subsection 6.2.2 Error in storing a real number as a floating point number
Remark 6.2.2.1.
We consider the case where a real number is truncated to become the stored floating point number. This makes the discussion a bit simpler.
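Let positive \(\chi \) be represented by
\begin{equation*} \chi = . \delta_0 \delta_1 \cdots \times 2^e, \end{equation*}
where \(\delta_i \) are binary digits and \(\delta_0 = 1 \) (the mantissa is normalized). If \(t \) binary digits are stored by our floating point system, then
\begin{equation*} \check \chi = . \delta_0 \delta_1 \cdots \delta_{t-1} \times 2^e \end{equation*}
is stored (if truncation is employed). If we let \(\delta\!\chi = \chi - \check \chi \text{,}\) then
\begin{equation*} \delta\!\chi = . \delta_0 \cdots \delta_{t-1} \delta_t \cdots \times 2^e - . \delta_0 \cdots \delta_{t-1} \times 2^e = . \underbrace{0 \cdots 0}_{t \mbox{ digits}} \delta_t \cdots \times 2^e \leq . \underbrace{0 \cdots 0 1}_{t \mbox{ digits}} \times 2^e = 2^{-t} 2^e . \end{equation*}
Since \(\chi \) is positive and \(\delta_0 = 1 \text{,}\)
\begin{equation*} \chi = . \delta_0 \delta_1 \cdots \times 2^e \geq \frac{1}{2} \times 2^e . \end{equation*}
Thus,
\begin{equation*} \frac{\delta\!\chi}{\chi} \leq \frac{2^{-t} 2^e}{\frac{1}{2} 2^e} = 2^{-(t-1)}, \end{equation*}
which can also be written as
\begin{equation*} \delta\!\chi \leq 2^{-(t-1)} \chi . \end{equation*}
A careful analysis of what happens when \(\chi \) equals zero or is negative yields
\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-(t-1)} \vert \chi \vert . \end{equation*}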
Example 6.2.2.2.
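The number \(4/3 = 1.3333\cdots \) can be written as
\begin{equation*} 1.3333\cdots = 1 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \cdots = 1.0101\cdots \mbox{ (in binary)} = .10101\cdots \times 2^1 . \end{equation*}
Now, if \(t = 4 \) then this would be truncated to
\begin{equation*} .1010 \times 2^1, \end{equation*}
which equals the number
\begin{equation*} 1.01 \mbox{ (in binary)} = 1 + \frac{0}{2} + \frac{1}{4} = 1.25 . \end{equation*}
The relative error equals
\begin{equation*} \frac{1.333\cdots - 1.25}{1.333\cdots} = 0.0625 = 2^{-4} \leq 2^{-(t-1)} = 2^{-3} . \end{equation*}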
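The truncation bound \(\vert \delta\!\chi \vert \leq 2^{-(t-1)} \vert \chi \vert \) and the example above can be checked with exact rational arithmetic. The following is a minimal sketch, not a model of any real floating point unit: the helper name `truncate_to_t_bits` is made up for this illustration, it handles only positive \(\chi \text{,}\) and it places no limit on the exponent.

```python
import math
from fractions import Fraction


def truncate_to_t_bits(chi: Fraction, t: int) -> Fraction:
    """Store positive chi = .d0 d1 ... x 2^e (d0 = 1), keeping only t mantissa digits."""
    if chi <= 0:
        raise ValueError("this sketch only handles positive chi")
    # Find the exponent e for which 1/2 <= chi / 2^e < 1 (normalized mantissa).
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    mantissa = chi / Fraction(2) ** e
    # Shifting left by t digits and flooring discards digits d_t, d_{t+1}, ...
    kept = math.floor(mantissa * 2 ** t)
    return Fraction(kept, 2 ** t) * Fraction(2) ** e


chi = Fraction(4, 3)
chi_check = truncate_to_t_bits(chi, 4)   # Fraction(5, 4), i.e. 1.25
rel_err = (chi - chi_check) / chi        # Fraction(1, 16), i.e. 2^{-4}
assert rel_err <= Fraction(1, 2 ** (4 - 1))
```

Because `Fraction` is exact, the computed relative error is not itself contaminated by roundoff, which is why it is used here instead of Python floats.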
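If \(\check \chi \) is computed by rounding instead of truncating, then
\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-t} \vert \chi \vert . \end{equation*}
We can abstract away from the details of the base that is chosen and whether rounding or truncation is used by stating that storing \(\chi \) as the floating point number \(\check \chi \) obeys
\begin{equation*} \vert \delta\!\chi \vert \leq \meps \vert \chi \vert , \end{equation*}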
where \(\meps \) is known as the machine epsilon or unit roundoff. When single precision floating point numbers are used \(\meps \approx 10^{-8} \text{,}\) yielding roughly eight decimal digits of accuracy in the stored value. When double precision floating point numbers are used \(\meps \approx 10^{-16} \text{,}\) yielding roughly sixteen decimal digits of accuracy in the stored value.
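These orders of magnitude can be inspected on a concrete machine. The sketch below assumes that Python floats are IEEE 754 double precision numbers (true for CPython on common platforms) and that `struct` format `"<f"` converts to IEEE 754 single precision; the helper `to_single` is a name introduced only for this illustration.

```python
import struct
import sys

# Gap between 1.0 and the next larger double: 2^{-52} for IEEE 754 binary64.
double_gap = sys.float_info.epsilon
# With round-to-nearest, the relative storage error is at most half that gap.
double_unit_roundoff = double_gap / 2    # 2^{-53}, roughly 1.1e-16


def to_single(x: float) -> float:
    """Round a double to the nearest single precision value, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x = 4.0 / 3.0
# Relative error committed by storing x in single precision.
single_rel_err = abs(to_single(x) - x) / x
single_unit_roundoff = 2.0 ** -24        # roughly 6.0e-8
assert single_rel_err <= single_unit_roundoff
```

Note that `sys.float_info.epsilon` reports the gap above one, which is twice the bound on the relative storage error when rounding to nearest.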
Example 6.2.2.3.
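The number \(4/3 = 1.3333\cdots \) can be written as
\begin{equation*} 1.3333\cdots = 1 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \cdots = 1.0101\cdots \mbox{ (in binary)} = .10101\cdots \times 2^1 . \end{equation*}
Now, if \(t = 4 \) then this would be rounded to
\begin{equation*} .1011 \times 2^1, \end{equation*}
which equals the number
\begin{equation*} 1.011 \mbox{ (in binary)} = 1 + \frac{0}{2} + \frac{1}{4} + \frac{1}{8} = 1.375 . \end{equation*}
The relative error equals
\begin{equation*} \frac{\vert 1.333\cdots - 1.375 \vert}{1.333\cdots} = 0.03125 = 2^{-5} \leq 2^{-t} = 2^{-4} . \end{equation*}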
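The rounded value and its relative error can be reproduced exactly. This is a sketch under stated assumptions: `round_to_t_bits` is a hypothetical helper, it accepts only positive inputs, and ties are rounded up, which is one of several possible tie-breaking rules.

```python
import math
from fractions import Fraction


def round_to_t_bits(chi: Fraction, t: int) -> Fraction:
    """Round positive chi to the nearest number with a t-digit binary mantissa."""
    if chi <= 0:
        raise ValueError("this sketch only handles positive chi")
    # Normalize so that 1/2 <= chi / 2^e < 1.
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    scaled = chi / Fraction(2) ** e * 2 ** t
    # Adding 1/2 before flooring rounds to the nearest integer (ties go up).
    kept = math.floor(scaled + Fraction(1, 2))
    return Fraction(kept, 2 ** t) * Fraction(2) ** e


chi = Fraction(4, 3)
chi_check = round_to_t_bits(chi, 4)       # Fraction(11, 8), i.e. 1.375
rel_err = abs(chi - chi_check) / chi      # Fraction(1, 32), i.e. 2^{-5}
assert rel_err <= Fraction(1, 2 ** 4)
```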
Definition 6.2.2.4. Machine epsilon (unit roundoff).
The machine epsilon (unit roundoff), \(\meps \text{,}\) is defined as the smallest positive floating point number \(\chi \) such that the floating point number that represents \(1 + \chi \) is greater than one.
Remark 6.2.2.5.
The quantity \(\meps\) is machine dependent. It is a function of the parameters characterizing how a specific architecture converts reals to floating point numbers.
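One common way to observe this machine dependence is to search for the value experimentally. The sketch below only tries powers of two, so it returns the smallest power of two \(\chi \) for which \(1 + \chi \) is stored as a number greater than one; on hardware that rounds to nearest, some values that are not powers of two and lie slightly below this result also satisfy the definition. The function name `smallest_power_of_two_eps` is an assumption of this illustration.

```python
def smallest_power_of_two_eps() -> float:
    """Halve eps until adding the next candidate to 1.0 no longer changes 1.0."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


eps = smallest_power_of_two_eps()
# Adding eps to 1.0 is visible in the stored result; adding eps / 2 is not.
assert 1.0 + eps > 1.0
assert 1.0 + eps / 2.0 == 1.0
```

With IEEE 754 double precision and round-to-nearest-even, this loop is expected to stop at \(2^{-52} \text{.}\)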
Homework 6.2.2.1.
Assume a floating point number system with \(\beta = 2 \text{,}\) a mantissa with \(t \) digits, and truncation when storing.
Write the number \(1 \) as a floating point number in this system.
What is the \(\meps \) for this system?
Write the number \(1 \) as a floating point number.
Answer:
\begin{equation*} \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \\ \mbox{ digits} \end{array} \times 2^1. \end{equation*}
What is the \(\meps \) for this system?
Answer:
\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ \gt 1 \end{array} \end{equation*}
and
\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \lt 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \mbox{ truncates to } 1 \end{array} \end{equation*}
Notice that
\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1 \end{equation*}
can be represented as
\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 0 \cdots 0} \\ t \mbox{ digits} \end{array} \times 2^{-(t-2)} \end{equation*}
and
\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1 \end{equation*}
as
\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 1 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^{-(t-1)} \end{equation*}
Hence \(\meps = 2^{-(t-1)} \text{.}\)