Floating Point Review

Chuck Garcia

Floating Point

FP - Intro:

  1. Given a value fractional value: d_md_{m-1}...d_0 . d_0...d_n
  2. We would like to be able to represent it in binary
  3. Floating-Point representation is one way in which we can achieve this

FP - Format:

  1. Floating point numbers allow us to represent decimal values of the form: V = (-1)^s \times M \times 10^e
    • V : value
    • M : Fractional (significand) part of value (e.g 1.123)
    • s : Sign
    • e : Exponent part

FP - Bit Layout:

s e⁰ f⁰
  1. Single Precision:
    • 32 bit width:
      • float in C
  2. Double Precision:
    • 64 bit width
      • double in C

Preliminary Remark:

  1. Floating point is a representation that facilitates the encoding of fractional decimal values that are in scientific notation form.
    • e.g 23.12 \times 10^1
  2. All decimal values that are encoded, are encoded such that only 1 digit
    left of the decimal point.

FP - Encoding Cases:

  1. The arrangement of bits in a bit representation, effect the way we treat floating point:
    • 3 different Cases:
      1. Normalized:
        • If the exponent field is not all zeros and not all ones, then the floating point is said to be normalized.
      2. Denormalized:
        • If the exponent field is all zeros, then the floating point value is said to be denormal.
      3. Special:
        • If the exponent field is all 1’s, then the floating point value is a special value (e.g infinity, or NaN).

1. Normalized:

2. Denormalized:

Multiplication

Basics (Assuming A and B are normalized values):

Example:

Truncation in Multiplication:

  1. Adjustment: Reduce the 2p-bit product to ‘p’ bits, deciding which bits to keep.
  2. Rounding: Apply a rounding strategy (e.g., round to nearest, toward zero) to minimize truncation error.
  3. Overflow and Underflow: Check and handle cases where the product is too large (overflow) or too small (underflow).
  4. Normalization: Adjust the significand and exponent so the significand falls within the expected range.
  5. Sign Bit: Determine the sign of the result using the XOR of the operands’ sign bits.