CSCI 2227, Introduction to Scientific Computation
Prof. Alvarez
IEEE 754 Floating Point Standard
The IEEE 754 standard describes a set of rules for representing
floating point numbers such as -25.77 and 2.5*10^-29
in terms of strings of binary digits. This provides a finite precision
model of real numbers for use in scientific computation.
Binary scientific notation
IEEE 754 is based on "normalized scientific notation" in base 2.
The number to be represented is first converted to the form
mantissa * 2^exponent
where the mantissa has a value between 1 (inclusive) and 2 (exclusive).
The mantissa and exponent are then expressed in binary positional notation
(refer to the discussion in the first lecture).
This yields the desired normalized scientific notation for the number.
Example
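In normalized base 2 scientific notation, the value 10+(1/8), which is
1010.001 in binary, is represented as
1.010001 * 2^011
where both the mantissa and the exponent (3 in decimal) are written in binary.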
The actual bit strings could change depending on the number of bits
allocated for the mantissa and exponent. For instance, using 8 bits
for each of these, we would have the following representation:
1.0100010 * 2^00000011
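The normalization step can be sketched in Python. This is only an illustration, not part of the standard: the helper names normalize_base2 and binary_fraction are hypothetical, and the sketch assumes a positive input.

```python
import math

def normalize_base2(x):
    """Return (mantissa, exponent) with x == mantissa * 2**exponent and 1 <= mantissa < 2.

    Hypothetical helper for illustration; assumes x > 0.
    """
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    # math.frexp returns m in [0.5, 1) with x == m * 2**e,
    # so shift one power of 2 from the exponent into the mantissa.
    m, e = math.frexp(x)
    return m * 2, e - 1

def binary_fraction(mantissa, bits):
    """Write a mantissa in [1, 2) in binary with the given number of fractional bits."""
    frac = mantissa - 1
    digits = []
    for _ in range(bits):
        frac *= 2
        bit = int(frac)          # next binary digit after the point
        digits.append(str(bit))
        frac -= bit
    return "1." + "".join(digits)

mantissa, exponent = normalize_base2(10 + 1/8)
print(mantissa, exponent)              # 1.265625 3
print(binary_fraction(mantissa, 6))    # 1.010001
print(format(exponent, "03b"))         # 011
```

Asking for 7 fractional bits and an 8-bit exponent, with binary_fraction(mantissa, 7) and format(exponent, "08b"), gives the 8-bit forms 1.0100010 and 00000011.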
Assembling the IEEE 754 representation
IEEE 754 comes in several different levels of precision: single (32 bits),
double (64 bits), extended (usually 80 bits), and quadruple (128 bits).
I will only discuss the double precision version here, as it is very
widely used, and it is the version to which MATLAB defaults.
The double precision IEEE 754 representation of a number is broken down
as follows:
sign (1 bit) | exponent (11 bits) | mantissa (52 bits)
Notes
- The sign bit is 0 if the target number is non-negative
and 1 if the target number is negative.
- The exponent is encoded in "excess-1023" notation,
which just means that what is stored is actually the binary
representation of (exponent+1023).
- The mantissa is stored without the leading 1.
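As a rough illustration of this layout, the following Python sketch splits a double into its three fields. The function name double_fields is hypothetical. The sketch relies on the struct module's "d" format, which packs a float as an 8-byte IEEE 754 double.

```python
import struct

def double_fields(x):
    """Split a double into (sign, stored exponent, stored mantissa bits).

    Hypothetical helper for illustration only.
    """
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                          # 1 bit
    stored_exponent = (bits >> 52) & 0x7FF     # 11 bits, excess-1023
    fraction = bits & ((1 << 52) - 1)          # 52 bits, leading 1 not stored
    return sign, stored_exponent, fraction

sign, stored_exponent, fraction = double_fields(-25.77)
print(sign)                       # 1, since the number is negative
print(stored_exponent - 1023)     # 4, since 16 <= 25.77 < 32
```

Subtracting 1023 from the stored exponent recovers the true exponent, which undoes the excess-1023 encoding.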
Example
Following the above steps, we find the IEEE 754 representation
of the value 10 + (1/8):
- The sign bit is 0 since 10 + (1/8) is non-negative.
- In "excess-1023" notation, the exponent 3 is stored as the standard
11-bit binary representation of 3 + 1023 = 1026, which is 10000000010.
- Without the leading 1, the mantissa 1.010001 becomes 010001,
followed by an additional 46 zeros to fill the
required 52 bits.
We conclude that the double-precision IEEE 754 representation of 10 + (1/8) is:
0 10000000010 01000100 + 44 more zeros
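One way to double-check this assembly is to build the 64-bit string from the three fields and reinterpret it as a double. The Python sketch below does that. The variable names are illustrative only.

```python
import struct

sign = 0                                   # 10 + 1/8 is non-negative
stored_exponent = 3 + 1023                 # excess-1023 encoding of exponent 3
fraction_bits = "010001" + "0" * 46        # mantissa without the leading 1

# Concatenate sign (1 bit), exponent (11 bits), and mantissa (52 bits).
bit_string = str(sign) + format(stored_exponent, "011b") + fraction_bits
assert len(bit_string) == 64

# Reinterpret the 64-bit pattern as an IEEE 754 double.
value = struct.unpack(">d", int(bit_string, 2).to_bytes(8, "big"))[0]

print(format(stored_exponent, "011b"))     # 10000000010
print(value)                               # 10.125
```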
This would normally be partitioned 4 bits at a time and expressed
in hexadecimal (base 16) notation, as follows:
0100 0000 0010 0100 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
=
4 0 2 4 4 0 0 0 0 0 0 0 0 0 0 0
You can check the result in MATLAB by entering "format hex" and then
evaluating 10 + 1/8.
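The same check can be done outside MATLAB. The following Python sketch is an assumed equivalent, not something the course prescribes. It packs the value as a big-endian double with the struct module and prints the bytes in hexadecimal.

```python
import struct

# Pack 10 + 1/8 as an 8-byte big-endian IEEE 754 double and show it in hex.
hex_digits = struct.pack(">d", 10 + 1/8).hex()
print(hex_digits)                          # 4024400000000000

# Expand the hex digits back into 16 groups of 4 bits.
bits = format(int(hex_digits, 16), "064b")
groups = " ".join(bits[i:i + 4] for i in range(0, 64, 4))
print(groups)
```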