CSCI 2227, Introduction to Scientific Computation
Prof. Alvarez
IEEE 754 Floating Point Standard

The IEEE 754 standard describes a set of rules for representing floating point numbers such as -25.77 and 2.5*10^-29 in terms of strings of binary digits. This provides a finite precision model of real numbers for use in scientific computation.

Binary scientific notation

IEEE 754 is based on "normalized scientific notation" in base 2. The number to be represented is first converted to the form

mantissa * 2^exponent

where the mantissa has a value between 1 (inclusive) and 2 (exclusive). The mantissa and exponent are then expressed in binary positional notation (refer to the discussion in the first lecture). This yields the desired normalized scientific notation for the number.
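The normalization step can be sketched in Python as follows. This is only an illustration, not part of the standard itself: the helper name normalize is hypothetical, and it relies on math.frexp from the standard library, which returns a mantissa in the range [0.5, 1) that then has to be rescaled to [1, 2).

```python
import math

def normalize(x):
    """Return (mantissa, exponent) with x == mantissa * 2**exponent
    and 1 <= mantissa < 2. Assumes x is a positive float."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1    # rescale so that 1 <= mantissa < 2

mantissa, exponent = normalize(10 + 1/8)
print(mantissa, exponent)  # 1.265625 3
```

Here 1.265625 is the decimal value of the mantissa; writing it in binary positional notation is the remaining step described above.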

Example

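The value 10 + (1/8) is 1010.001 in binary. Shifting the binary point three places to the left gives its normalized base 2 scientific notation, with the exponent 3 written in binary as 011:

1.010001 * 2^011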

The actual bit strings could change depending on the number of bits allocated for the mantissa and exponent. For instance, using 8 bits for each of these, we would have the following representation:

1.0100010 * 2^00000011
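As a rough check of the 8-bit example, the sketch below formats a value with a chosen number of mantissa and exponent bits. The function name binary_scientific and its parameters are hypothetical. It assumes a positive value with a non-negative exponent, and it simply truncates any mantissa bits that do not fit.

```python
import math

def binary_scientific(x, mantissa_bits=8, exponent_bits=8):
    """Format x as '1.fff * 2^eee' using fixed bit widths.
    Assumes x >= 1, so that the exponent is non-negative."""
    if x < 1:
        raise ValueError("this sketch only handles values >= 1")
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= m < 1
    m, e = m * 2, e - 1             # normalize so that 1 <= m < 2
    if e >= 2 ** exponent_bits:
        raise ValueError("exponent does not fit in the given width")
    frac_bits = mantissa_bits - 1   # one bit is taken by the leading 1
    frac = int((m - 1) * 2 ** frac_bits)  # truncate extra bits
    return f"1.{frac:0{frac_bits}b} * 2^{e:0{exponent_bits}b}"

print(binary_scientific(10 + 1/8))  # 1.0100010 * 2^00000011
```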

Assembling the IEEE 754 representation

IEEE 754 comes in several different levels of precision: single (32 bits), double (64 bits), extended (usually 80 bits), and quadruple (128 bits). I will only discuss the double precision version here, as it is very widely used, and it is the version to which MATLAB defaults.

The double precision IEEE 754 representation of a number is broken down as follows:

sign (1 bit) | exponent (11 bits) | mantissa (52 bits)

Notes

The sign bit is 0 if the target number is non-negative and 1 if the target number is negative.
The exponent is encoded in "excess-1023" notation, which just means that what is stored is actually the binary representation of (exponent+1023).
The mantissa is stored without the leading 1.
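The three fields described in these notes can be pulled out of an actual double. The sketch below assumes that Python floats are IEEE 754 doubles, which holds on essentially all current platforms. The helper name ieee754_fields is hypothetical; struct.pack with the ">d" format produces the 8 bytes of the double, most significant byte first.

```python
import struct

def ieee754_fields(x):
    """Return (sign, stored_exponent, fraction) for the double x."""
    bits = int.from_bytes(struct.pack(">d", x), "big")  # 64-bit pattern
    sign = bits >> 63                         # top bit
    stored_exponent = (bits >> 52) & 0x7FF    # next 11 bits (exponent + 1023)
    fraction = bits & ((1 << 52) - 1)         # low 52 bits, no leading 1
    return sign, stored_exponent, fraction

sign, stored, fraction = ieee754_fields(-25.77)
print(sign, stored - 1023)  # 1 4
```

Subtracting 1023 from the stored exponent recovers the true exponent: 25.77 lies between 2^4 and 2^5, so the exponent is 4.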

Example

Following the above steps, we find the IEEE 754 representation of the value 10 + (1/8):

The sign bit is 0 since 10 + (1/8) is non-negative.
The exponent is 3, so in "excess-1023" notation we store the standard binary representation of 3 + 1023 = 1026 using 11 bits, which is 10000000010.
Without the leading 1, the mantissa 1.010001 becomes 010001, followed by 46 zeros to fill the required 52 bits.

We conclude that the double-precision IEEE 754 representation of 10 + (1/8) is:

0 10000000010 01000100 + 44 more zeros
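The assembly steps can be mirrored directly in a few lines of Python. This is a minimal sketch: the function name encode_double is hypothetical, and it takes the sign bit, the true exponent, and the mantissa bits after the leading 1 as inputs that have already been worked out by hand.

```python
def encode_double(sign, exponent, fraction_bits):
    """Assemble the sign, exponent, and mantissa fields as bit strings."""
    exponent_field = format(exponent + 1023, "011b")  # excess-1023, 11 bits
    mantissa_field = fraction_bits.ljust(52, "0")     # pad with zeros to 52 bits
    return f"{sign} {exponent_field} {mantissa_field}"

encoded = encode_double(0, 3, "010001")
print(encoded)
```

The printed string consists of the sign bit 0, the exponent field 10000000010, and the mantissa field 010001 followed by 46 zeros.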

This would normally be partitioned 4 bits at a time and expressed in hexadecimal (base 16) notation, as follows:

0100 0000 0010 0100 0100 0000 ... 0000 (16 groups in all) = 4 0 2 4 4 0 0 0 0 0 0 0 0 0 0 0
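The regrouping into hexadecimal digits can be checked mechanically. The sketch below builds the 64-bit string from the three fields found in this example, cuts it into groups of 4 bits, and converts each group to one hex digit. The variable names are illustrative only.

```python
# sign bit, 11-bit exponent field, 52-bit mantissa field
bits = "0" + "10000000010" + "010001" + "0" * 46

# split the 64 bits into 16 groups of 4 bits each
nibbles = [bits[i:i + 4] for i in range(0, 64, 4)]

# each group of 4 bits corresponds to one hexadecimal digit
hex_digits = "".join(format(int(n, 2), "x") for n in nibbles)
print(hex_digits)  # 4024400000000000
```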

You can check the result in MATLAB by entering the command format hex and then evaluating 10 + 1/8, which displays 4024400000000000.
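For readers without MATLAB at hand, a comparable check can be done in Python. This sketch assumes that Python floats are IEEE 754 doubles; the helper name double_to_hex is hypothetical.

```python
import struct

def double_to_hex(x):
    """Return the 16 hex digits of the double x, most significant first."""
    return struct.pack(">d", x).hex()

print(double_to_hex(10 + 1/8))  # 4024400000000000
```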
