
=== Internal representation ===
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and a field for the significand, from left to right. For the [[IEEE 754]] binary formats (basic and extended) that have extant hardware implementations, they are apportioned as follows:

{| class="wikitable" style="text-align:right; border:0"
|-
!rowspan="2" |Format
!colspan="4" |Bits for the encoding<!-- Since this is about the encoding, it should be clear that the number given for the significand below excludes the implicit bit, when this is used. -->
| rowspan="8" style="background:white; border:0"|
!rowspan="2" |Exponent<br>bias
!rowspan="2" |Bits of<br>precision
!rowspan="2" |Number of<br>decimal digits
|-
!Sign
!Exponent
!Significand
!Total
|-
|[[Half-precision floating-point format|Half]] (binary16)
|1
|5
|10
|16
|15
|11
|~3.3
|-
|[[Single-precision floating-point format|Single]] (binary32)
|1
|8
|23
|32
|127
|24
|~7.2
|-
|[[Double-precision floating-point format|Double]] (binary64)
|1
|11
|52
|64
|1023
|53
|~15.9
|-
|[[Extended precision#x86 extended-precision format|x86 extended]]
|1
|15
|64
|80
|16383
|64
|~19.2
|-
|[[Quadruple-precision floating-point format|Quadruple]] (binary128)
|1
|15
|112
|128
|16383
|113
|~34.0
|-
|[[Octuple-precision floating-point format|Octuple]] (binary256)
|1
|19
|236
|256
|262143
|237
|~71.3
|}
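
The last three columns of the table follow from the field widths. The following is a minimal sketch, not a reference implementation: it assumes the usual IEEE 754 bias of 2<sup>''w''−1</sup> − 1 for a ''w''-bit exponent field, counts one extra bit of precision when the leading significand bit is implicit, and estimates decimal digits as ''p''·log<sub>10</sub>2. The names <code>FORMATS</code> and <code>derived_columns</code> are illustrative only.

```python
import math

# (name, exponent bits, stored significand bits, leading bit is implicit?)
FORMATS = [
    ("binary16", 5, 10, True),
    ("binary32", 8, 23, True),
    ("binary64", 11, 52, True),
    ("x86 extended", 15, 64, False),  # leading bit is stored explicitly
    ("binary128", 15, 112, True),
    ("binary256", 19, 236, True),
]

def derived_columns(exp_bits, sig_bits, implicit):
    """Return (exponent bias, bits of precision, approximate decimal digits)."""
    bias = 2 ** (exp_bits - 1) - 1
    precision = sig_bits + (1 if implicit else 0)
    digits = precision * math.log10(2)
    return bias, precision, digits

for name, exp_bits, sig_bits, implicit in FORMATS:
    bias, precision, digits = derived_columns(exp_bits, sig_bits, implicit)
    print(f"{name:13} bias={bias:<7} precision={precision:<4} digits~{digits:.2f}")
```

The digit counts in the table agree with these values truncated to one decimal place.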

While the exponent can be positive or negative, in binary formats it is stored as an unsigned number with a fixed "bias" added to it. A field of all 0s is reserved for the zeros and [[subnormal number]]s; a field of all 1s is reserved for the infinities and NaNs. The exponent range for normal numbers is therefore [−126, 127] for single precision, [−1022, 1023] for double, and [−16382, 16383] for quadruple precision. Normal numbers exclude subnormal values, zeros, infinities, and NaNs.

In the IEEE binary interchange formats, the leading bit of a normalized significand is not stored in the computer datum, since it is always 1. It is called the "hidden" or "implicit" bit. Because of this, the single-precision format has a significand with 24 bits of precision, the double-precision format has 53, quadruple has 113, and octuple has 237.
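
The roles of the bias, the reserved exponent values, and the hidden bit can be illustrated by decoding a single-precision value by hand. The sketch below is illustrative only: the function name <code>decode_binary32</code> is hypothetical, and it relies on Python's standard <code>struct</code> module to obtain the 32-bit pattern.

```python
import struct

def decode_binary32(x):
    """Split x (rounded to binary32) into its fields and rebuild the value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF      # 8-bit exponent field, bias 127
    fraction = bits & 0x7FFFFF        # 23 stored significand bits

    if biased == 0xFF:                # all 1s: infinity or NaN
        kind = "infinity" if fraction == 0 else "NaN"
        value = None
    elif biased == 0:                 # all 0s: zero or subnormal, no hidden bit
        kind = "zero" if fraction == 0 else "subnormal"
        value = fraction * 2.0 ** (-126 - 23)
    else:                             # normal: restore the hidden leading 1
        kind = "normal"
        value = ((1 << 23) | fraction) * 2.0 ** (biased - 127 - 23)

    if value is not None and sign:
        value = -value
    return sign, biased, fraction, kind, value

print(decode_binary32(1.0))
print(decode_binary32(-0.15625))
print(decode_binary32(2.0 ** -149))
```

For normal numbers the 24-bit integer formed from the hidden bit and the fraction is scaled by 2 raised to the unbiased exponent minus 23, which places the binary point directly after the leading 1.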

For example, it was shown above that π, rounded to 24 bits of precision, has:
* sign = 0 ; ''e'' = 1 ; ''s'' = 110010010000111111011011 (including the hidden bit)
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in the single-precision format as
* 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB<ref name="IEEE-754_Analysis"/> as a [[hexadecimal]] number.
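
The encoding of π above can be reproduced step by step. This is a minimal sketch using the sign, exponent, and significand given in the example; the helper name <code>encode_binary32</code> is hypothetical, and the result is cross-checked against the pattern produced by Python's <code>struct</code> module.

```python
import math
import struct

def encode_binary32(sign, e, s):
    """Assemble a binary32 word from a sign bit, an unbiased exponent e, and
    a 24-character significand bit string s that includes the hidden bit."""
    assert len(s) == 24 and s[0] == "1", "expects a normalized 24-bit significand"
    biased = e + 127                  # add the single-precision bias
    fraction = int(s[1:], 2)          # drop the hidden bit, keep 23 bits
    word = (sign << 31) | (biased << 23) | fraction
    return f"{word:08X}"

pi_hex = encode_binary32(0, 1, "110010010000111111011011")
print(pi_hex)

# Cross-check against the conversion performed by the struct module.
print(struct.pack(">f", math.pi).hex().upper())
```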

An example of a layout for [[Single-precision floating-point format|32-bit floating point]] is
[[File:Float example.svg|none]]
and the [[Double-precision floating-point format|64-bit ("double")]] layout is similar.