Floating point problem

Mon Apr 20 16:19:31 EDT 2020

On Tue, Apr 21, 2020 at 6:07 AM Schachner, Joseph
<Joseph.Schachner at teledyne.com> wrote:
>
> 16 base 10 digits / log base10(2) = 53.1508495182 bits. Obviously, fractional bits don't exist, so 53 bits. If you note that the first non-zero digit was 4, and the first digit after the 15 zeroes was 2, then you got an extra bit. 54 bits. Where did the extra bit come from? It came from the IEEE format's assumption that the top bit of the mantissa of a normalized floating point value must be 1. Since we know what it must be, there is no reason to use an actual bit for it. The 53 bits in the mantissa do not include the assumed top bit.
>
> Isn't floating point fun?
>

IEEE 64-bit packed floating point has 53 bits of mantissa, 11 scale
bits (you've heard of "scale birds" in art? well, these are "scale
bits"), and 1 sign bit.

53 + 11 + 1 == 64.

Yep, floating point is fun.
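
For anyone who wants to check the arithmetic: only 52 of those 53
mantissa bits are actually stored, so the packed layout is
52 + 11 + 1 == 64. Here's a rough sketch that pulls the three fields
apart. The name float_fields is made up for illustration, and it
assumes your Python's float is an IEEE 754 binary64 (true on every
mainstream platform, but still an assumption).

```python
import struct

def float_fields(x):
    """Split a float (assumed IEEE 754 binary64) into its stored fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 sign bit
    exponent = (bits >> 52) & 0x7FF      # 11 exponent ("scale") bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52 stored mantissa bits
    return sign, exponent, fraction

print(float_fields(1.0))   # (0, 1023, 0)
print(float_fields(-2.0))  # (1, 1024, 0)
```

The fraction field for 1.0 is all zeroes: the leading 1 is the implied
bit and never appears in storage.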

That assumed top 1 bit is always there, except when it isn't. Because
denormal numbers are a thing. They don't have that implied 1 bit.

Yep, floating point is fun.
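
If you want to see a denormal in the wild, here's a small sketch, again
assuming IEEE 754 binary64 floats. The helper exponent_field is a
made-up name. An exponent field of zero is what marks a value as
denormal (or zero), and that's the case where the implied 1 bit is
absent.

```python
import math
import struct
import sys

def exponent_field(x):
    """Return the raw 11-bit exponent field of a float (assumed binary64)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

smallest_normal = sys.float_info.min          # 2**-1022, still has the implied 1
smallest_denormal = math.ldexp(1.0, -1074)    # 2**-1074, only the lowest stored bit set

print(exponent_field(smallest_normal))    # 1
print(exponent_field(smallest_denormal))  # 0
print(smallest_denormal / 2)              # 0.0, nothing smaller is representable
```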

ChrisA