Floating point problem

Tue Apr 21 09:29:27 EDT 2020

Chris Angelico <rosuav at gmail.com> writes:

> On Tue, Apr 21, 2020 at 6:07 AM Schachner, Joseph
> <Joseph.Schachner at teledyne.com> wrote:
>>
>> 16 base 10 digits / log base10( 2) = 53.1508495182 bits. Obviously,
>> fractional bits don't exist, so 53 bits. If you note that the first
>> non-zero digit as 4, and the first digit after the 15 zeroes was 2,
>> then you got an extra bit. 54 bits. Where did the extra bit come from?
>> It came from the IEEE format's assumption that the top bit of the
>> mantissa of a normalized floating point value must be 1. Since we know
>> what it must be, there is no reason to use an actual bit for it. The
>> 53 bits in the mantissa do not include the assumed top bit.
>>
>> Isn't floating point fun?
>>
>
> IEEE 64-bit packed floating point has 53 bits of mantissa, 11 scale
> bits (you've heard of "scale birds" in art? well, these are "scale
> bits"), and 1 sign bit.
>
> 53 + 11 + 1 == 64.

I don't know if this was meant to be sarcasm, but in my universe 53 + 11 + 1 == 65.
But the IEEE 754 64-bit format has 53 bits of precision of which only 52 bits are stored; the leading 1 bit of a normalized number is implicit.
52 + 11 + 1 == 64.
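
For anyone who wants to check the field widths, here is a minimal sketch. It assumes that Python's float is an IEEE 754 binary64 value, which holds on all common platforms. The helper name `fields` is just something I made up for illustration.

```python
import struct
import sys

def fields(x):
    """Split a float into its (sign, biased exponent, stored fraction) fields."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)     # 52 stored bits, no leading 1
    return sign, exponent, fraction

print(sys.float_info.mant_dig)  # 53: the precision, counting the implicit bit
print(fields(1.0))              # (0, 1023, 0)
print(fields(-1.5))             # (1, 1023, 2251799813685248), fraction == 1 << 51
```

Note that 1.0 has an all-zero fraction field: the leading 1 of its mantissa is not stored anywhere.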

> Yep, floating point is fun.
>
> That assumed top 1 bit is always there, except when it isn't. Because
> denormal numbers are a thing. They don't have that implied 1 bit.

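Yes, for subnormal numbers there is no implicit leading 1: the leading bit is taken to be 0, so the first significant bit must be stored explicitly in the fraction field. Subnormals are characterized by the biased exponent being 0 (with a nonzero fraction). So these have at most 52 bits of precision in the mantissa, and fewer as they get closer to zero.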
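
A small sketch to see this from Python, assuming floats are IEEE 754 binary64. The helper name `split` is hypothetical, not a library function.

```python
import struct
import sys

def split(x):
    """Return (biased exponent, stored fraction) of a non-negative float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

smallest_normal = sys.float_info.min   # 2**-1022
largest_power_below = smallest_normal / 2   # 2**-1023, a subnormal
smallest_subnormal = 2.0 ** -1074

for x in (smallest_normal, largest_power_below, smallest_subnormal):
    exponent, fraction = split(x)
    # For subnormals (exponent == 0) the number of significant bits is
    # simply the bit length of the stored fraction.
    print(exponent, fraction.bit_length())
```

The smallest normal number has biased exponent 1 and an empty fraction field. The two subnormals have biased exponent 0, with 52 and 1 significant bits in the fraction field respectively.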

> Yep, floating point is fun.
>
> ChrisA

-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]