Floating point problem

Mon Apr 20 16:40:10 EDT 2020

On 4/20/20 4:19 PM, Chris Angelico wrote:
> On Tue, Apr 21, 2020 at 6:07 AM Schachner, Joseph
> <Joseph.Schachner at teledyne.com> wrote:
>> 16 base 10 digits / log base10( 2) = 53.1508495182 bits.   Obviously, fractional bits don't exist, so 53 bits. If you note that the first non-zero digit as 4, and the first digit after the 15 zeroes was 2, then you got an extra bit. 54 bits.  Where did the extra bit come from?  It came from the IEEE format's assumption that the top bit of the mantissa of a normalized floating point value must be 1.  Since we know what it must be, there is no reason to use an actual bit for it.  The 53 bits in the mantissa do not include the assumed top bit.
>>
>> Isn't floating point fun?
>>
> IEEE 64-bit packed floating point has 53 bits of mantissa, 11 scale
> bits (you've heard of "scale birds" in art? well, these are "scale
> bits"), and 1 sign bit.
>
> 53 + 11 + 1 == 64.
>
> Yep, floating point is fun.
>
> That assumed top 1 bit is always there, except when it isn't. Because
> denormal numbers are a thing. They don't have that implied 1 bit.
>
> Yep, floating point is fun.
>
> ChrisA

Well, the assumed 1 is there unless the exponent field is all zeros, in
which case you have a denormal or zero value. Those store all their bits
explicitly, so we get 52 more powers of two of range at the cost of
steadily reduced precision. With binary floating point, you only have
denormals near underflow.
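
If you want to poke at this from Python, here is a rough sketch. It
assumes your build's float is IEEE 754 binary64 (true on every mainstream
platform I know of); the helper name fields() is just something I made up
for the illustration.

```python
import math
import struct
import sys


def fields(x):
    """Split a float into (sign, exponent field, fraction field).

    Assumes the float is an IEEE 754 binary64 value: 1 sign bit,
    11 exponent bits and 52 explicitly stored fraction bits.
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction


smallest_normal = sys.float_info.min          # 2**-1022
smallest_denormal = math.ldexp(1.0, -1074)    # 2**-1074

print(fields(1.0))                # (0, 1023, 0)  implied leading 1
print(fields(smallest_normal))    # (0, 1, 0)     still has the implied 1
print(fields(smallest_denormal))  # (0, 0, 1)     exponent all zeros, no implied 1
print(fields(0.0))                # (0, 0, 0)     exponent all zeros as well

# Count how many times the smallest normal can be halved before
# the result underflows to zero.
extra_powers = 0
x = smallest_normal
while x / 2 > 0.0:
    x /= 2
    extra_powers += 1
print(extra_powers)               # 52
```

Each halving below the smallest normal costs one bit of precision, which
is the trade for the extra range.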

Now decimal floating point doesn't have this implied leading 1, but it
can have unnormalized values (leading zero digits in the coefficient)
over almost all of its range.
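
Python's decimal module is a handy way to see the same idea, with the
caveat that (as far as I know) it follows the General Decimal Arithmetic
specification rather than the IEEE 754 decimal bit encodings, so take
this as an illustration of the model and not of the storage format.

```python
from decimal import Decimal

a = Decimal("1")
b = Decimal("1.00")
big = Decimal("1.00E+50")

# Equal in value, yet held with different coefficients and exponents.
print(a == b)                     # True
print(a.as_tuple())               # DecimalTuple(sign=0, digits=(1,), exponent=0)
print(b.as_tuple())               # DecimalTuple(sign=0, digits=(1, 0, 0), exponent=-2)

# The same thing happens nowhere near underflow.
print(big.as_tuple())             # DecimalTuple(sign=0, digits=(1, 0, 0), exponent=48)

# normalize() strips the trailing zeros and picks one canonical form.
print(b.normalize().as_tuple())   # DecimalTuple(sign=0, digits=(1,), exponent=0)
```

Every digit of the coefficient is stored, so nothing forces the leading
digit to be non-zero the way the binary formats force a leading 1.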

-- 
Richard Damon