Inconsistency producing constant for float "infinity"

Sat Aug 12 18:27:46 EDT 2006

[Tim Peters]
>    ...
>> It has a much better chance of working from .pyc in Python 2.5.
>> Michael Hudson put considerable effort into figuring out whether the
>> platform uses a recognizable IEEE double storage format, and, if so,
>> marshal and pickle take different paths that preserve infinities,
>> NaNs, and signed zeroes.

[Alex Martelli]
> Isn't marshal constrained to work across platforms (for a given Python
> release), and pickle also constrained to work across releases (for a
> given protocol)?

Yes to both.

> I'm curious about how this still allows them to "take different
> paths" (yeah, I _could_ study the sources, but I'm lazy:-)...

Good questions.  Pickle first:  pickle, with protocol >= 1, has always
had a binary format for Python floats, which is identical to the
big-endian IEEE-754 double-precision storage format.  This is
independent of the native C double representation:  even on a non-IEEE
box (e.g., VAX or Cray), protocol >= 1 pickle does the best it can to
/encode/ native doubles in the big-endian 754 double storage /format/.
This is explained in the docs for the "float8" opcode in
pickletools.py:

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and cPickle
             implementations don't share the code -- they should).  It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits.  However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
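
A quick way to see the "identical to big-endian 754" claim is to compare a protocol 1 pickle of a float with what struct produces for '>d'. This is only a sketch, and it assumes a current CPython 3 interpreter rather than the 2.5-era code discussed here; the variable names are made up for illustration.

```python
import pickle
import struct

x = 1.5
payload = pickle.dumps(x, protocol=1)

# A protocol 1 float pickle is the b"G" (BINFLOAT) opcode, then 8 bytes in
# big-endian IEEE-754 double format, then the b"." (STOP) opcode.
expected = b"G" + struct.pack(">d", x) + b"."
same_bytes = payload == expected
```

Running pickletools.dis(payload) on the same bytes should show the BINFLOAT opcode and its decoded argument.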

The problem has been that C89 defines nothing about signed zeroes,
infinities, or NaNs, so even on a 754 box there was no consistency
across platforms in what C library routines like frexp() returned when
fed one of those things.  As a result, what Python's "best non-heroic
effort" code for constructing a 754 big-endian representation actually
did was a platform-dependent accident when fed a 754 special-case
value.  Likewise for trying to construct a native C double from a 754
representation of a 754 special-case value -- again, there was no
guessing what C library routines like ldexp() would return in those
cases.
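
To make the frexp() issue concrete, here is a much-simplified sketch of the "non-heroic" approach: build the 754 bit pattern from a sign test, frexp(), and arithmetic, without ever looking at the native bytes. The function name is hypothetical, the "add a half and chop" rounding needed on non-754 hardware is left out, and infinities and NaNs are rejected outright because the whole point is that nothing portable can be said about them.

```python
import math
import struct

def pack_double_portable(x):
    """Encode a finite float as 8 big-endian IEEE-754 bytes via frexp()."""
    if math.isinf(x) or math.isnan(x):
        raise ValueError("special values are outside what this sketch handles")
    sign = 0
    if x < 0:             # -0.0 < 0 is False, so the sign of minus zero is lost
        sign = 1
        x = -x
    if x == 0:
        return (sign << 63).to_bytes(8, "big")
    f, e = math.frexp(x)  # x == f * 2**e with 0.5 <= f < 1
    f *= 2.0              # renormalize so that 1 <= f < 2
    e -= 1
    if e < -1022:         # subnormal: there is no implicit leading 1 bit
        f = math.ldexp(f, 1022 + e)
        e = 0
    else:
        e += 1023         # biased exponent
        f -= 1.0          # drop the implicit leading 1 bit
    mantissa = int(f * 2.0 ** 52)   # exact: at most 52 fraction bits remain
    bits = (sign << 63) | (e << 52) | mantissa
    return bits.to_bytes(8, "big")
```

For ordinary finite values this agrees with struct.pack('>d', x), but pack_double_portable(-0.0) comes back as the bytes for +0.0, which is one small instance of the special-case trouble described above.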

Part of what Michael Hudson did for 2.5 is add code to guess whether
the native C double format /is/ the big-endian or little-endian 754
double-precision format.  If so, protocol >= 1 pickle in 2.5 uses much
simpler code to pack and unpack Python floats, simply copying from/to
native bytes verbatim (possibly reversing the byte order, depending on
platform endianness).  Python doesn't even try to guess whether a C
double is "normal", or an inf, NaN, or signed zero then, so can't
screw that up -- it just copies the bits blindly.
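
The guess-then-copy idea can be sketched in Python as follows. This is an illustration, not the C code: the probe constant, the function names, and the use of struct's native '@d' format to get at the raw bytes are all assumptions made for the example.

```python
import struct

# A probe double whose 8 IEEE-754 bytes are all different, so that the
# byte order of the native representation can be read off directly.
PROBE = 9006104071832581.0
PROBE_BIG_ENDIAN = b"\x43\x3f\xff\x01\x02\x03\x04\x05"

def guess_double_format():
    native = struct.pack("@d", PROBE)        # raw native bytes of a C double
    if native == PROBE_BIG_ENDIAN:
        return "ieee_big_endian_format"
    if native == PROBE_BIG_ENDIAN[::-1]:
        return "ieee_little_endian_format"
    return "unknown_format"

def pack_big_endian_verbatim(x):
    """Copy the native bytes blindly, reversing them on little-endian boxes."""
    fmt = guess_double_format()
    native = struct.pack("@d", x)
    if fmt == "ieee_big_endian_format":
        return native
    if fmt == "ieee_little_endian_format":
        return native[::-1]
    raise NotImplementedError("non-IEEE platform: needs the frexp-style code")
```

Because nothing here inspects the value, pack_big_endian_verbatim(float('inf')) and pack_big_endian_verbatim(-0.0) produce the expected 754 bit patterns on any box where the guess succeeds.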

That's much better on IEEE-754 boxes, although I bet it still has
subtle problems.  For example, IIRC, 754 doesn't wholly define the
difference in storage formats for signaling NaNs versus quiet NaNs, so
I bet it's still theoretically possible to pickle a signaling NaN on
one 754 box and get back a quiet NaN (or vice versa) when unpickled on
a different 754 box.

Protocol 0 (formerly known as "text mode") pickles are still a crap
shoot for 754 special values, since there's still no consistency
across platforms in what the C string<->double routines produce or
accept for special values.
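
For comparison, this is what a protocol 0 float pickle looks like. The sketch assumes a current CPython 3, where the string<->double conversion is done by Python's own routines rather than the platform C library, so it shows the text format only, not the platform dependence described above.

```python
import pickle

text_pickle = pickle.dumps(1.5, protocol=0)
# Layout: b"F" (FLOAT) opcode, the value as decimal text, a newline,
# then the b"." (STOP) opcode.

inf_pickle = pickle.dumps(float("inf"), protocol=0)
roundtrip_inf = pickle.loads(inf_pickle)
```

Whatever text the writer emits for a special value, the reader has to be able to parse it back, and that is exactly the step that used to depend on the C library.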

Now on to marshal.  Before 2.5, marshal only had a "text mode" storage
format for Python floats, much like protocol=0 pickle.  So, as for
pickle protocol 0, what marshal produced or reconstructed for a 754
special value was a platform-dependent accident.

Michael added a binary marshal format for Python floats in 2.5, which
uses the same code protocol >= 1 pickle uses for serializing and
unserializing Python floats (except that the marshal format is
little-endian instead of big-endian).  These all go thru
floatobject.c's _PyFloat_Pack8 and _PyFloat_Unpack8 now, and a quick
glance at those will show that they take different paths according to
whether Michael's native-format-guessing code decided that the native
format was ieee_big_endian_format, ieee_little_endian_format, or
unknown_format.  The long-winded pre-2.5 pack/unpack code is only used
in the unknown_format case now.
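
The marshal side can be checked the same way. This sketch assumes a current CPython 3, where passing version 2 to marshal.dumps selects the binary float format and version 1 selects the older text format; the b"g" and b"f" type codes are what current CPython emits, not something stated above.

```python
import marshal
import struct

x = 1.5
binary_form = marshal.dumps(x, 2)   # version 2: binary float format
text_form = marshal.dumps(x, 1)     # version 1: older text float format

# Binary form: a one-byte type code followed by the 8 bytes of the
# little-endian IEEE-754 double, i.e. the reverse of pickle's byte order.
expected_binary = b"g" + struct.pack("<d", x)
matches = binary_form == expected_binary
```

Reversing the eight payload bytes of binary_form gives the same bytes that a protocol >= 1 pickle stores after its opcode.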