Dealing with binary data...

Sat Mar 4 14:27:35 EST 2000

[posted & mailed]

[Thomas A. Bryan]
> I'm trying to work with a data file format defined by Fortran programmers.
> I'd like to write some Python to read and write the data.  I like
> Python's struct because I can simply specify '<' at the beginning of
> the format string to guarantee a platform independent reader/writer
> for this format.
>
> I'm hitting one problem.  The format contains a fixed-size series of
> 4-byte (little-endian) floats.  When there isn't enough data to fill
> up the file, each float is padded with the bit pattern of
> ff7f ff7f
> The file format definition explains it as one (little-endian) integer
> 32767 in each of the float's two bytes.

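Unfortunately, that bit pattern doesn't correspond to a finite IEEE-754
float:  read as a little-endian 4-byte float it's 0x7fff7fff, which has an
all-ones exponent and a non-zero fraction, i.e. a NaN.  Python *is* blowing
this, but not where you think <wink>.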
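
A minimal sketch of why those four bytes aren't a finite float, written for
a current Python 3 rather than the 1.5.2 in your transcript.  The helper name
decode_single is made up for illustration; it isn't part of struct:

```python
import struct

def decode_single(raw):
    """Split 4 little-endian bytes into IEEE-754 single-precision fields."""
    (bits,) = struct.unpack('<I', raw)   # reinterpret as unsigned 32-bit int
    sign = bits >> 31                    # 1 sign bit
    exponent = (bits >> 23) & 0xFF       # 8-bit biased exponent
    fraction = bits & 0x7FFFFF           # 23-bit fraction
    return sign, exponent, fraction

padding = struct.pack('<hh', 32767, 32767)   # the file's "no data" pattern
sign, exponent, fraction = decode_single(padding)

# An all-ones exponent (255) means Inf if the fraction is 0, NaN otherwise.
is_nan = exponent == 0xFF and fraction != 0
print(hex(exponent), hex(fraction), is_nan)
```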

> Here's the problem:
>
> Python 1.5.2 (#1, Apr 18 1999, 16:03:16)  [GCC pgcc-2.91.60 19981201
> (egcs-1.1.1  on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import struct
> >>> struct.pack('<hh',32767,32767)
> '\377\177\377\177'
> >>> struct.unpack('<f','\377\177\377\177')
> (6.79235465281e+38,)

That's where Python is blowing it.  Unpacking those bytes should not yield a
normal floating-point value.  Look for the comment

	/* XXX This sadly ignores Inf/NaN issues */

in structmodule.c's unpack_float() function.  It's unclear what it should do
instead, though, since the Python language itself ignores the possibility of
infs and NaNs (what you get there is a platform-dependent crap shoot).

> >>> floatNum = struct.unpack('<f','\377\177\377\177')[0]
> >>> struct.pack('<f',floatNum)
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> OverflowError: float too large to pack with f format

This is legit.  The largest finite IEEE-754 single-precision (4-byte) float
is about 3.4e+38, and Python is getting that part right:

>>> struct.pack('<f', 3.4e38)  # ~= largest finite float
'\236\311\177\177'
>>> struct.pack('<f', 3.41e38) # a little bigger than the largest
Traceback (innermost last):
  File "<pyshell#22>", line 1, in ?
    struct.pack('<f', 3.41e38)
OverflowError: float too large to pack with f format
>>>
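
To see where that limit sits, here's a small sketch, again assuming Python 3
(where byte strings are spelled b'...').  The largest finite single has bits
0x7f7fffff, and anything that would round past it can't be packed.  The
helper fits_in_single is hypothetical, just a wrapper around the exception:

```python
import struct

# Largest finite IEEE-754 single: bits 0x7f7fffff, stored little-endian.
LARGEST_BYTES = b'\xff\xff\x7f\x7f'
largest = struct.unpack('<f', LARGEST_BYTES)[0]
print(largest)   # a bit over 3.4e+38

def fits_in_single(x):
    """True if x can be packed with the 'f' format (hypothetical helper)."""
    try:
        struct.pack('<f', x)
    except (OverflowError, struct.error):
        return False
    return True

print(fits_in_single(3.4e38), fits_in_single(3.41e38))
```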

> ...
> If this behavior is expected, then I suppose that I'll have to
> unpack each float twice...once into a pair of shorts (to check
> for the "no data" values) and then into a float (if the data is
> present).  Then, when I output the data, I'll have to check
> each float.  If it's None, pack the two ints.  If it's not None,
> pack the float.

You'll have to do *something* to distinguish real floats from padding, but
that's up to you.
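
One way to do it, sketched for Python 3:  compare the raw 4 bytes against the
padding pattern before ever treating them as a float, and use None for "no
data".  The names unpack_floats and pack_floats and the None convention are
illustrative assumptions, not anything your file format or struct defines:

```python
import struct

PADDING = struct.pack('<hh', 32767, 32767)   # b'\xff\x7f\xff\x7f'

def unpack_floats(raw):
    """Decode consecutive 4-byte little-endian floats; padding becomes None."""
    if len(raw) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    values = []
    for i in range(0, len(raw), 4):
        chunk = raw[i:i + 4]
        if chunk == PADDING:
            values.append(None)          # "no data" slot
        else:
            values.append(struct.unpack('<f', chunk)[0])
    return values

def pack_floats(values):
    """Inverse of unpack_floats: None is written as the padding pattern."""
    parts = []
    for v in values:
        parts.append(PADDING if v is None else struct.pack('<f', v))
    return b''.join(parts)

data = pack_floats([1.5, None, -2.0])
print(unpack_floats(data))
```

Checking the bytes rather than the unpacked value sidesteps the question of
what the float decoder does with that pattern.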

Another way to detect the dummy values is this:

    import math
    dummy = math.frexp(the_unpacked_float)[1] > 128  # exponent too big for 'f'

because, e.g.,

>>> math.frexp(3.4e38)
(0.999170198199, 128)
>>> math.frexp(3.41e38)
(0.501054467038, 129)
>>>

That's cheap, and it will also catch any other insane input whose unpacked
value is too big to be a 4-byte float.
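
A caveat for later Pythons than the 1.5.2 shown here:  in Python 3, struct
typically hands back a real NaN for that pattern rather than a huge finite
number, and math.frexp reports an exponent of 0 for NaNs and infs, so the
exponent test alone stops firing.  A sketch of a check that should work
either way, under that assumption (looks_like_dummy is a made-up name):

```python
import math
import struct

def looks_like_dummy(x):
    """Flag values no sane 4-byte float could hold (hypothetical helper)."""
    # NaN and Inf: what all-ones-exponent patterns decode to in Python 3.
    if not math.isfinite(x):
        return True
    # The exponent test: a finite value beyond the single-precision range.
    return math.frexp(x)[1] > 128

value = struct.unpack('<f', b'\xff\x7f\xff\x7f')[0]
print(looks_like_dummy(value), looks_like_dummy(3.4e38), looks_like_dummy(6.79e38))
```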

good-to-the-last-bit-ly y'rs  - tim