[Tutor] Reading binary files #2

Mon Feb 9 20:40:43 CET 2009

Hi Bob

some replies below.  One thing I noticed with the "full" file was 
that I ran into problems when the number of records was 10500, and 
the file read got misaligned.  Presumably 10500 is still within the 
range of int?
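
On the range question, a quick check (a sketch in Python 3, not the original script) suggests 10500 sits comfortably inside a 4-byte signed integer, so the misalignment probably has another cause. The name nrecords is only illustrative.

```python
import struct

nrecords = 10500  # the record count mentioned above

# A 4-byte signed integer ('i' with an explicit byte order) holds
# values from -2**31 to 2**31 - 1.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
in_range = INT_MIN <= nrecords <= INT_MAX

# Round-trip through the big-endian format used elsewhere in this thread.
packed = struct.pack('>i', nrecords)
(unpacked,) = struct.unpack('>i', packed)

print(in_range, repr(packed), unpacked)
```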

Best regards

Alun

At 17:49 09/02/2009, bob gailer wrote:
>etrade.griffiths at dsl.pipex.com wrote:
>>Hi
>>
>>following last week's discussion with Bob Gailer about reading 
>>unformatted FORTRAN files, I have attached an example of the file 
>>in ASCII format and the equivalent unformatted version.
>
>Thank you. It is good to have real data to work with.
>
>>Below is some code that works OK until it gets to a data item that 
>>has no additional associated data, then seems to have got 4 bytes 
>>ahead of itself.
>
>Thank you. It is good to have real code to work with.
>
>>I thought I had trapped this but it appears not.  I think the issue 
>>is associated with "newline" characters or the unformatted equivalent.
>>
>
>I think not, but we will see.
>
>I fail to see where the problem is. The data printed below seems to 
>agree with the files you sent. What am I missing?

When I run the program it exits in the middle but should run through 
to the end.  The output to the console was

236 ('\x00\x00\x00\x10', 'DATABEGI', 0, 'MESS', 
'\x00\x00\x00\x10\x00\x00\x00\x10')
264 ('TIME', '    \x00\x00\x00\x01', 1380270412, '\x00\x00\x00\x10', 
'\x00\x00\x00\x04\x00\x00\x00\x00')

Here "TIME" is in vals[0] when it should be in vals[1] and so on.  I 
found the problem earlier today and I re-wrote the main loop as 
follows (before I saw your helpful coding style comments):

while stop < nrec:

     # extract data structure

     start, stop = stop, stop + struct.calcsize('>4s8si4s4s')
     vals = struct.unpack('>4s8si4s4s', data[start:stop])
     items.extend(vals[1:4])
     print stop, vals

     # define format of subsequent data

     nval = int(vals[2])

     if vals[3] == 'INTE':
         fmt_string = '>i'
     elif vals[3] == 'CHAR':
         fmt_string = '>8s'
     elif vals[3] == 'LOGI':
         fmt_string = '>i'
     elif vals[3] == 'REAL':
         fmt_string = '>f'
     elif vals[3] == 'DOUB':
         fmt_string = '>d'
     elif vals[3] == 'MESS':
         fmt_string = '>%ds' % nval
     else:
         print "Unknown data type ... exiting"
         print items[-40:]
         sys.exit(0)

     # leading spaces

     if nval > 0:
         start, stop = stop, stop + struct.calcsize('4s')
         vals = struct.unpack('4s', data[start:stop])

     # extract data

     for i in range(0,nval):
         start, stop = stop, stop + struct.calcsize(fmt_string)
         vals = struct.unpack(fmt_string, data[start:stop])
         items.extend(vals)

     # trailing spaces

     if nval > 0:
         start, stop = stop, stop + struct.calcsize('4s')
         vals = struct.unpack('4s', data[start:stop])

Now I get this output

232 ('\x00\x00\x00\x10', 'DATABEGI', 0, 'MESS', '\x00\x00\x00\x10')
256 ('\x00\x00\x00\x10', 'TIME    ', 1, 'REAL', '\x00\x00\x00\x10')

and the script runs to the end
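
The 4-byte slip can be reproduced from the two console outputs alone. The sketch below (Python 3, so strings come back as bytes) rebuilds the first 60 bytes implied by those outputs and reads the second header with the old 28-byte format and the new 24-byte one. The byte layout is reconstructed from the printed values and is an assumption about the real file; the 4-byte markers appear to hold the length of the record they enclose.

```python
import struct

marker16 = struct.pack('>i', 16)   # record marker b'\x00\x00\x00\x10'
marker4 = struct.pack('>i', 4)     # record marker b'\x00\x00\x00\x04'

# Two 24-byte headers followed by one 4-byte REAL data record.
header1 = marker16 + b'DATABEGI' + struct.pack('>i', 0) + b'MESS' + marker16
header2 = marker16 + b'TIME    ' + struct.pack('>i', 1) + b'REAL' + marker16
payload = marker4 + struct.pack('>f', 0.0) + marker4
data = header1 + header2 + payload

old_fmt = '>4s8si4s8s'   # 28 bytes: swallows 4 bytes of the next record
new_fmt = '>4s8si4s4s'   # 24 bytes: exactly one header record

old_size = struct.calcsize(old_fmt)
new_size = struct.calcsize(new_fmt)

# Second header read with each format, starting where the first read ended.
misaligned = struct.unpack(old_fmt, data[old_size:2 * old_size])
aligned = struct.unpack(new_fmt, data[new_size:2 * new_size])

print(old_size, misaligned)
print(new_size, aligned)
```

With the old format the integer field lands on the bytes of 'REAL', which read as the big-endian integer 1380270412, the value seen in the first output.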

>FWIW a few observations re coding style and techniques.
>
>1) put the formats in a dictionary before the while loop:
>formats = {'INTE': '>i', 'CHAR': '>8s', 'LOGI': '>i', 'REAL': '>f', 
>'DOUB': '>d', 'MESS': '>d'}
>
>2) retrieve the format in the while loop from the dictionary:
>format = formats[vals[3]]

Neat!!
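
A minimal sketch of the dictionary lookup, written for Python 3 (where struct returns bytes, hence the b'...' keys). The helper name element_format is hypothetical, and MESS is kept out of the table on the assumption that its format depends on nval.

```python
import struct

# Fixed-size element formats keyed by the 4-character type tag.
FORMATS = {
    b'INTE': '>i',
    b'CHAR': '>8s',
    b'LOGI': '>i',
    b'REAL': '>f',
    b'DOUB': '>d',
}

def element_format(type_tag, nval):
    """Return the struct format for one element of the given type."""
    if type_tag == b'MESS':
        # Length-dependent, so it cannot live in the static table.
        return '>%ds' % nval
    try:
        return FORMATS[type_tag]
    except KeyError:
        raise ValueError('Unknown data type: %r' % type_tag)

print(element_format(b'REAL', 1), struct.calcsize(element_format(b'DOUB', 1)))
```

Raising an exception for an unknown tag replaces the print-and-exit branch of the if/elif chain, so the caller can decide what to report.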

>3) condense the 3 infile lines:
>data = open("test.bin","rb").read()

I still don't quite trust myself to "chain" functions together, but I 
guess that's lack of practice
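
For comparison, here is a sketch of three equivalent ways to read the whole file (Python 3). The original "3 infile lines" are not shown in this message, so the step-by-step version is a guess at them, and a temporary file stands in for test.bin.

```python
import os
import tempfile

# Create a small stand-in for test.bin so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix='.bin')
os.write(fd, b'\x00\x00\x00\x10DATABEGI')
os.close(fd)

# Step by step: open, read, close.
infile = open(path, 'rb')
data_steps = infile.read()
infile.close()

# Chained: the file object returned by open() is used immediately.
data_chained = open(path, 'rb').read()

# With a context manager, which closes the file even if read() fails.
with open(path, 'rb') as infile:
    data_with = infile.read()

os.remove(path)
print(data_steps == data_chained == data_with, len(data_with))
```

The chained form never closes the file explicitly, so the with form is often preferred once it is available.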

>4) nrec is a misleading name (to me it means # of records), nbytes 
>would be better.

Agreed

>5) Be consistent with the format between calcsize and unpack:
>struct.calcsize('>4s8si4s8s')
>
>6) Use meaningful variable names instead of val for the unpacked data:
>blank, name, length, typ = struct.unpack ... etc

Will do
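
Points 5 and 6 can be combined by compiling the format once with struct.Struct, so the size and the unpacking can never disagree, and by unpacking straight into named variables. This is a sketch in Python 3; the names HEADER, lead and trail are illustrative, and the sample bytes are rebuilt from the console output rather than read from the file.

```python
import struct

# One Struct object keeps the size and the unpacking in step.
HEADER = struct.Struct('>4s8si4s4s')

sample = (struct.pack('>i', 16) + b'TIME    ' + struct.pack('>i', 1)
          + b'REAL' + struct.pack('>i', 16))

start, stop = 0, HEADER.size
lead, name, length, typ, trail = HEADER.unpack(sample[start:stop])

print(stop, name, length, typ)
```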

>7) The format for MESS should be '>d' rather than '>%dd' % nval. 
>When nval is 0 the for loop will make 0 cycles.

Wasn't sure about that one.  "MESS" implies string but I wasn't sure 
what to do about a zero-length string

>8) You don't have a format for DATA (BEGI); therefore the prior 
>format (for CHAR) is being applied. The formats are the same so it 
>does not matter but could be confusing later.

DATABEGI should be a keyword to indicate the start of the "proper" 
data which has format MESS (i.e. string).  You did make me look again 
at the MESS format and it should be '>%ds' % nval and not '>%dd' % nval.
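
On the zero-length question: struct accepts a zero-length string format, so '>%ds' % nval is safe when nval is 0. A short sketch in Python 3 (the helper name mess_format is hypothetical):

```python
import struct

def mess_format(nval):
    # For 's' the number is the string length in bytes, not a repeat count.
    return '>%ds' % nval

empty = struct.unpack(mess_format(0), b'')
text = struct.unpack(mess_format(16), b'sixteen bytes...')

# For 'd' the number is a repeat count: '>2d' means two 8-byte doubles.
doubles_size = struct.calcsize('>2d')

print(struct.calcsize(mess_format(0)), empty, text, doubles_size)
```

If a MESS item ever had nval > 0, the loop as posted would apply this format nval times, so a single unpack may be what is wanted there; no such record appears in this thread.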