Newbie question: how to process a binary file

Fri May 14 11:40:24 EDT 1999

Paul Stillwell <paul at polariscomm.com> wrote:
: Hi,

: I am new to Python so I'm sorry if this question has been asked and
: answered.  I have looked all over and can't find the info I need.  Here is
: the problem:

: I am writing an application that FTPs to a server and gets a binary file.
: The file has a known format (to me) and I need to parse the file into bytes
: (8 bits), longs (32 bits) and long longs (64 bits) based on the format of
: the file.  So, there could be 4 bytes followed by a long long, followed by
: another 4 bytes, etc.  To further complicate the problem, some of the longs
: will have to be byte swapped.  The reading of the bytes seems fairly
: straightforward (using file.read(1)), but getting the data into the correct
: type seems to be the trick.  If I were doing this in C I could just cast the
: data, but I'm not sure how to achieve the same effect in Python.

: So, any suggestions?  I am wide open to any ideas anyone may have.  I am
: concerned about the time it will take to process the file (I actually have 3
: different files to download and parse.  Each file having a different
: format), so I don't want to apply my C mindset to this problem because I am
: sure that I will end up with slower code and a much less elegant solution.

: Thanks for any help you can provide!

You want to be using the struct module.  Unless the files are very large,
I would suggest downloading them before processing them.
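
As a minimal sketch of what that looks like for a layout like the one
described above (4 bytes, then a long, then 4 more bytes): this assumes
the data is big endian, uses a made-up sample record, and is written in
current Python 3 syntax.  The names RECORD_FMT and parse_record are
only for illustration.

```python
import struct

# Hypothetical layout: 4 unsigned bytes, one 32-bit unsigned long,
# then 4 more unsigned bytes.  '>' selects big-endian byte order with
# standard sizes and no padding between fields.
RECORD_FMT = '>4BL4B'

def parse_record(data):
    """Split one packed record into (first_bytes, value, last_bytes)."""
    fields = struct.unpack(RECORD_FMT, data)
    return fields[:4], fields[4], fields[5:]

# Made-up sample record, 12 bytes long.
sample = bytes([1, 2, 3, 4]) + b'\x00\x00\x01\x00' + bytes([5, 6, 7, 8])
head, value, tail = parse_record(sample)
print(head, value, tail)
```

The format string plays the role of the C cast: it says how many bytes
each field takes and how to interpret them.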

Either way, the slowest part will be reading the data (from disk or
the net), so for efficiency read as much as you can into a buffer at a
time and process the records from there.
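
To illustrate the buffering idea, here is a rough sketch in current
Python 3 that pulls in many fixed-size records per read() call and
unpacks them from memory.  The record layout ('>HL') and the names
RECORD and read_records are made up for the example.

```python
import io
import struct

# Hypothetical record: a 16-bit tag followed by a 32-bit value, big endian.
RECORD = struct.Struct('>HL')

def read_records(stream, records_per_read=1024):
    """Yield unpacked records, fetching many records per read() call."""
    chunk_size = RECORD.size * records_per_read
    pending = b''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break                      # end of data
        pending += chunk
        # Only unpack whole records; keep any partial one for the next pass.
        usable = len(pending) - len(pending) % RECORD.size
        for offset in range(0, usable, RECORD.size):
            yield RECORD.unpack_from(pending, offset)
        pending = pending[usable:]

# Made-up data: five records held in memory instead of a real file.
data = b''.join(RECORD.pack(i, i * 1000) for i in range(5))
records = list(read_records(io.BytesIO(data), records_per_read=2))
print(records)
```

Any trailing bytes that do not make up a whole record are silently
dropped here; a real parser might want to report them instead.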

The struct module of this era has no format code for long longs (the
'q' and 'Q' codes were only added in Python 2.2), but they are pretty
easy to interpret by hand.

Here is an example:

  import struct, time

  def unpackstr(s, i=0):
    for c in s:
      if c == '\000': break
      i = i + 1
    return s[:i]

  # the C struct declaration  (on AIX 4.2.1)
  #   char ut_user[8];
  #   char ut_id[14];
  #   char ut_line[12];
  #   short ut_type;
  #   pid_t ut_pid; (int)
  #   short ut_exit.e_termination;
  #   short ut_exit.e_exit;
  #   time_t ut_time; (long)
  #   char ut_host[16];
  # the format string to represent that structure
  fmt = '8s14s12shi2hl16s'
  # constants (for ut_type):
  UT_EMPTY =           0
  UT_RUN_LVL =         1
  UT_BOOT_TIME =       2
  UT_OLD_TIME =        3
  UT_NEW_TIME =        4
  UT_INIT_PROCESS =    5
  UT_LOGIN_PROCESS =   6
  UT_USER_PROCESS =    7
  UT_DEAD_PROCESS =    8
  UT_ACCOUNTING =      9
  record_size = struct.calcsize(fmt)

  # read who is logged on to a UNIX system
  file = open('/etc/utmp', 'rb')
  block = file.read(record_size)
  while len(block) == record_size:  # stop at EOF or a truncated record
    (user, id, line, type, pid, term, exit, ut_time, host) = \
      struct.unpack(fmt, block)
    # only print entries for logged-in users (not other info)
    if type == UT_USER_PROCESS:
      user = unpackstr(user) # remove the null characters
      line = unpackstr(line)
      host = unpackstr(host)
      if host:
        host = '(%s)' % host
      print '%-8s %-12s %-24s %s' % (user, line, time.ctime(ut_time), host)
    block = file.read(record_size)
  file.close()

As you can see there is a bit of initial planning to be done, but it
is pretty easy to use once you get the hang of it. :)
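
The unpackstr() helper above walks the string one character at a time.
A shorter way to get the same result is sketched here in current
Python 3 syntax, where struct hands back bytes objects.  The name
trim_nul and the two-field record are made up for this example.

```python
import struct

def trim_nul(field):
    """Return the part of a fixed-width field before the first NUL byte."""
    return field.split(b'\0', 1)[0]

# Made-up record: an 8-byte user field and a 12-byte line field.
# The 's' code pads short values with NUL bytes when packing.
packed = struct.pack('8s12s', b'root', b'pts/0')
user, line = struct.unpack('8s12s', packed)
print(trim_nul(user), trim_nul(line))
```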

For long longs, I would suggest something like:
  fmt = '>lL'  # big endian: signed high 32 bits, unsigned low 32 bits
  (l1, l2) = struct.unpack(fmt, buffer)
  l_value = (long(l1) << 32) | long(l2)

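People will probably find better (more correct?) ways to do this; it is
just a quick & dirty example of how you might unpack them.  The catch is
byte order: the example assumes the high-order 32 bits come first in the
data (big endian).  For little-endian data the two halves arrive in the
opposite order.  A '<' or '>' prefix on the format string selects the
byte order explicitly instead of using the machine's native one.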
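
Here is a slightly fuller sketch of the same idea that handles both byte
orders, written in current Python 3 (where plain ints no longer
overflow, so long() is not needed).  The function name unpack_int64 is
made up.  The 'q' code used to build the test data exists in the struct
module from Python 2.2 on and is used here only to check the result.

```python
import struct

def unpack_int64(data, big_endian=True):
    """Combine two 32-bit halves into one signed 64-bit value."""
    if big_endian:
        hi, lo = struct.unpack('>lL', data)   # high word comes first
    else:
        lo, hi = struct.unpack('<Ll', data)   # low word comes first
    # hi is signed and carries the sign; lo is unsigned.
    return (hi << 32) | lo

value = -1234567890123
big = unpack_int64(struct.pack('>q', value), big_endian=True)
little = unpack_int64(struct.pack('<q', value), big_endian=False)
print(big == value, little == value)
```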

Byte swapping can be done in a similar way:
  short = 34212
  packed = struct.pack('H', short)  # 'H': 34212 does not fit a signed short
  (swapped,) = struct.unpack('H', packed[1:] + packed[:1])
or:
  b1, b2 = struct.unpack('2c', struct.pack('H', short))
  (swapped,) = struct.unpack('H', struct.pack('2c', b2, b1))
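
If the goal is only to read values stored in the other byte order, the
'<' and '>' prefixes can do the swap during unpacking, so no manual
shuffling is needed.  A small sketch in current Python 3, with a made-up
helper name swap16:

```python
import struct

raw = bytes([0x85, 0xA4])              # two bytes as they might sit in a file

(big,) = struct.unpack('>H', raw)      # interpret as big endian
(little,) = struct.unpack('<H', raw)   # interpret as little endian
print(big, little)                     # 34212 42117

def swap16(value):
    """Swap the two bytes of an unsigned 16-bit value."""
    return struct.unpack('<H', struct.pack('>H', value))[0]

print(swap16(34212))                   # 42117
```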

  -Arcege