[Numpy-discussion] Fastest way to parse a specific binary file

Robert Kern robert.kern at gmail.com
Wed Sep 2 11:11:35 EDT 2009


On Wed, Sep 2, 2009 at 09:38, Gökhan Sever <gokhansever at gmail.com> wrote:
> Hello,
>
> I want to parse a binary file that holds information about the experiment
> configuration as well as the data. Both the configuration and data
> sections are variable-length. A chunk of this data is shown below (after a
> binary read operation):
>
> '\x00\x00@\x00$\x00\x02\x00\x12\x00\xff\x00\x00\x00U\xaa\xfa\xffd\x00\x08\x00\x01\x00\x08\x00\xff\x00\x00\x00U\xaa\xfb\xffl\x00\xab\x00\x01\x00\xab\x00\xff\x00\x00\x00U\xaa\xe7\x03\x17\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00U\xaa\xd9\x07\x04\x00\x02\x00\r\x00\x06\x00\x03\x00\x00\x00\x01\x00\x00\x00\xd9\x07\x04\x00\x02\x00\r\x00\x06\x00\x03\x00\x00\x00\x01\x00\x00\x00prj.300\x00;
> Version = 1\n', 'ProjectName = PME1 2009 King Air N825ST\n', 'FlightId =
> \n', 'AircraftType = WMI King Air 200\n', 'AircraftId = N825ST\n',
> 'OperatorName = Weather Modification Inc.\n', 'Comments = \n', '\x00\x00@
>
> In binary form the file is 1.3 MB, and when written to a text file it expands
> to 3.7 MB, totalling approximately 4 million characters. When fully processed
> (with IDL code) it produces 86 separate configuration files and 46 ASCII
> data files, covering about 10-15 different instruments in various
> combinations and sampling rates.
>
> I attempted to use the re module, but parsing the file takes much longer
> than I expected. What would be the wisest and fastest way to
> tackle this issue? Upon successful reconstruction of the data and metadata,
> I am planning to use a more modular format like HDF5 or netCDF4 for
> easy data storage and analysis.

Are there fixed delimiters? Like '\x00\x00@\x00' perhaps? It might be
faster to search for those using .find() instead of regexes.
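
A minimal sketch of the .find() idea, assuming '\x00\x00@\x00' really is a
record delimiter (that is only a guess from the sample above, not something
the file format is known to guarantee). The name split_on_delimiter is
hypothetical, and the code uses Python 3 bytes objects:

```python
# Assumed delimiter, guessed from the sample dump; verify against the real format.
DELIM = b"\x00\x00@\x00"


def split_on_delimiter(data, delim=DELIM):
    """Return the chunks of `data` that begin at each occurrence of `delim`.

    Any bytes before the first delimiter are dropped.
    """
    offsets = []
    pos = data.find(delim)
    while pos != -1:
        offsets.append(pos)
        # Resume the search just past the delimiter that was found.
        pos = data.find(delim, pos + len(delim))

    # Each chunk runs from one delimiter offset to the next (or to end of data).
    ends = offsets[1:] + [len(data)]
    return [data[start:end] for start, end in zip(offsets, ends)]
```

Reading the whole file with open(path, 'rb').read() and passing the result
to this function would give one bytes object per candidate record, which can
then be inspected or decoded separately.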

Without more information about how the file format gets split up, I'm
not sure we can make good suggestions.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


