[Numpy-discussion] Fastest way to parsing a specific binay file

Gökhan Sever gokhansever at gmail.com
Fri Sep 4 17:07:23 EDT 2009


On Thu, Sep 3, 2009 at 12:22 AM, Robert Kern <robert.kern at gmail.com> wrote:

> On Wed, Sep 2, 2009 at 23:59, Gökhan Sever <gokhansever at gmail.com> wrote:
>
> > Robert,
> >
> > You must have thrown a couple of RTFMs while replying to my emails :)
>
> Not really. There's no manual for this. Greg Wilson's _Data Crunching_
> may be a good general introduction to how to think about these
> problems.
>
> http://www.pragprog.com/titles/gwd/data-crunching
>
> > I usually take trial-and-error approaches initially, and don't give up
> > unless I hit a hurdle very fast, which in this case happened with the
> > unsuccessful regex approach. On the bright side, I have learned the
> > basics of regular expressions and realized how powerful they can be for
> > a text-parsing task.
> >
> > Enough prattle, below is what I am working on:
> >
> > So far I have successfully extracted the file names and the data
> > associated with those names (with the exception of the
> > multiple-buffers-per-file cases).
> >
> > However, I am not reading the time increments correctly. I should be
> > seeing 1-second time ticks from the time-segment reads, but all it does
> > is return the same first time information.
> >
> > Furthermore, I still couldn't figure out how to wrap the main looping
> > suite (range(500) is just a dummy number that lets me process the whole
> > binary data). I don't know yet how to make the range input generic so
> > that it will work with any size of similar binary file.
>
> while True:
>   ...
>
>   if no_more_data():
>       break
>
> > import numpy as np
> > import struct
> >
> > f = open('test.sea', 'rb')
> >
> > dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
> >                ('numberBytes', np.uint16), ('samples', np.uint16),
> >                ('bytesPerSample', np.uint16), ('type', np.uint8),
> >                ('param1', np.uint8), ('param2', np.uint8),
> >                ('param3', np.uint8), ('address', np.uint16)])
> >
> >
> > start = 0
> > ct = 0
> >
> > for i in range(500):
> >
> >     header = np.fromstring(f.read(dt.itemsize), dt)[0]
> >
> >     if header['tagNumber'] == 65530:
> >         loc = f.tell()
> >         f.seek(start + header['dataOffset'])
> >         f.read(header['numberBytes'])
>
> Presumably you are doing something with this data, not just discarding it.
>
> >         f.seek(loc)
>
> This is equivalent to f.seek(loc, 0): the default whence is already 0,
> which means an absolute seek from the beginning of the file, while
> f.seek(nbytes, 1) seeks forward from the current position. Passing the
> 0 explicitly makes the intent clear.
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco


Thanks for the suggestions, and sorry for the late reply. I was trying to
stay offline and read some dissertations.

I am getting somewhere with the code. I fixed the generic reading case with
a try-except block on ValueError, and so far it works fine. As seen below, I
was able to read the time and a specific data segment, in this case the
Cloud Condensation Nuclei (CCN) data recorded on the acquisition system.
However, a few oddities remain. Take the following print-out for example:

1344416
(18000, 84, 110, 1, 256, 37, 10, 0, 0, 61441)
1344468
H,13:09:51,0.59,0.00,28.43,32.65,36.26,26.60,29.54,38.12,27.98,45.01,453.77,426.25,76.25,0.14,9.35,3.34,0.00

It is supposed to print the correct data when I seek the cursor to
1344416+84 and read 110 more characters, but it doesn't work that way. To
make it work correctly I have to seek 52 bytes ahead instead of 84, which I
compensate for with (f.tell() - dt.itemsize), and that takes me exactly
where I want to be.
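
My current guess (an assumption on my part, not something the file-format
docs confirm) is that dataOffset is measured from the start of the buffer,
i.e. from the position of the buffer's first tag header, rather than from
the header just read. A minimal sketch of that arithmetic for the CCNC
record above, where two headers precede the data header:

loc = f.tell()                               # 1344416: just past this header
buffer_start = loc - dt.itemsize * 2         # back up over the two headers
f.seek(buffer_start + header['dataOffset'])  # 1344416 - 32 + 84 = 1344468
print f.read(header['numberBytes'])          # the 110-byte H,13:09:51,... record
f.seek(loc, 0)                               # resume scanning headers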

####################################################################################################################################

The other point I discovered while studying the contents of this binary file
format and the existing IDL code is that there are many similar lines of
code occurring across the scripts, as well as outright code replicas. This
is because the current postprocessing tool suite has been configured to work
on many different field campaigns. That is to say, the existing code base
can process the binary data from a lab setup, or from a campaign in Mali or
in Saudi Arabia, by following its pre-defined processing scripts. Each
campaign has its own specific configuration, kept in those config files I
mentioned very early on, i.e. which instrument was connected to the system,
at what port, the type of communication (serial, analog), the sampling rate,
etc. Instead of taking this path, I can parse those config files and build a
very generic postprocessing script suite, independent of the campaign, based
on the fact that all of the config files are placed in the binary file
itself.
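
As a rough illustration of that idea (the key=value layout below is invented
for the sketch; the real config format inside the .sea files differs), the
parsed config could drive a generic setup:

# Hypothetical sketch: let the config text embedded in the binary file
# drive a generic postprocessing setup.  The key=value format is made up
# for illustration only.
def parse_config(text):
    config = {}
    for line in text.splitlines():
        if '=' in line:
            key, value = line.split('=', 1)
            config[key.strip()] = value.strip()
    return config

cfg = parse_config("instrument=DMT-CCNC\nport=COM3\nrate=1")
print cfg['instrument']        # -> DMT-CCNC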

####################################################################################################################################

I should have mentioned earlier that my primary intention is to unify our
ADPAA kit using Python. https://www.ohloh.net/p/adpaa/analyses/latest -- as
shown by Ohloh's stats, it approaches roughly 170k lines of code in 6-8
different languages, with IDL being the master language. Currently all the
code has been written in a procedural, linear fashion, and the majority of
the tasks are interconnected, i.e. in order to create a higher-level
analysis file, the lower stages in the processing hierarchy must be executed
first. I don't know how much we would gain by rewriting these parts in an
object-oriented way, but it would definitely give us neater processing
software with less repeated code. I still don't quite get the pros and cons
of basing the design on Traits; I might ask your opinions on integrating a
Traits-based design into this type of project. As a very early estimate, I
am thinking of writing 5 to 10 times less code than what has been written up
to now. That said, I have less than a year to finish my degree here, and
there are very many field papers and books to read.

Good day.



#!/usr/bin/env python

import numpy as np
import struct

f = open('test.sea', 'rb')

# Directory-entry header layout of the .sea file format.
dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
               ('numberBytes', np.uint16), ('samples', np.uint16),
               ('bytesPerSample', np.uint16), ('type', np.uint8),
               ('param1', np.uint8), ('param2', np.uint8),
               ('param3', np.uint8), ('address', np.uint16)])

start = 0

while True:
    try:
        header = np.fromstring(f.read(dt.itemsize), dt)[0]

        ### Read time (tag 0 at address 43605)
        if header['tagNumber'] == 0 and header['address'] == 43605:
            start = f.tell() - dt.itemsize   # buffer starts at this header
            loc = f.tell()                   # remember where to resume
            print f.tell()
            print header
            f.seek(start + header['dataOffset'])
            print f.tell()
            print struct.unpack('9H', f.read(18))
            print struct.unpack('9H', f.read(18))
            f.seek(loc, 0)                   # jump back and keep scanning

        # Read DMT-CCNC data (two headers precede this one in the buffer)
        if header['tagNumber'] == 18000 and header['type'] == 37:
            start = f.tell() - dt.itemsize*2
            loc = f.tell()
            print f.tell()
            print header
            f.seek(start + header['dataOffset'])
            print f.tell()
            print f.read(header['numberBytes'])
            print f.tell()
            f.seek(loc, 0)

    except ValueError:
        # a short read at EOF makes np.fromstring raise ValueError
        break
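
By the way, a slightly more explicit variant of the termination test would
check for a short read at end-of-file instead of relying on the ValueError
from np.fromstring (same behaviour, just more deliberate):

# Sketch: stop on an explicit short read instead of catching ValueError.
while True:
    raw = f.read(dt.itemsize)
    if len(raw) < dt.itemsize:   # short read: no more headers in the file
        break
    header = np.fromstring(raw, dt)[0]
    # ... process the header as above ...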


-- 
Gökhan