how to parse numeric data files

andrew cooke andrew at acooke.org
Tue Apr 29 14:11:02 EDT 2003


if you do parse with regexps, you may want to consider enforcing rules that
detect when files aren't formatted as you expect.  for example, specify
every line completely (don't throw away the end of a line just because you
don't expect anything else there) and specify what information should
always be present.  otherwise you can miss data (eg when a file format is
"improved" slightly during an upgrade to one of your testing machines) for
a long time.
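
a minimal sketch of that kind of strict checking, assuming a made-up
"name: value" line format.  the field name PH_wafer_count, the REQUIRED
set and the parse_strict function are all hypothetical (only the
"PH_lot_id:" tag appears in this thread); the point is that the pattern
must match the whole line and that missing fields are reported instead of
being silently skipped.

```python
import re

# hypothetical line format: "<name>: <value>" and nothing else on the line
LINE = re.compile(r"(?P<name>[A-Za-z_]\w*):[ \t]*(?P<value>\S+)")

# hypothetical set of fields that every file must contain
REQUIRED = {"PH_lot_id", "PH_wafer_count"}


def parse_strict(text):
    """Parse 'name: value' lines, rejecting anything unexpected."""
    fields = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are allowed in this sketch
        # fullmatch: the pattern must cover the entire line, so trailing
        # junk is an error rather than something quietly thrown away
        match = LINE.fullmatch(stripped)
        if match is None:
            raise ValueError(f"line {lineno}: unexpected format: {line!r}")
        name = match.group("name")
        if name in fields:
            raise ValueError(f"line {lineno}: duplicate field {name!r}")
        fields[name] = match.group("value")
    missing = REQUIRED - fields.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return fields
```

with this, a line like "PH_lot_id: A123 extra" raises an error instead of
losing the "extra", and so does a file with no PH_wafer_count line.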

personally, i'd try to go via a complete formal spec using a grammar to
avoid this kind of problem.  the example you gave should be pretty easy to
parse since every field is prefixed with a name.  it looks like LL(1), in
which case yapps should work (i know nothing about it, it's just the first
hit on googling for "python parser"):
http://theory.stanford.edu/~amitp/Yapps/
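
to give a feel for what LL(1) means in practice, here is a hand-written
sketch (this is not yapps, and not its api) for a made-up grammar where
every field is prefixed with a name.  the grammar, the token names and
the tokenize/parse functions are assumptions for illustration only; the
parser decides what to do next by looking at a single token.

```python
import re

# hypothetical grammar (not taken from the original data file):
#   file  ::= field* END
#   field ::= NAME ":" value
#   value ::= NUMBER | WORD
TOKEN = re.compile(
    r"\s*(?:(?P<NAME>[A-Za-z_]\w*)(?=:)"      # identifier directly before ':'
    r"|(?P<COLON>:)"
    r"|(?P<NUMBER>-?\d+(?:\.\d+)?)(?!\w)"
    r"|(?P<WORD>\w+))"
)


def tokenize(text):
    """Yield (kind, value) pairs, ending with an END token."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"bad input at offset {pos}: {text[pos]!r}")
        pos = match.end()
        yield match.lastgroup, match.group(match.lastgroup)
    yield "END", ""


def parse(text):
    """Parse the grammar above using one token of lookahead."""
    tokens = tokenize(text)
    kind, value = next(tokens)
    fields = {}
    while kind != "END":
        if kind != "NAME":
            raise SyntaxError(f"expected a field name, got {kind} {value!r}")
        name = value
        kind, value = next(tokens)
        if kind != "COLON":
            raise SyntaxError(f"expected ':' after {name!r}")
        kind, value = next(tokens)
        if kind == "NUMBER":
            fields[name] = float(value)
        elif kind == "WORD":
            fields[name] = value
        else:
            raise SyntaxError(f"expected a value for {name!r}, got {kind}")
        kind, value = next(tokens)
    return fields
```

a parser generator would build the equivalent of parse() from the three
grammar rules in the comment, which is the part the engineers would
actually have to read and maintain.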

another disadvantage of regular expressions is that you need to add error
handling yourself.  a good parser generator should give you pretty useful
error handling from the start.  maybe this doesn't matter so much if this
is for internal use, though.

the big disadvantage of a formal grammar + parser is (of course) that as
soon as the language can't be described directly by a grammar that the
parser generator can handle, knowing "that boring stuff about LALR" can be
critical.  the good news with LL(1) is that it is easy to understand (your
engineers will find it intuitive); the bad news is that LL(1) is pretty
weak, so it's more likely that some file format won't fit.  however, a
quick scan of yapps suggests it has extensions that make it considerably
more powerful (actually, from a 30-second first impression, i'm quite
impressed - otoh it may be quite slow).

if i were you i'd try yapps in my free time and see how it goes.  if it's
too nasty, try regexps and cross your fingers...

andrew

Peter Hansen said:

> george young wrote:
>>
>> We have several electronic testing machines (of various ages and
>> manufacturers) that spew out testing data files in various ascii
>> formats.
>> Currently we have a nasty mess of awk/shell/C/fortran programs that
>> extract and process some data from these files.  I have a dream of
>> a suite of simple, clear, maintainable python programs to do these
>> tasks.
>>
>> The trick is I hope to come up with something that our hardware
>> engineers can understand and maintain easily without studying
>> things like BNF, LALR etc. (they won't).
>>
>> [Below is a sample of one of the worst formats, shortened from a 40MB
>> file!]
>
> The format is quite amenable to parsing with re, if rather large...
> the real question is how much of that data do you need, and what do
> you need to do with it?  What do your current scripts actually do?
> Also, how much of the content you showed is *fixed* format, and how
> much of the format can vary?  Is anything optional?
>
> If you want the hardware engineers to be able to maintain it, you might
> want to support a kind of "template" specification, where you provide the
> names of various tags which are recognized (e.g. "PH_lot_id:") and you
> automatically extract the appropriate value found thereafter.
>
> What you're trying to do is not really that complex, I think, so I
> do think you should be able to find a good simple solution for it.
> As you said, this _is_ Python...
>
> -Peter
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>


-- 
http://www.acooke.org/andrew




