mxTools (was Re: why no "do : until"?)
Andrew Dalke
dalke at acm.org
Wed Jan 10 04:48:28 EST 2001
Paul Prescod wrote asked:
>Cna mxTextTools work with streams without reading them all into memory?
>I find it quite a pain to use hand-crafted RE-based lexers on streaming
>information.
No, it cannot. Nor can it use a memory mapped file. :(
Luckily, my largest files are agglomeration of much smaller
records so I ended up using the two-level approach of parsing
out a record then passing it to mxTextTools to parse. I
also use mxTextTools to identify the records. In this case I
went through the pain of making that code chunkable, but because
the separators all fit on a single line it was at least doable.
Going back to /F's earlier question of what mxTextTools does
that the sre module doesn't, let me point out a project I've
been working on called Martel (see
http://www.biopython.org/~dalke/Martel).
A lot of the formats in computational biology and chemistry
are pretty simple from a CS language theory viewpoint. They
are much closer to regular than context free. However,
lexical identification is very dependent on where the text is
in the file so lex/yacc is cumbersome to use.
What I've done is write a parser generator which takes the
format description as a regular expression pattern and
creates the mxTextTools tag table. This parses the file to
generate a parse tree, which is traversed to produce SAX
events to a ContentHandler. The event names are defined
using the (?P<named>group) notation.
By using mxTextTools instead of sre I could use my alternate
method of passing parse tree information back using SAX
events. Compare that to sre where if a group has a repeat
match in the text only the last match is returned. I could
also create some new, non-standard syntax, like
(?P<count>\d+)( (?P<word>\w+)){count}
which matches the "word" pattern "count" times, so it would
match
1 Spam
3 Spam and eggs
but not
2 Ni
On the other hand, my implementation doesn't allow the same
amount of backtracking that a real regular expression engine
has, but I think that's because I haven't been able to figure
out how to do it so it works quickly - my parsers using
mxTextTools outperforms hand-written ones in Python and Perl.
Time Description
23:28.59 using mxTextTools
28:54.69 hand-written in Python
30:12.85 hand-written in Perl
38:20.65 a different Perl implementation
Yes, Python parses faster than Perl for this case!
As a tangential remark, the records start with a line which
starts with "ID". One of the packages uses Perl's ability
to redefine the record delimiter from "\n" to "^ID". That
identification occurs in Perl's internals so should be quite
fast. The mxTextTools equivalent is even faster:
0:56.63 grep "^ID"
1:27.89 using mxTextTools to find "^ID"
1:47.71 using Perl's special variable
I guess now I should use mmap and see how long it takes the
sre module to find the 800,000 "\nID" records in my file. :)
Oh, and I'll be presenting my paper about this project at
the upcoming Python conference.
Andrew
dalke at acm.org
More information about the Python-list
mailing list