mxTools (was Re: why no "do : until"?)

Wed Jan 10 04:48:28 EST 2001

Paul Prescod wrote asked:
>Cna mxTextTools work with streams without reading them all into memory?
>I find it quite a pain to use hand-crafted RE-based lexers on streaming
>information.

No, it cannot.  Nor can it use a memory mapped file.  :(

Luckily, my largest files are agglomeration of much smaller
records so I ended up using the two-level approach of parsing
out a record then passing it to mxTextTools to parse.  I
also use mxTextTools to identify the records.  In this case I
went through the pain of making that code chunkable, but because
the separators all fit on a single line it was at least doable.

Going back to /F's earlier question of what mxTextTools does
that the sre module doesn't, let me point out a project I've
been working on called Martel (see
http://www.biopython.org/~dalke/Martel).

A lot of the formats in computational biology and chemistry
are pretty simple from a CS language theory viewpoint.  They
are much closer to regular than context free.  However,
lexical identification is very dependent on where the text is
in the file so lex/yacc is cumbersome to use.

What I've done is write a parser generator which takes the
format description as a regular expression pattern and
creates the mxTextTools tag table.  This parses the file to
generate a parse tree, which is traversed to produce SAX
events to a ContentHandler.  The event names are defined
using the (?P<named>group) notation.

By using mxTextTools instead of sre I could use my alternate
method of passing parse tree information back using SAX
events.  Compare that to sre where if a group has a repeat
match in the text only the last match is returned.  I could
also create some new, non-standard syntax, like

  (?P<count>\d+)( (?P<word>\w+)){count}

which matches the "word" pattern "count" times, so it would
match
   1 Spam
   3 Spam and eggs
but not
   2 Ni

On the other hand, my implementation doesn't allow the same
amount of backtracking that a real regular expression engine
has, but I think that's because I haven't been able to figure
out how to do it so it works quickly - my parsers using
mxTextTools outperforms hand-written ones in Python and Perl.

 Time      Description
23:28.59  using mxTextTools
28:54.69  hand-written in Python
30:12.85  hand-written in Perl
38:20.65  a different Perl implementation

Yes, Python parses faster than Perl for this case!

As a tangential remark, the records start with a line which
starts with "ID".  One of the packages uses Perl's ability
to redefine the record delimiter from "\n" to "^ID".  That
identification occurs in Perl's internals so should be quite
fast.  The mxTextTools equivalent is even faster:

0:56.63 grep "^ID"
1:27.89 using mxTextTools to find "^ID"
1:47.71 using Perl's special variable

I guess now I should use mmap and see how long it takes the
sre module to find the 800,000 "\nID" records in my file.  :)

Oh, and I'll be presenting my paper about this project at
the upcoming Python conference.

                    Andrew
                    dalke at acm.org