Plex 0.1 - A Lexical Analysis Module

Tim Peters tim_one at email.msn.com
Wed Feb 9 00:08:44 EST 2000


[posted & mailed]

[Greg Ewing]
> Having spent a couple of years wondering "when is someone
> going to write a decent lexing module for Python", I finally
> decided to do it myself.

Yay!

> You can take a sneak preview of what I've come up with so
> far at:
>
> http://www.cosc.canterbury.ac.nz/~greg/python/Plex
>
> Briefly, you construct a scanner from a bunch of regexps,
> rather like flex, set it loose on a character stream,
> and it returns you tokens.
>
> Main selling point: All the regexps are compiled into a single
> DFA, which allows input to be processed in time linear in the
> number of characters to be scanned, regardless of the number
> or complexity of the regexps.
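[The single-DFA idea quoted above can be sketched in a few lines. This is a hypothetical illustration, not Plex's actual API: two token patterns — integers and identifiers — merged into one transition table, scanned with a single longest-match pass so each character is examined a bounded number of times.]

```python
# Hypothetical sketch of the single-DFA approach; not Plex's API.
# One transition table covers both patterns:
#   INT   = [0-9]+
#   IDENT = [A-Za-z_][A-Za-z0-9_]*

def char_class(ch):
    """Map a character to the input class the DFA switches on."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha() or ch == "_":
        return "letter"
    return "other"

# Single merged DFA: missing entries mean "no transition".
TRANSITIONS = {
    ("start", "digit"): "int",
    ("start", "letter"): "ident",
    ("int", "digit"): "int",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
}
ACCEPTING = {"int": "INT", "ident": "IDENT"}

def tokenize(text):
    """Longest-match scan over the merged DFA."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] in " \t\n":        # skip whitespace
            i += 1
            continue
        state, last = "start", None
        j = i
        while j < len(text):
            state = TRANSITIONS.get((state, char_class(text[j])))
            if state is None:
                break
            j += 1
            if state in ACCEPTING:    # remember last accepting position
                last = (j, ACCEPTING[state])
        if last is None:
            raise ValueError("lexical error at position %d" % i)
        end, kind = last
        tokens.append((kind, text[i:end]))
        i = end
    return tokens
```

The point of the sketch: no matter how many patterns are folded into the table, the scanner makes one table lookup per character, which is where the linear-time claim comes from.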

I don't have time to play, but gave it a quick look.  Three suggestions, two
of which you already beat me to <wink>:

1. C for speed.  mxTextTools is your (only?) competition here.  The lack of
a standard blazing lexer has hurt many Python projects, and wastes time on
many more as people micro-optimize the snot out of one-shot "trick the re
engine" approaches.

2. More conventional (regexp) syntax (although I wouldn't give up the
"functional" notation you have now either!  especially not in light of the
next one).

3. Less conventional (for a scanner) FSM operations.  I've long used (but
never got around to releasing or documenting) a set of overly general Python
FSM classes that support direct computation of regular language
intersection, complement, difference, and reversal, as well as regexpy
unions and catenations.  This can be very handy (match everything like
*this* except like *that* ...), and they all fit into the "single
linear-time DFA" model.  It's a great way to increase the practical power
without actually doing more work <wink -- although complements are tricky to
get right if you refuse to specify the alphabet in advance>.
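[The operations described in point 3 can be sketched directly. This is a hypothetical illustration of the standard constructions, not the classes mentioned above: intersection via the product construction, complement by flipping accepting states, difference as intersection with a complement. It also shows concretely why complement is tricky without a fixed alphabet — flipping accepting states is only sound once the transition function is total over a known alphabet.]

```python
# Hypothetical sketch of DFA set operations over a fixed alphabet.
from itertools import product

class DFA:
    def __init__(self, states, alphabet, delta, start, accept):
        self.states, self.alphabet = states, alphabet
        self.delta, self.start, self.accept = delta, start, accept

    def accepts(self, s):
        q = self.start
        for ch in s:
            q = self.delta[(q, ch)]   # total: defined for every (state, char)
        return q in self.accept

    def complement(self):
        # Correct only because delta is total over a fixed alphabet:
        # every string reaches *some* state, so flipping acceptance
        # exactly inverts the language.
        return DFA(self.states, self.alphabet, self.delta,
                   self.start, self.states - self.accept)

    def intersect(self, other):
        # Product construction: run both DFAs in lockstep; accept
        # when both would.  Still a single linear-time DFA.
        states = set(product(self.states, other.states))
        delta = {((p, q), ch): (self.delta[(p, ch)], other.delta[(q, ch)])
                 for (p, q) in states for ch in self.alphabet}
        accept = {(p, q) for (p, q) in states
                  if p in self.accept and q in other.accept}
        return DFA(states, self.alphabet, delta,
                   (self.start, other.start), accept)

    def difference(self, other):
        # "everything like *this* except like *that*":
        # L1 - L2 == L1 & ~L2
        return self.intersect(other.complement())
```

For example, intersecting a DFA for "contains an a" with one for "contains a b" (over the alphabet {a, b}) yields a DFA accepting exactly the strings containing both — the kind of match-this-but-not-that combination the paragraph above is describing.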

encouragingly y'rs  - tim

More information about the Python-list mailing list