Regexp optimization question

Greg Ewing greg at cosc.canterbury.ac.nz
Mon Apr 26 01:11:38 EDT 2004


Magnus Lie Hetland wrote:
> BTW: I have not done some experiments with Plex with lots of regular
> expressiosn; simply compiling a pattern with 500 alternatives took
> forever, whereas re.compile was quite fast.

Yeow, that's really thrashing Plex to death. Did you wait
for it to finish? I'm curious how long it actually took.

Plex is really designed for parsing programming languages
and the like, where the number of patterns is small enough
to be written by hand, and some common sense used in the
process. For example, if you have a lot of patterns with a
regularity to them such as "foo1", "foo2", ..., "foo500"
it will be much faster to compile a pattern such as

   Str("foo") + (Range("19") | (Range("09") + Range("09"))
      | (Range("14") + Range("09") + Range("09")) | Str("500"))

than to compile a huge expression with 500 alternatives.
The resulting state machine will likely be a lot smaller,
too.

On the plus side, scanning will be just as fast however
you write the REs, which is not true for most RE
implementations of the Perl variety.

One more tip: If you're careful what sort of things you
use for actions, you can pickle the Lexicon, so that it
only needs to be generated once rather than each time the
program starts up. I'm currently doing this in Pyrex, and
it works quite well. Let me know if you want any more info
about doing this.

-- 
Greg Ewing, Computer Science Dept,
University of Canterbury,	
Christchurch, New Zealand
http://www.cosc.canterbury.ac.nz/~greg




More information about the Python-list mailing list