How to get the "longest possible" match with Python's RE module?

Tim Peters tim.peters at gmail.com
Tue Sep 12 02:52:41 EDT 2006


[Licheng Fang]
> ...
> In fact, what I'm doing is handling a lot of regular expressions. I
> wanted to build VERY LONG regexps part by part and put them all into a
> file for easy modification and maintenance. The idea is like this:
>
> (*INT) = \d+
> (*DECIMAL) = (*INT)\.(*INT)
> (*FRACTION) = (*DECIMAL)/(*DECIMAL)
> (*NUMERALS) = (*FRACTION)|(*DECIMAL)|(*INT)
> ... ...
>
> What's inside the syntactically invalid (* and ) is something to be
> replaced; I wrote a little script to do the string substitution and
> produce the very long regexps to be compiled. I thought that might be
> a way to handle long and complex regexps, and in the course of doing
> it I encountered the problem with the semantics of '|'.
>
> My approach may sound desperate and funny, but please, do you have any
> good idea as to how to handle long and complex regexps?
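
The mechanical part is easy enough:  the substitution scheme you
describe needs only a few lines of Python.  A rough sketch, using your
definitions above (the expand() helper and the (?:...) wrapping are my
own inventions, not anything standard):

    import re

    DEFINITIONS = [
        ("INT",      r"\d+"),
        ("DECIMAL",  r"(*INT)\.(*INT)"),
        ("FRACTION", r"(*DECIMAL)/(*DECIMAL)"),
        ("NUMERALS", r"(*FRACTION)|(*DECIMAL)|(*INT)"),
    ]

    def expand(definitions):
        # Expand (*NAME) references; each entry may refer only to
        # names defined before it, so stored values are already flat.
        expanded = {}
        for name, pattern in definitions:
            for ref, repl in expanded.items():
                # The non-capturing group keeps '|' precedence inside
                # one piece from leaking into its neighbors.
                pattern = pattern.replace("(*%s)" % ref, "(?:%s)" % repl)
            expanded[name] = pattern
        return expanded

    numerals = re.compile(expand(DEFINITIONS)["NUMERALS"])

The (?:...) wrapping matters:  pasting "a|b" into "x(*P)y" without it
would yield "xa|by", which groups as "(xa)|(by)".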

My best advice is to find a different approach entirely.  For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.
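
A toy sketch of that split, for just the numeral forms above (the
token pattern and the classify() name are made up for illustration,
not any particular parsing library):

    import re

    TOKEN = re.compile(r"\d+|\.|/")      # gross tokenization only

    def classify(s):
        # numeral  ::= fraction | decimal | int
        # fraction ::= decimal "/" decimal
        # decimal  ::= int "." int
        toks = TOKEN.findall(s)
        if "".join(toks) != s:
            return None                  # stray characters
        def is_decimal(t):
            return (len(t) == 3 and t[1] == "." and
                    t[0].isdigit() and t[2].isdigit())
        if (len(toks) == 7 and toks[3] == "/" and
                is_decimal(toks[:3]) and is_decimal(toks[4:])):
            return "FRACTION"
        if is_decimal(toks):
            return "DECIMAL"
        if len(toks) == 1 and toks[0].isdigit():
            return "INT"
        return None

"Longest match" falls out naturally here:  the grammar tries FRACTION
before DECIMAL before INT, in plain code you can read and step through.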

Regexps are a brittle tool, best tolerated in small doses.  For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with.  OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).
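
To make the '|' problem concrete:  Python's engine commits to the
first alternative that succeeds at the current position, not the
longest one, so with ambiguous prefixes the order of alternatives
changes the answer:

    >>> import re
    >>> re.match(r"\d+|\d+\.\d+", "3.14").group()
    '3'
    >>> re.match(r"\d+\.\d+|\d+", "3.14").group()
    '3.14'

A POSIX leftmost-longest engine would report '3.14' in both cases.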

Just sounds to me like the wrong tool for the job.


