[Python-ideas] Proposed convenience functions for re module

Talin talin at acm.org
Wed Jul 22 18:23:25 CEST 2009


Steven D'Aprano wrote:
> Following the thread "Experiment: Adding "re" to string objects.", I 
> would like to propose the addition of two convenience functions to the 
> re module:
> 
> 
> def multimatch(s, *patterns):
>     """Do a re.match on s using each pattern in patterns, 
>     returning the first one to succeed, or None if they all fail."""
>     for pattern in patterns:
>         m = re.match(pattern, s)
>         if m: return m
>

There's a cute trick that you can use to do this that is much more 
efficient than testing each regex expression individually:

   combined_pattern = "|".join("(%s)" % p for p in patterns)
   combined_re = re.compile(combined_pattern)

   m = combined_re.match(string)
   return m.lastindex

Basically, it combines all of the patterns into a single large regex, 
where each pattern is converted into a capturing group. It then returns 
match.lastindex, which is the index of the capturing group that matched. 
This is very efficient because now all of the patterns are combined into 
a single NFA which can prune possibilities very quickly.

This works for up to 99 patterns, which is the limit on the number of 
capturing groups that a regex can have.

I use this technique in my Python-based lexer, Reflex:

    http://pypi.python.org/pypi/reflex/0.1

Now, if we are talking about convenience functions, what I would really 
like to see is a class that wraps a string that allows matches to be 
done incrementally, where each successful match consumes the head of the 
string, leaving the remainder of the string for the next match.

This can be done very efficiently since the regex functions all take a 
start-index parameter. Essentially, the wrapper class would update the 
start index each time a successful match was performed.

So something like:

    stream = MatchStream(string)
    while 1:
       m = stream.match(combined_re)
       # m is a match object
       # Do something with m

Or even an iterator over matches (this assumes that you want to use the 
same regex each time, which may not be the case for a parser):

    stream = MatchStream(string)
    for m in stream.match(combined_re):
       # m is a match object
       # Do something with m

-- Talin



More information about the Python-ideas mailing list