New implementation of re module

MRAB python at mrabarnett.plus.com
Thu Jul 30 18:06:05 EDT 2009


Piet van Oostrum wrote:
>>>>>> MRAB <python at mrabarnett.plus.com> (M) wrote:
> 
>> M> Hi all,
>> M> I've been working on a new implementation of the re module. The details
>> M> are at http://bugs.python.org/issue2636, specifically from
>> M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>> M> Python 2.6 on Windows if you want to try it out.
> 
>> M> I'm interested in how fast it is generally, compared with the current re
>> M> module, but especially when faced with those 'pathological' regular
>> M> expressions which seem to take a long time to finish, for example:
> 
>> M>     re.search(r"^(.+|D)*A$", "x" * 25 + "B")
> 
>> M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
>> M> with this new implementation.
> 
> Is this version also going to use the Thompson approach?

That approach is what inspired the method in the new implementation, but
its speed comes from asking _whether_ there's a match rather than
_where_ the match is, so it can ignore the route taken to reach it.
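The core of the Thompson approach can be sketched in a few lines: instead of backtracking over one path at a time, the engine advances a *set* of live NFA states for each input character, so the work per character is bounded by the NFA size. This is my own illustration with a hand-built NFA for the pattern "ab*c", not code from the new module, which compiles patterns to states automatically:

```python
# Hand-built NFA for "ab*c": states are ints, ACCEPT is the final state.
ACCEPT = 3
char_moves = {(0, 'a'): {1}, (1, 'b'): {1}, (2, 'c'): {ACCEPT}}
eps_moves = {1: {2}}          # epsilon edge: "b*" may match zero times

def eps_closure(states):
    """Expand a state set along epsilon transitions."""
    stack, closed = list(states), set(states)
    while stack:
        for nxt in eps_moves.get(stack.pop(), ()):
            if nxt not in closed:
                closed.add(nxt)
                stack.append(nxt)
    return closed

def matches(text):
    """Breadth-first simulation: one bounded state set per character."""
    current = eps_closure({0})
    for ch in text:
        moved = set()
        for st in current:
            moved |= char_moves.get((st, ch), set())
        current = eps_closure(moved)
        if not current:           # no live states left; fail early
            return False
    return ACCEPT in current
```

Because the state set can never grow beyond the number of NFA states, the simulation answers "is there a match?" in linear time, but it doesn't remember which path produced the match.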

"grep", for example, selects the lines which match a pattern, whereas
the typical Python (and Perl) usage is to extract part of a string, or
to find which part of the string matches.
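The contrast in standard 're' usage (the strings and pattern here are just made-up examples):

```python
import re

# grep-style usage: ask only *whether* the text matches
print(bool(re.search(r"error", "disk error on /dev/sda")))  # True

# Python/Perl-style usage: ask *which part* matched, and capture pieces
m = re.search(r"error on (\S+)", "disk error on /dev/sda")
print(m.group(0))   # 'error on /dev/sda' -- the matching part
print(m.group(1))   # '/dev/sda' -- the captured group
```

Answering the second kind of question requires tracking the route through the pattern, which is exactly what the pure Thompson simulation discards.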

Trying to support lazy and possessive quantifiers in Thompson's approach
is tricky (at least, I haven't been able to figure it out), and
back-references are right out! :-)
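A quick example of why back-references don't fit: the set of strings a back-reference matches depends on what an earlier group captured in *this* input, which a finite automaton built ahead of time can't track. With the current backtracking 're' module:

```python
import re

# \1 must repeat whatever group 1 matched -- here, a doubled word.
m = re.search(r"\b(\w+) \1\b", "it was was a typo")
print(m.group(0))   # 'was was'
print(m.group(1))   # 'was'
```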

There's always the chance that I could come up with some sort of hybrid
scheme, applying different approaches depending on certain conditions.
It already switches from depth-first to breadth-first when it has taken
too many iterations without advancing, and it merges similar states,
which is why it's so much better with 'pathological' patterns; I just
have to make the depth-first phase as fast as the current 're' module.
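The blowup in the quoted example is visible at smaller sizes too: with the current 're' module, each extra "x" roughly doubles the time to report failure, because (.+|D)* can split the run of x's in exponentially many ways before the trailing "B" forces every path to fail. A rough timing sketch (the sizes are arbitrary; absolute times depend on the machine):

```python
import re
import time

# Backtracking explores every way of partitioning the x's,
# so the failure time grows roughly as 2**n.
for n in (10, 14, 18):
    start = time.perf_counter()
    result = re.search(r"^(.+|D)*A$", "x" * n + "B")
    print(n, result, round(time.perf_counter() - start, 4))
```

A breadth-first engine with state merging visits each (state, position) pair at most once, so the same searches stay near-constant in time.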



More information about the Python-list mailing list