[Python-Dev] re performance

Wed Feb 1 23:37:16 EST 2017

On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze <srkunze at mail.de> wrote:
> Hi folks,
>
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
>
> However, reaching the chart in the lower third of the article, I saw Python
> 2.4 measured against a naive Thompson matching implementation. And I was
> surprised about how bad it performed compared to an unoptimized version of
> an older than dirt algorithm.

> From my perspective, I can say, that regular expressions might worth
> optimizing especially for web applications (url matching usually uses
> regexes) but also for other applications where I've seen many tight loops
> using regexes as well. So, I am probing interest on this topic here.

What I (think I) know:
- Both re and regex use the same C backend, which is not based on NFA.
- The re2 library, which the writer of that article made, allows
capture groups (but only up to a limit) and bounded repetitions (up to
a limit).
- Perl has started to optimize some regex patterns.

What I think:
- The example is uncharacteristic of what people write, like URL matching.
- But enabling naive code to perform well is usually a good thing. The
fewer details newbies need to know to write code, the better.
- It's possible for Python to optimize a large category of patterns,
even if not all.
- It's also possible to optimize large parts of patterns with
backrefs. E.g. If there's a backref, the group that the backref refers
to can still be made into an NFA.
- To do the above, you'd need a way to generate all possible matches.
- Optimization can be costly. The full NFA construction could be
generated only upon request, or maybe the code automatically tries to
optimize after 100 uses (like a JIT). This should only be considered
if re2's construction really is costly.

If people want NFAs, I think the "easiest" way is to use re2. Jakub
Wilk mentioned it before, but here it is again.
https://github.com/google/re2

re2 features:
https://github.com/google/re2/wiki/Syntax

Named references aren't supported, but I think that can be worked
around with some renumbering. It's just a matter of translating the
pattern.

It also doesn't like lookarounds, which AFAIK are perfectly doable
with NFAs. Looks like lookaheads and lookbehinds are hard to do
without a lot of extra space (https://github.com/google/re2/issues/5).

Facebook has a Python wrapper for re2.
https://github.com/facebook/pyre2/

In a post linked to from this thread, Serhiy mentioned another Python
wrapper for re2. This wrapper is designed to be like re, and should
probably be the basis of any efforts. It's not been updated for 11
months, though.
https://pypi.python.org/pypi/re2/
https://github.com/axiak/pyre2/