[Python-Dev] re performance

Sat Jan 28 06:44:37 EST 2017

Hi Sven,

On 26 January 2017 at 22:13, Sven R. Kunze <srkunze at mail.de> wrote:
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html

Theoretical regular expressions and what Python/Perl/etc. call regular
expressions are a bit different.  You can read more about it at
https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
.

Discussions about why they are different often focus on
backreferences, which is a rare feature.  Let me add two other points.

The theoretical kind of regexp is about giving a "yes/no" answer,
whereas the concrete "re" or "regexp" modules gives a match object,
which lets you ask for the subgroups' location, for example.  Strange
at it may seem, I am not aware of a way to do that using the
linear-time approach of the theory---if it answers "yes", then you
have no way of knowing *where* the subgroups matched.

Another issue is that the theoretical engine has no notion of
greedy/non-greedy matching.  Basically, you walk over the source
character and it answers "yes" or "no" after each of them.  This is
different from a typical backtracking implementation.  In Python:

>>> re.match(r'a*', 'aaa')
>>> re.match(r'a*?', 'aaa')

This matches either three or zero characters in Python.  The two
versions are however indistinguishable for the theoretical engine.

A bientôt,

Armin.