[Python-Dev] re performance
Armin Rigo
armin.rigo at gmail.com
Sat Jan 28 06:44:37 EST 2017
Hi Sven,
On 26 January 2017 at 22:13, Sven R. Kunze <srkunze at mail.de> wrote:
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
Theoretical regular expressions and what Python/Perl/etc. call regular
expressions are a bit different. You can read more about it at
https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
.
Discussions about why they are different often focus on
backreferences, which are a rarely used feature. Let me add two other points.
The theoretical kind of regexp is about giving a "yes/no" answer,
whereas the concrete "re" or "regexp" modules give a match object,
which lets you ask for the subgroups' locations, for example. Strange
as it may seem, I am not aware of a way to do that using the
linear-time approach of the theory: if it answers "yes", then you
have no way of knowing *where* the subgroups matched.
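For reference, this is the kind of information a match object exposes.
A minimal sketch; the pattern and input string are made up for
illustration and do not come from the discussion above:

```python
import re

# Hypothetical pattern and input, chosen only for illustration.
m = re.match(r'(a+)(b+)', 'aaabb')

whole_span = m.span()      # (start, end) of the entire match
group1_span = m.span(1)    # where the first subgroup matched
group2_span = m.span(2)    # where the second subgroup matched
print(whole_span, group1_span, group2_span)
```

A pure "yes/no" recognizer would only report that 'aaabb' is accepted,
without the per-group spans.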
Another issue is that the theoretical engine has no notion of
greedy/non-greedy matching. Basically, you walk over the source
characters and it answers "yes" or "no" after each of them. This is
different from a typical backtracking implementation. In Python:
>>> import re
>>> re.match(r'a*', 'aaa')
>>> re.match(r'a*?', 'aaa')
The first call matches three characters and the second matches zero.
To the theoretical engine, however, the two patterns are
indistinguishable, because they accept exactly the same set of strings.
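A small sketch of the difference, assuming CPython's re module. The
helper answers_for_a_star is a hypothetical hand-written recognizer for
the language a*, not part of any library; it only records the yes/no
answer after each character, which is all the theoretical engine gives
you:

```python
import re

# Backtracking engine: both patterns succeed, but with different lengths.
greedy = re.match(r'a*', 'aaa')
lazy = re.match(r'a*?', 'aaa')
greedy_len = greedy.end()   # the greedy star consumes all the 'a's
lazy_len = lazy.end()       # the non-greedy star stops immediately


def answers_for_a_star(text):
    """Hypothetical recognizer for a*: yes/no for each prefix of text."""
    ok = True                # the empty prefix is in a*
    answers = [ok]
    for ch in text:
        ok = ok and ch == 'a'
        answers.append(ok)
    return answers


# a* and a*? describe the same language, so the recognizer is the same.
prefix_answers = answers_for_a_star('aaa')
print(greedy_len, lazy_len, prefix_answers)
```

The recognizer says "yes" at every prefix length from 0 to 3; greedy
versus non-greedy is only a choice of which of those answers to report.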
A bientôt,
Armin.