[ python-Feature Requests-1662581 ] the re module can perform poorly: O(2**n) versus O(n**2)

SourceForge.net noreply at sourceforge.net
Wed Apr 4 23:23:01 CEST 2007


Feature Requests item #1662581, was opened at 2007-02-17 18:39
Message generated for change (Settings changed) made by tjreedy
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1662581&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
>Category: Performance
>Group: None
Status: Open
Resolution: None
Priority: 3
Private: No
Submitted By: Gregory P. Smith (greg)
Assigned to: Nobody/Anonymous (nobody)
Summary: the re module can perform poorly: O(2**n) versus O(n**2)

Initial Comment:
in short, the re module can degenerate to really really horrid performance.  See this for how and why:

 http://swtch.com/~rsc/regexp/regexp1.html

exponential decline instead of squared.

I don't have a patch so i'm filing this bug as a starting point for future work.  The Modules/_sre.c files implementation could be updated to use the parallel stepping Thompson approach instead of recursive backtracking.

filing this as a bug until me or someone else comes up with a patch.

----------------------------------------------------------------------

Comment By: Gregory P. Smith (greg)
Date: 2007-02-22 17:30

Message:
Logged In: YES 
user_id=413
Originator: YES

yeah this is better as a feature request.  certianly low priority either
way.

-nothing- I propose doing would change the syntax or behaviour of existing
regular expressions at all.  Doing so would be a disaster.  thompson nfa
does not imply changing the behaviour.

anyways its a lot more than a simple "patch" to change the re module to
not use backtracking so i expect this to languish unless someone has a of
free time and motivation all at once. :)


----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-02-22 03:51

Message:
Logged In: YES 
user_id=341410
Originator: NO

I would file this under "feature request"; the current situation isn't so
much buggy, as slow.  While you can produce a segfault with the current
regular expression engine (due to stack overflow), you can do the same
thing with regular Python on Linux (with sys.setrecursionlimit), ctypes,
etc., and none of those are considered as buggy.

My only concern with such a change is that it may or may not change the
semantics of the repeat operators '*' and '+', which are currently defined
as "greedy".  If I skimmed the article correctly late at night, switching
to a Thompson family regular expression engine may result in those
operators no longer being greedy.  Please correct me if I am wrong.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1662581&group_id=5470


More information about the Python-bugs-list mailing list