[Python-Dev] Some questions about maintenance of the regular expression code.
Gary Herron
gherron@islandtraining.com
Wed, 26 Feb 2003 10:34:11 -0800
On Wednesday 26 February 2003 10:23 am, M.-A. Lemburg wrote:
> Gary Herron wrote:
> > On Wednesday 26 February 2003 01:08 am, M.-A. Lemburg wrote:
> >>Gary Herron wrote:
> >>>The first glance at the regular expression bug list and the _sre.c
> >>>code results in the observation that several of the bugs are related
> >>>to running over the recursion limit. The problem comes from using a
> >>>pattern containing ".*?" in a situation where it is expected to match
> >>>many thousands of characters. Each character matched by ".*?" causes
> >>>one level or recursion, quickly overflowing the recursion limit.
> >>
> >>Wouldn't it be possible for the RE compiler to issue a warning in
> >>case these kind of patterns are used ? This would be much more helpful
> >>than trying to work-around the user problem.
> >
> > I think not. It's not the pattern that's the problem. A pattern
> > containing ".*?" is perfectly legitimate and useful.
>
> Hmm, could you explain where ".*?" is useful ?
Yes, easily. It's the non-greedy version of "match all". The manual
page for the re module has this nice example:
*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much
text as possible. Sometimes this behaviour isn't desired; if the RE
<.*> is matched against '<H1>title</H1>', it will match the entire
string, and not just '<H1>'. Adding "?" after the qualifier makes it
perform the match in non-greedy or minimal fashion; as few
characters as possible will be matched. Using .*? in the previous
expression will match only '<H1>'.
> > The problem
> > arises when the pattern is used on a string which has thousands of
> > characters which match. By that point the RE compiler is right out of
> > the picture.