re module non-greedy matches broken

André Malo auch-ich-m at g-kein-spam.com
Mon Apr 4 02:39:36 EDT 2005


* "lothar" <lothar at ultimathule.nul> wrote:

> this response is nothing but a description of the behavior i reported.

Then you have not read my response carefully enough.

> as to whether this behaviour was intended, one would have to ask the module
> writer about that.

No, I've responded with a view on regexes, not on the module. That is the way
_regexes_ work. Non-greedy quantifiers do not guarantee the minimal-length
match; they are just ... non-greedy (technically, the backtracking engine tries
the shortest repetition first and extends it only when the rest of the pattern
fails, where a greedy one tries the longest first). They *may* find the
shortest match, but that's a special case. That is why I said the
documentation is incomplete.

Actually, your expectations go a bit beyond the documentation. From a certain
point of view (a match always starts at the leftmost possible position) the
matches you're seeing *are* the minimal-length matches.
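
To make the leftmost-first point concrete, here is a small sketch (written for
current Python; `shortest_match` is a hypothetical brute-force helper made up
for this illustration, not part of the re module). It compares what a
non-greedy search returns with the shortest substring the pattern could match
anywhere in the text:

```python
import re

text = 'aaabab'

# Non-greedy: the match is anchored at the leftmost 'a' and only then
# extended as little as possible.
lazy = re.search('a.*?b', text).group(0)

# Greedy, for comparison: same starting point, extended as far as possible.
greedy = re.search('a.*b', text).group(0)


def shortest_match(pattern, s):
    """Hypothetical helper: brute-force the shortest substring of s that
    the whole pattern matches, regardless of where it starts."""
    compiled = re.compile(pattern)
    candidates = [
        s[i:j]
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
        if compiled.fullmatch(s[i:j])
    ]
    return min(candidates, key=len) if candidates else None


print(lazy)
print(greedy)
print(shortest_match('a.*?b', text))
```

On 'aaabab' the non-greedy search gives 'aaab' and the greedy one 'aaabab',
while the brute-force shortest candidate is 'ab'. The non-greedy result is
minimal only among the matches that start at the leftmost position.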

> because of the statement in the documentation, which places no qualification
                                                              ^^^^^^^^^^^^^^^^
                                                              that's the point.

> on how the scan for the shortest possible match is to be done, my guess is
> that this problem was overlooked.

In the docs, yes. But buy yourself a regex book and learn for yourself ;-)
The first thing you should learn about regexes is that the source of pain
of most regex implementations is the documentation, which is very likely
to be wrong.

Finally let me ask a question:

import re
x = re.compile('<.*?>')
print(x.search('<title>...</title><body>...</body>').group(0))

What would you expect to be printed out? <title> or <body>? Why?

nd


