[Tutor] re module fails to handle text > 16142 characters ???

pan@uchicago.edu pan@uchicago.edu
Tue May 20 22:44:02 2003


Thx to all those replied. I did use a minimal-match construct
( .*? ) like what Tim mentioned (see below). 

Since Tim's reply somehow didn't show up on the list (even though
in his email to me the tutor@python.org did show up on the "cc:" 
field of my email program) so I am forwarding his explanation to 
you guys.


pan



Quoting Tim Peters <tim_one@email.msn.com>:

> [pan@uchicago.edu]
> > The other day I was using re to do some parsing and
> > found that one of the text that I tried to parse returned
> > an error :
> >
> >    return reObj.findall(doc)[0]
> >    RuntimeError: maximum recursion limit exceeded
> >
> > After some painful debugging, I found:
> >
> > let S = len(doc)
> >
> > 1. When S > 16143: re.find function failed;
> > 2. When S = 16143: failed too
> > 3. When S = 16142: successful. This was tested by deleting ANY
> >    character in that 16143-long doc.
> >
> > I am using Python 2.2.1. Is this a known-bug???
> 
> You're probably using a minimal-match construct in the regexp, like
> 
>     .*?
> 
> Minimal matches are done using recursion, proportional to the length of the
> substring they're trying to match.  If so, it's a limitation of the
> implementation, but isn't considered to be a bug -- that's just the way it
> works.  If you had infinite memory, it wouldn't complain <wink>.  It's akin
> to trying to write a 20 gigabyte file on a system with a 1 gigabyte disk.
> 
> > I would really like to post the entire code here but it's very
> > long, so I think it's better to ask first to see if it's a known
> > bug.
> 
> Only one statement in your code is relevant:  the regular expression.  For
> example, this program should fail for you too:
> 
> """
> import re
> 
> pattern = re.compile('x*?z')
> print pattern.match('x' * 50000 + 'z')
> """
> 
> It's almost always possible to rewrite a regular expression in a way that
> doesn't require unbounded recursion, but how to do so in all possible cases
> is a book-length topic.  Some are easy, such as the example above:  minimal
> match doesn't accomplish anything there, and the regexp
> 
>     x*z
> 
> matches the same stuff without using a minimal match operator.  The latter
> regexp can be used to match strings with millions of characters with no
> problems.
> 
> A more common transformation is from (for example)
> 
>     \(.*?\)
> 
> to
> 
>     \([^)]*\)
> 
> Again the latter spelling can match parenthesized text with millions of
> characters, but the former will die after some thousands (the exact limit
> depends on your platform -- it runs out of room in C's stack for more
> recursive calls, and the amount of memory devoted to the C stack depends on
> your platform).
> 
> Minimal matches can be convenient but are limited in how long a substring
> they can match.  Another way to get unbounded recursion is with a regexp
> that stacks up too many backtracking points, but that's even harder to
> explain.  Friedl's book "Mastering Regular Expressions" (O'Reilly) is an
> excellent intro to the topic.
>