[Tutor] re module fails to handle text > 16142 characters ???
pan@uchicago.edu
pan@uchicago.edu
Tue May 20 22:44:02 2003
Thx to all those replied. I did use a minimal-match construct
( .*? ) like what Tim mentioned (see below).
Since Tim's reply somehow didn't show up on the list (even though
in his email to me the tutor@python.org did show up on the "cc:"
field of my email program) so I am forwarding his explanation to
you guys.
pan
Quoting Tim Peters <tim_one@email.msn.com>:
> [pan@uchicago.edu]
> > The other day I was using re to do some parsing and
> > found that one of the text that I tried to parse returned
> > an error :
> >
> > return reObj.findall(doc)[0]
> > RuntimeError: maximum recursion limit exceeded
> >
> > After some painful debugging, I found:
> >
> > let S = len(doc)
> >
> > 1. When S > 16143: re.find function failed;
> > 2. When S = 16143: failed too
> > 3. When S = 16142: successful. This was tested by deleting ANY
> > character in that 16143-long doc.
> >
> > I am using Python 2.2.1. Is this a known-bug???
>
> You're probably using a minimal-match construct in the regexp, like
>
> .*?
>
> Minimal matches are done using recursion, proportional to the length of the
> substring they're trying to match. If so, it's a limitation of the
> implementation, but isn't considered to be a bug -- that's just the way it
> works. If you had infinite memory, it wouldn't complain <wink>. It's akin
> to trying to write a 20 gigabyte file on a system with a 1 gigabyte disk.
>
> > I would really like to post the entire code here but it's very
> > long, so I think it's better to ask first to see if it's a known
> > bug.
>
> Only one statement in your code is relevant: the regular expression. For
> example, this program should fail for you too:
>
> """
> import re
>
> pattern = re.compile('x*?z')
> print pattern.match('x' * 50000 + 'z')
> """
>
> It's almost always possible to rewrite a regular expression in a way that
> doesn't require unbounded recursion, but how to do so in all possible cases
> is a book-length topic. Some are easy, such as the example above: minimal
> match doesn't accomplish anything there, and the regexp
>
> x*z
>
> matches the same stuff without using a minimal match operator. The latter
> regexp can be used to match strings with millions of characters with no
> problems.
>
> A more common transformation is from (for example)
>
> \(.*?\)
>
> to
>
> \([^)]*\)
>
> Again the latter spelling can match parenthesized text with millions of
> characters, but the former will die after some thousands (the exact limit
> depends on your platform -- it runs out of room in C's stack for more
> recursive calls, and the amount of memory devoted to the C stack depends on
> your platform).
>
> Minimal matches can be convenient but are limited in how long a substring
> they can match. Another way to get unbounded recursion is with a regexp
> that stacks up too many backtracking points, but that's even harder to
> explain. Friedl's book "Mastering Regular Expressions" (O'Reilly) is an
> excellent intro to the topic.
>