regexp: maximum recursion limit exceeded

Sun Jan 6 14:42:18 EST 2002

[F. GEIGER]
> As I faced the RuntimeError "maximum recursion limit exceeded" when I
> applied re.findall() to an HTML-file to find form contents, I thought
> the reason could be a limit within findall().
>
> So I tried the following to replace findall() with:
>
> def findall(re, string):
>    '''Find all /re/ in /string/.
>
>    Idea: Alex Martelli in a response to a Usenet post on 4th of January
> 2001.
>    '''
>    mos = []
>    pos = 0
>    while 1:
>       mo = re.search(string, pos)
>       if mo is None:
>          return [mo.group(0) for mo in mos]
>       mos.append(mo)
>       pos = mo.end()
>    return None

Python's implementation of re.findall is much the same.

> But the problem remains. Now re.search() reports the same error, which
> means, that not findall() but some deeper mechanisms have problems with
> the string that has to be searched in.

Yes, it's recursion at the C level in the regexp engine.  Alas, you're not
going to gain a useful understanding of that short of studying Friedl's
"Mastering Regular Expressions" book (highly recommended, if you find
yourself unable to resist regexps <wink>).

> If you want to reproduce the error, try this (any ill formatting caused
> by my newsreader, sorry):

Ditto.

> def test():
>    stringToBeSearchedIn = ("<form blablabla>%s</form>" %
> ("<blablabla>blablabla</blablabla> " * 500)) * 100
> #   print stringToBeSearchedIn
>
>    for stringFound in findall(re.compile(r"\<form.*?\/form\>", re.DOTALL |
> re.MULTILINE | re.IGNORECASE), stringToBeSearchedIn)[:10]:
>       print stringFound
>    return
>
> It's very likely, that a form causes this error, if the contents
> between the form tags are large and - more important - have many
> '<tag></tag>' pairs.

Read the book and this won't be a mystery.

> To overcome this, I could
>
> 1) use find() to search for '<form' and '/form>',

Simple, fast and obvious.  No wonder you'd rather fight with regexps <wink>.
Note that it can match all sorts of accidental crap, though, like "/form>"
inside a string or comment.  So will regexps.  If you want to be sure of a
correct result, there's no substitute for real parsing.

> 2) use the SGML parser,

That does "real parsing" so has the lovely property that the result is
always correct.  Also consider htmllib and HTMLParser.

> 3) re.search for the opening tag and kill everything up to it, then
> re.search for the closing tag and kill everything after it.

Have you forgotten re.search's optional pos argument already?

Here's a fourth:  rewrite the regexp, if you don't care about correctness in
all cases (regexps aren't powerful enough to parse HTML correctly -- they're
not powerful enough to parse any language with nested constructs).

pat = re.compile(r"""
    <form
    [^/]*
    (?: /
        (?! form>) .
        [^/]*
    )*
    /form>
""", re.DOTALL | re.IGNORECASE | re.VERBOSE)

You'll find that this works fine in your test program (see Friedl for pages
of explanation, which resist useful summarization).  It will match quickly
provided the string *has* a match; curiously, it's still prone to recursion
overflow if the string doesn't have a match, or if it does and there are
thousands of contained forward slashes.

> BTW, increasing the recursion depth doesn't solve the problem.

You can't increase the regexp engine's recursion depth limit:  the limit
it's griping about is a C stack limit.  sys.setrecursionlimit() is unrelated
to that (it limits the depth of Python-level calls, but the regexp engine
isn't coded in Python).

> What other options do I have? How is this done "Pythonicly"?

When you want to use regular expressions, do you import re or do you write
your own regexp engine?  Python comes with HTML parsers for the same reason:
doing a good job of it is difficult, and there's nothing more Pythonic than
building on other peoples' work.