using re: hitting recursion limit

Tue Oct 26 16:32:42 EDT 2004

Erik Johnson wrote:
> pat = '(<table.*?%s.*?</table>)' % magic_string
> 
>     This seems about as simple as a "real-world" RE can get. If I cut down
> the web page to about 150 lines, this works, but that's not practical - the
> table I need to parse can easily gro to over 1000 lines.

Another option is to use assertion checks.  The following seems to work

 >>> text = """
...
... Blah blah blah.
... <table id="foo">
... asdfasdfadfasdfasdfaf asdfasdfadfasdfasdfaf asdfasdfadfasdfasdfaf
... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
... q u f jkh gertwerkjshk klajw h jhkrjesdfoisdjf sjhdfkjhwer
... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
... werwlkejr wljr wklerwlerwerwuoi hello dif poritertkle ggkdfjgdlg
... </table>
... """
 >>> import re
 >>> magic_string = re.escape("hello")
 >>> pat = re.compile(
...  r"<table (|(.(?!</table>)(?!%s))*.)%s(|(.(?!</table>))*.)</table>"
..      % (magic_string, magic_string), re.DOTALL)
 >>> m = pat.search(text)>>> m.start(0)
18
 >>> m.end(0)
490
 >>> text[m.end(0)-10:m.end(0)]
'g\n</table>'
 >>>

The .(?! ... ) pattern consumes all characters that don't
have the ... immediately afterwards.  So the first big group
skips all characters that don't have magic_string or </table>
immediately afterwards, then it requires the magic_string,
then skips all characters that don't have '</table>' immediately
afterwards, then consumsed the </table>'

The .) is to consume the character which does immediately
follow with the desired pattern.

The (| is for the case when, say, "</table>" immediately
follows magic_string.

>     One question is, "Does getting this error actually mean I have hit my
> process' stack memory limit or is there possibly some artificial recursion
> limit set in front of that?"

It's an artificial recursion limit set to make sure you don't bust your
system's stack.  I think the re engine respects sys.setrecursionlimit().

In your case you have one stack frame for every character, so if
your current limit is 150 characters then you'll need to double the
stack limit to hit 300 character, etc.  You should consider some
other solution.

> It hits this error basically "immediately", so it seems
> unlikely to me that it has actually use all of the machiens memory, or even
> all of my process' memory in under a tenth of a second:

The stack space is a lot smaller than the heap space.  On my
OS X box I only have 8M of stack by default.

% limit
cputime         unlimited
filesize        unlimited
datasize        6144 kbytes
stacksize       8192 kbytes
coredumpsize    0 kbytes
memoryuse       unlimited
descriptors     256
memorylocked    unlimited
maxproc         100
%

 > RE definitely seems like the fast and
> painless (or rather, less painful) way to go here.  So, short of telling me
> to use 2.3 or a way I can reconfigure what the recursion limit is,
> suggestions about smarter ways to pick out a table under those conditions
> are welcomed.

An HTML parser is a better bet.  HTML is just over the border of
what's parsable by HTML -- it works, but only if you can make some
strong assertions about the orgianization of your HTML.  (Eg, that
characters aren't represented by character entities, that whitespace
will be as you expect, and more.)

Have you considered  BeautifulSoup, ElementTree, or using
Python's sgmllib?  See the writing of Mark Pilgrim for using
the last in his FeedParser
   http://sourceforge.net/projects/feedparser/

				Andrew
				dalke at dalkescientific.com