using re: hitting recursion limit

Tue Oct 26 16:00:00 EDT 2004

    I have done a fair amount of regular expression text processing in Perl,
and am currently trying to convert a running Perl script into Python (for a
number of reasons I won't go into here). I have not had any problems with
memory limits using Perl, but in trying to clip out a particular table from
a web page, I am hitting Python's recursion limit.

The RE is pretty simple:

pat = '(<table.*?%s.*?</table>)' % magic_string

    This seems about as simple as a "real-world" RE can get. If I cut down
the web page to about 150 lines, this works, but that's not practical - the
table I need to parse can easily gro to over 1000 lines.  I found the
following bit in the reference manual from section 4.2.6:

#------------------------------------------------------------------------
Avoiding recursion
If you create regular expressions that require the engine to perform a lot
of recursion, you may encounter a RuntimeError exception with the message
maximum recursion limit exceeded. For example,

>>> import re
>>> s = 'Begin ' + 1000*'a very long string ' + 'end'
>>> re.match('Begin (\w| )*? end', s).end()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/sre.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
You can often restructure your regular expression to avoid recursion.

Starting with Python 2.3, simple uses of the *? pattern are special-cased to
avoid recursion. Thus, the above regular expression can avoid recursion by
being recast as Begin [a-zA-Z0-9_ ]*?end. As a further benefit, such regular
expressions will run faster than their recursive equivalents.

#------------------------------------------------------------------------
    Yeah, so this is a known problem, and sure enough, if you run this
example, it will crap out as shown above.  I tried to build 2.3 from sources
before and ran into other funky problems. When I get a standard SuSE
distribution, I will upgrade to 2.3. In the meantime, I'm running 2.2.

If you take the advice and recast the RE as they suggest...

>>> pat = r'Begin [a-zA-Z0-9_ ]*?end'
>>> re.match(pat, s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/sre.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
>>>

    Ummm... wow... big improvement. :(  And what if I am trying to match
'.*?' instead of '\w*?', is trying to build the equivalent mess for dot
going to help me? Didn't help for \w here anyway.  So, there may be various
ways I can hack my way around this, but this is not making me a happy
camper. There are a lot of things I like about Python, but running into this
sort of thing so early is definitely not giving me a warm fuzzy feeling
about using Python as a general replacement for Perl.

    One question is, "Does getting this error actually mean I have hit my
process' stack memory limit or is there possibly some artificial recursion
limit set in front of that?" If there is some number I can push up, I'd be
happy to let it run until the OS axes it for exceeding it's process limits
(i.e., I'm not clear on whether this RuntimeError imples that's what has
already happened). I'm not finding much of anything trying to Google "Python
recursion limit" (though I obviously haven't read all of the 10,000+ pages
Google will hit.) It hits this error basically "immediately", so it seems
unlikely to me that it has actually use all of the machiens memory, or even
all of my process' memory in under a tenth of a second:

ej at sand:~/src/python> time ./find_table.py

pat = '(<table.*?Unit No.*?</table>)'

Traceback (most recent call last):
  File "./find_table.py", line 22, in ?
    match = re.search(pat, html, re.DOTALL)
  File "/usr/lib/python2.2/sre.py", line 137, in search
    return _compile(pattern, flags).search(string)
RuntimeError: maximum recursion limit exceeded

real    0m0.049s
user    0m0.050s
sys     0m0.000s

    I can probably hack my way around this and just use substrings to pick
out my table but there are other situations where there are multiple tables
before, after, and nested around the table that I want, all of which change
dynamically. I can't be very confident that a magic string alone inside
table tags is going to be unique. RE definitely seems like the fast and
painless (or rather, less painful) way to go here.  So, short of telling me
to use 2.3 or a way I can reconfigure what the recursion limit is,
suggestions about smarter ways to pick out a table under those conditions
are welcomed.

    As sort of an aside, I am aware of the ulimit command, as I poked around
with it a bit years ago, but am no systems guru. I remember it being a
stand-alone program under /sbin, but on my SuSE 8.2 system it appears to be
incorporated into bash itself? I do have root privs, so if you think you
have a solution via ulimit, I'm happy to hear, but it would appear I am not
being restricted by that:

ej at sand:~/src/python> ulimit
unlimited
ej at sand:~/src/python>

ej at sand:~/src/python> ulimit -a
core file size        (blocks, -c) 0
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
max locked memory     (kbytes, -l) unlimited
max memory size       (kbytes, -m) unlimited
open files                    (-n) 1024
pipe size          (512 bytes, -p) 8
stack size            (kbytes, -s) unlimited
cpu time             (seconds, -t) unlimited
max user processes            (-u) 2047
virtual memory        (kbytes, -v) unlimited

    Thanks for taking the time to read my post! :)

-ej