[Python-bugs-list] [Bug #116136] sre recursion limit hit in tokenize.py

noreply@sourceforge.net noreply@sourceforge.net
Fri, 6 Oct 2000 22:13:18 -0700


Bug #116136, was updated on 2000-Oct-05 07:21
Here is a current snapshot of the bug.

Project: Python
Category: Library
Status: Open
Resolution: None
Bug Group: Platform-specific
Priority: 6
Summary: sre recursion limit hit in tokenize.py

Details: While running Tim Peters' new reindent script over my code,
I received the following traceback:

checking ./dtcout.py ...
Traceback (most recent call last):
  File "/usr/local/bin/reindent", line 258, in ?
    main()
  File "/usr/local/bin/reindent", line 65, in main
    check(arg)
  File "/usr/local/bin/reindent", line 77, in check
    check(fullname)
  File "/usr/local/bin/reindent", line 90, in check
    if r.run():
  File "/usr/local/bin/reindent", line 135, in run
    tokenize.tokenize(self.getline, self.tokeneater)
  File "/usr/local/lib/python2.0/tokenize.py", line 152, in tokenize
    pseudomatch = pseudoprog.match(line, pos)
RuntimeError: maximum recursion limit exceeded

The pattern compiled into pseudoprog is

[ \f\t]*((\\\r?\n|#[^\r\n]*|([rR]?'''|[rR]?"""))|((0[jJ]|[1-9]\d*[jJ]|((\d+\.\d*|\.\d+)([eE][-+]?\d+)?|[1-9]\d*[eE][-+]?\d+)[jJ])|((\d+\.\d*|\.\d+)([eE][-+]?\d+)?|[1-9]\d*[eE][-+]?\d+)|(0[xX][\da-fA-F]*[lL]?|0[0-7]*[lL]?|[1-9]\d*[lL]?))|((\+=|\-=|\*=|%=|/=|\*\*=|&=|\|=|\^=|>>=|<<=|\+|\-|\*\*|\*|\^|~|/|%|&|\||<<|>>|==|<=|<>|!=|>=|=|<|>)|[][(){}]|(\r?\n|[:;.,`]))|([rR]?'(\\.|[^\n'\\])*('|\\\r?\n)|[rR]?"(\\.|[^\n"\\])*("|\\\r?\n))|[a-zA-Z_]\w*)

I am not even going to try to figure out what part of the re triggers the excessive recursion.  Here is the source file reindent was munching on when the exception was raised.  Hopefully the bug submission process won't destroy it.  If so, let me know (skip@mojam.com) and I'll send it to you.

def dtmlfunc(**__d):
    import sys, urllib, string
    output = []
    append = output.append
    append('<!-- Generated: Wed Mar 10 12:13:08 1999 -->')
    append('\012<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> <!-- -*-html-helper -*- -->\012<HTML>\012<HEAD>\012<TITLE>Musi-Cal Search: ')
    append("""%s""" % __d['search_key'])
    append('</TITLE>\012<LINK REV="made" HREF="mailto:concertmaster@musi-cal.com">\012</HEAD>\012<BODY bgcolor="#ffffff" text="#000000" link="#CC0000" alink="#FF3300" vlink="#330099"\012>\012\012')
    append("""%s""" % __d['banner'])
    append('\012\012<H2 align="center">Musi-Cal Search Response</H2>\012\012')
    if __d['repeat_msg']:
        append('\012  ')
        append("""%s""" % __d['repeat_msg'])
        append('\012  <hr>\012')
    append('\012\012')
    if __d['entries']:
        append('\012\012    ')
        if __d['indirect']:
            append('\012      <P>\012      You can invoke this search using the following URL.\012      Why not save it for future reference?\012      <p align=center><code>http://')
            append("""%s""" % __d['servername'])
            append('/cgi-bin/event-search?')
            append("""%s""" % __d['query_string'])
            append('</code></p>\012    ')
        append('\012    \012    <p>\012    Click <img src="http://www.musi-cal.com/icons/vdb10.gif" alt="[VDB]"> to look up venue information.<br>\012    ')
        if __d['next']:
            append('\012    <font size="-1">\012      <a href="')
            append("""%s""" % __d['next'])
            append('">[Next Page]</a>\012    </font>\012    ')
        append('\012    ')
        if __d['prev']:
            append('\012    <font size="-1">\012      <a href="')
            append("""%s""" % __d['prev'])
            append('">[Previous Page]</a><br>\012    </font>\012    ')
        else:
            append('\012      ')
            if __d['next']:
                append('<br>')
            append('\012    ')
        append('\012\012    <TABLE border=0 cellspacing="5" cellpadding="0" width="100%">\012\012    ')
        __i_entries = -1
        for __elt_entries in __d['entries']:
            __i_entries = __i_entries + 1
            append('\012\012\011<tr valign=middle>\012\012\011')
            if __d['editor']:
                append('\012\011  <TD><A href="http://')
                append("""%s""" % __d['servername'])
                append('/cgi-bin/edit-entry?editor=')
                try: append(urllib.quote(__d['editor']))
                except: append(__d['editor'].urllib.quote())
                append('&key=')
                append("""%s""" % __elt_entries['key'])
                append('">[Edit]</A><BR><A href="http://')
                append("""%s""" % __d['servername'])
                append('/cgi-bin/edit-entry?editor=')
                try: append(urllib.quote(__d['editor']))
                except: append(__d['editor'].urllib.quote())
                append('&key=')
                append("""%s""" % __elt_entries['key'])
                append('&zap=1">[Delete]</A>\012\011  <br><font size="-1">Submitter: ')
                try: append(get_email(__elt_entries))
                except: append(__elt_entries.get_email())
                append('</font>\012\011')
            else:
                append('\012\011  ')
                if __d['show_vcal']:
                    append('<TD><A href="http://')
                    append("""%s""" % __d['servername'])
                    append('/cgi-bin/gencal?type=vCalendar&key=')
                    append("""%s""" % __elt_entries['key'])
                    append('"><IMG SRC="/images/vcal.gif" ALIGN="top" ALT="vCal" width="16" height="16" border="0"></A>')
                append('\012\011')
            append('\012\012    ')
            try: append(html(__elt_entries))
            except: append(__elt_entries.html())
            append('\012\012    ')
            if (__i_entries == 0):
                append('\012\011')
                if __d['display_static']:
                    append('\012\011<td width="25%" valign=top rowspan=')
                    append("""%s""" % len(__d['entries']))
                    append('>\012\011    ')
                    if __d['static_info']:
                        append('\012\011\011  <table width="100%" cellpadding=2>\012\011\011  <tr><th><font size="+1"><a name="staticinfo">Notes Index</a></font></th></tr>\012\011\011  </table>\012\012\011\011')
                        if __d['stperformers']:
                            append('\012\011\011  <table width="100%" cellpadding=2>\012\011\011  <tr><th><a name="stperf">Performers</a></th></tr>\012\011\011  <tr><td textcolor="black" bgcolor="white">\012\011\011    <font size="-1">\012\011\011    ')
                            append("""%s""" % __d['stperformers'])
                            append('\012\011\011    </font>\012\011\011    </td></tr>\012\011\011  </table>\012\011\011')
                        append('\012\012\011\011')
                        if __d['stcontributors']:
                            append('\012\011\011  <table width="100%" cellpadding=2>\012\011\011  <tr><th>Contributors</th></tr>\012\011\011  <tr><td textcolor="black" bgcolor="white">\012\011\011    <font size="-1">\012\011\011    ')
                            append("""%s""" % __d['stcontributors'])
                            append('\012\011\011    </font>\012\011\011    </td></tr>\012\011\011  </table>\012\011\011')
                        append('\012\012\011\011')
                        if __d['stvenues']:
                            append('\012\011\011  <table width="100%" cellpadding=2>\012\011\011  <tr><th><a name="stven">Venues</a></th></tr>\012\011\011  <tr><td textcolor="black" bgcolor="white">\012\011\011    <font size="-1">\012\011\011    ')
                            append("""%s""" % __d['stvenues'])
                            append('\012\011\011    </font>\012\011\011    </td></tr>\012\011\011  </table>\012\011\011')
                        append('\012\012\011\011')
                        if __d['stevents']:
                            append('\012\011\011  <table width="100%" cellpadding=2>\012\011\011  <tr><th><a name="stev">Events</a></th></tr>\012\011\011  <tr><td textcolor="black" bgcolor="white">\012\011\011    <font size="-1">\012\011\011    ')
                            append("""%s""" % __d['stevents'])
                            append('\012\011\011    </font>\012\011\011    </td></tr>\012\011\011  </table>\012\011\011')
                        append('\012\012\011\011')
                        if __d['stcities']:
                            append('\012\011\011  <table width="100%" cellpadding=2>\012\011\011  <tr><th><a name="stcity">Cities</a></th></tr>\012\011\011  <tr><td textcolor="black" bgcolor="white">\012\011\011    <font size="-1">\012\011\011    ')
                            append("""%s""" % __d['stcities'])
                            append('\012\011\011    </font>\012\011\011    </td></tr>\012\011\011  </table>\012\011\011')
                        append('\012\012\011\011<font size="-1">You can <strong><a\012\011\011href="http://www.musi-cal.com/addstatic.shtml">add</a></strong> or\012\011\011<strong><a href="http://www.musi-cal.com/editstatic.shtml">edit</a></strong>\012\011\011Notes Index items too!</font>\012\012\011    ')
                    append('\012\011    <p>\012\011    <a href="http://www.musi-cal.com/cgi-bin/redirect/http%3a//www.amazon.com/exec/obidos/redirect-home/musical">\012\011    <img src="http://www.musi-cal.com/icons/amzn-assoc.gif" border="0" alt="In Association with Amazon.com"></a> \012\011</td>\012\011')
                append('\012    ')
            append('\012\012    ')
        append('\012\012    </TABLE>\012\012    ')
        if __d['next']:
            append('\012    <font size="-1">\012      <a href="')
            append("""%s""" % __d['next'])
            append('">[Next Page]</a>\012    </font>\012    ')
        append('\012    ')
        if __d['prev']:
            append('\012    <font size="-1">\012      <a href="')
            append("""%s""" % __d['prev'])
            append('">[Previous Page]</a><br>\012    </font>\012    ')
        else:
            append('\012      ')
            if __d['next']:
                append('<br>')
            append('\012    ')
        append('\012\012    <P>Check with venues before making plans to attend listed events.\012\012    <P>If you notice any errors, <A\012    href="mailto:concertmaster@musi-cal.com">please let us know</A> so we can\012    correct them.\012\012')
    else:
        append('\012\012    ')
        if __d['searchbyperformer']:
            append('\012      <p>We were not able to find any matches to your search.  It\'s possible\012      the artist you were searching for is not on tour or that we don\'t\012      currently have their dates in our database. You might also try <a\012      href="http://')
            append("""%s""" % __d['servername'])
            append('/cgi-bin/list-perf">browsing by performer</a>\012      in case you\'ve misspelled their name.\012      If you know of a possible source for their concert schedule,\012      <a href="mailto:concertmaster@musi-cal.com">let us know</a>.\012\012      ')
            if __d['stperformers']:
                append('\012\011<table width="100%" cellpadding=2>\012\011<tr><th><a name="stperf">Performers</a></th></tr>\012\011<tr><td textcolor="black" bgcolor="white">\012\011  <font size="-1">\012\011  ')
                append("""%s""" % __d['stperformers'])
                append('\012\011  </font>\012\011  </td></tr>\012\011</table>\012     ')
            append('\012\012   ')
        else:
            append('\012      ')
            if __d['searchbyvenue']:
                append('\012\011<p>We were not able to find any matches to your search.  Venue\012\011names can often spelled several different ways.  You might try\012\011searching for just a single word in the venue\'s name, such as\012\011<strong>Passim</strong> instead of <strong>Club Passim</strong>.\012\011Also, some venues, such as church-sponsored coffeehouses, may\012\011be listed either under the name of the coffeehouse or the name\012\011of the hosting venue.\012\011We may also not have their schedule in our database.\012\011If you know of a possible source for their concert schedule,\012\011<a href="mailto:concertmaster@musi-cal.com">let us know</a>.\012      ')
            else:
                append('\012\011')
                if __d['searchbyevent']:
                    append('\012\011  <p>We were not able to find any matches to your search.\012\011  If you know of a possible source for event schedules,\012\011  <a href="mailto:concertmaster@musi-cal.com">let us know</a>.\012\011')
                else:
                    append('\012\011  ')
                    if __d['searchbylocation']:
                        append('\012      <p>We were not able to find any matches to your search.\012      Please check that the city, state and country names are\012      spelled or abbreviated properly.  You might also try <a\012      href="http://')
                        append("""%s""" % __d['servername'])
                        append('/index-geog.shtml">browsing by location</a>\012      in case you misspelled the name. If you know of a\012      possible source of concert schedules for this region, <a\012      href="mailto:concertmaster@musi-cal.com">let us know</a>.\012\011  ')
                    else:
                        append('\012      <p>We were not able to find any matches to your search.\012      If you know of a possible source of concert schedules\012      that would fill a hole in our database, <a\012      href="mailto:concertmaster@musi-cal.com">let us know</a>.\012\011  ')
                    append('\012\011')
                append('\012      ')
            append('\012    ')
        append('\012\012')
    append('\012\012<P ALIGN="center"><CENTER>\012<p>\012<FORM METHOD="GET" ACTION="http://')
    append("""%s""" % __d['servername'])
    append('/cgi-bin/simple-search">\012<input type=radio name="key" value="performer">performer\012<input type=radio checked name="key" value="city">city\012<input type=radio name="key" value="venue">venue\012<input type=radio name="key" value="event">event<br>\012Search for: <INPUT TYPE="text" NAME="value">\012<input type="submit" value="OK">\012<br>')
    append("""%s""" % __d['survey'])
    append('\012</FORM></CENTER>\012\012\012<p align=center>\012<strong><font size="-1"><a href="/index-contribute.shtml"\012onMouseOver="window.status=\'Add events or Notes Index entries to the database\'; return true">Contribute</a> | <a href="/index-search.shtml"\012onMouseOver="window.status=\'Search the database in several ways\'; return true">Search</a> | <a href="/index-sponsor.shtml"\012onMouseOver="window.status=\'Advertise in Musi-Cal!\'; return true">Advertise</a> | <a href="/index-contents.shtml"\012onMouseOver="window.status=\'Let your mousey to the walking...\';return true">Contents</a> | <a href="/index-learn.shtml"\012onMouseOver="window.status=\'Stuck? Get help here...\'; return true">Help</a> | <a href="/Faq.shtml"\012onMouseOver="window.status=\'You got questions? We got answers...\'; return true">FAQ</a></font></strong>\012\012\012<p align=center>\012<A HREF="http://www.musi-cal.com/"><IMG border=0 WIDTH=23 HEIGHT=33 SRC="http://www.musi-cal.com/icons/not!
e.gif" ALT="[Musi-Cal Home Page]"></A>\012\012<font size="-1"><a href="mailto:concertmaster@musi-cal.com">Contact Us!</a><br><em>Copyright &#169; 1994-1999, Automatrix, Inc.</em>\012</font>\012\012\012\012</BODY>\012</HTML>\012')
    return string.join(output, '')


Follow-Ups:

Date: 2000-Oct-05 07:54
By: gvanrossum

Comment:
After 2.0 final is released.
-------------------------------------------------------

Date: 2000-Oct-05 15:26
By: tim_one

Comment:
Changed Category to Library, Summary to get "reindent" out of it, Group to Platform-specific, assigned to /F, and raised the Priority.

Skip, tell us which platform you're using.  It doesn't fail on Windows.  Also confirm that this is purely a problem with tokenize.py (e.g., does your test program also fail when run thru tabnanny.py, and/or checkappend.py?).

Since it doesn't fail for me, I can't whittle it down.  Strongly *suspect* it's due to your long string literals.  tokenize.py uses naive regexps for matching string literals, with bad performance characteristics.  If I were on a platform that failed, I'd

a) First whittle this down to a one-liner (pick on the line with the longest string literal).

b) Replace tokenize.py's string-literal regexps with the ones IDLE uses (which worked fine even on multimegabyte string literals, & much faster, in pre).

That wouldn't solve the underlying recursion limit problem, but would make it much harder to provoke.
-------------------------------------------------------

Date: 2000-Oct-05 18:45
By: montanaro

Comment:
Platform: Mandrake Linux 7.1
Python: latest CVS sources
One liner that fails is at

    http://www.musi-cal.com/~skip/srelimit.py

That fails when run through tabnanny or reindent

I can't see how to replace the tokenize string re's with IDLE's.  reindent and tabnanny both barf on

    pseudomatch = pseudoprog.match(line, pos)

As far as I can tell, PseudoToken (which is the re compiled into pseudoprog) doesn't appear to have much to do with matching strings other than recognizing the start of a triple-quoted string, which the problematic source file doesn't contain.



-------------------------------------------------------

Date: 2000-Oct-05 22:40
By: tim_one

Comment:
Thanks, Skip!  That confirms everything I suspected.

As to the connection to strings, pseudoprog *does* try to match "the first line" of single-quoted strings (via ContStr), which in your case is your entire 1200+ character single-quoted string literal.  That's where, e.g., the

([rR]?'(\\.|[^\n'\\])*('|\\\r?\n)

alternative in your pseudoprog dump comes from, and that's a terrible (for pragmatic reasons) regexp for matching long strings.  The IDLE-like alternative would be (re.VERBOSE)

(
[rR]?
'
[^\n'\\]*
(?: \\. [^\n'\\]*)*
(?: ' | \\ \r? \n)
)

This is just a vanilla instance of Friedl's

normal* (special normal*)*

"regexp unrolling", and saves your bacon for reasons spelled out in detail in his book.

-------------------------------------------------------

Date: 2000-Oct-06 22:13
By: tim_one

Comment:
Skip, please update and try this again.  I replaced tokenize.py's string regexps, and with a bit of luck that will sidestep your problem and we can get this bug off /F's back.
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=116136&group_id=5470