Terrible regular expression performance

Robert Roy rjroy at takingcontrol.com
Sat Nov 4 09:49:03 EST 2000


On Wed, 01 Nov 2000 18:18:54 GMT, Kevin.Smith at sas.com (Kevin Smith)
wrote:

>I am trying to do a very simple substitution in an RTF file that is just under 
>1MB.  The substitution is as follows:
>
>rtf = re.sub(r'(\s*\n\s*)+', r'', rtf)
>rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
>
>All this does is first remove all newlines (along with any surrounding 
>whitespace), then replace any occurrence of '{', '}', or '\\' with a newline 
>followed by that character.
>
>Python cannot successfully complete this operation.  It just eats up all of 
>the memory in the machine (>128MB) until the program crashes.  I modified the 
>"re" module's subn routine to use a string rather than a list to store the 
>results string.  This helped.  The program finished in about 15 minutes and 
>used just under 7MB of memory.
>
>Implementing the same substitution in Perl ran in just a few seconds and a few 
>MB of memory.  This is a dramatic difference and is rather disturbing.  I've 
>never seen Perl outdo Python by this margin.  Is there any way to improve 
>Python's performance here?  And, what is the benefit of using a list rather 
>than a string to store the results string in the "re" module? 
>

See the essay "An Optimization Anecdote" at
http://www.python.org/doc/essays/ 
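The short version of the essay's point, as a sketch (the function names below
are my own, not from the thread): repeated string concatenation copies the
growing result on every append, which is quadratic in the total length, while
collecting fragments in a list and joining once at the end is linear. That is
why re's subn accumulates its result in a list. Precompiling the two patterns
is also a small win when the same substitution runs more than once:

```python
import re

# The two substitutions from the original post, with precompiled patterns.
collapse_ws = re.compile(r'(\s*\n\s*)+')
delims = re.compile(r'([{}\\])')

def split_rtf(rtf):
    rtf = collapse_ws.sub('', rtf)      # drop newlines and surrounding whitespace
    return delims.sub(r'\n\1', rtf)     # newline before each '{', '}', '\'

# The essay's idiom: build the result as a list of fragments, join once.
# ''.join is linear in the total output length; s = s + piece in a loop
# re-copies the accumulated string each time and degrades to O(n^2).
def join_fragments(fragments):
    return ''.join(fragments)

print(split_rtf('a{b}\nc\\d'))
```
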

>Kevin Smith
>Kevin.Smith at sas.com




More information about the Python-list mailing list