Terrible regular expression performance
Kevin Smith
Kevin.Smith at sas.com
Wed Nov 1 13:18:54 EST 2000
I am trying to do two very simple substitutions in an RTF file that is just
under 1MB. The substitutions are as follows:
rtf = re.sub(r'(\s*\n\s*)+', r'', rtf)
rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
All this does is first remove all newlines (along with the whitespace around
them), then put a newline before every occurrence of '{', '}', or '\\'.
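For reference, here is a sketch of the same substitutions with the outer '+'
dropped from the first pattern. Since the replacement is empty, a single greedy
r'\s*\n\s*' removes the same text as r'(\s*\n\s*)+', and avoiding the nested
quantifier may sidestep whatever is blowing up; the collapse_rtf name is mine:

```python
import re

def collapse_rtf(rtf):
    # Same effect as r'(\s*\n\s*)+' -> '': each greedy \s*\n\s*
    # match swallows a whole run of newlines plus surrounding spaces.
    rtf = re.sub(r'\s*\n\s*', '', rtf)
    # Put a newline before every '{', '}' or '\'.
    rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
    return rtf
```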
Python cannot successfully complete this operation: it eats up all of the
memory in the machine (>128MB) until the program crashes. I modified the
"re" module's subn routine to accumulate the result in a string rather than
a list. That helped: the program finished in about 15 minutes and used just
under 7MB of memory.
The same substitutions in Perl ran in just a few seconds and a few MB of
memory. That is a dramatic difference, and rather disturbing; I've never seen
Perl outdo Python by this margin. Is there any way to improve Python's
performance here? And what is the benefit of using a list rather than a
string to accumulate the result in the "re" module?
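(My guess at the rationale, for what it's worth: appending fragments to a list
and joining once at the end is linear in the output size, whereas repeatedly
concatenating onto a string can copy the whole result built so far on each
step, which is quadratic in the worst case. A minimal sketch of the two:)

```python
# Linear: collect fragments in a list, join once at the end.
parts = []
for chunk in ['{', '\\rtf1', '}']:
    parts.append(chunk)
result = ''.join(parts)

# Quadratic in the worst case: each concatenation may copy the
# entire result built so far before appending the new chunk.
result2 = ''
for chunk in ['{', '\\rtf1', '}']:
    result2 = result2 + chunk
```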
Kevin Smith
Kevin.Smith at sas.com