Terrible regular expression performance
Kevin Smith
Kevin.Smith at sas.com
Wed Nov 1 13:18:54 EST 2000
I am trying to do two very simple substitutions in an RTF file that is just
under 1MB. The substitutions are as follows:
rtf = re.sub(r'(\s*\n\s*)+', r'', rtf)
rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
All this does is first remove all newlines (along with the whitespace around
them), then put a newline before every occurrence of '{', '}', or '\\'.
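For reference, here is a sketch of the same substitutions with the outer '+'
dropped from the first pattern. Since the replacement is empty, a single greedy
r'\s*\n\s*' removes the same text as r'(\s*\n\s*)+', and avoiding the nested
quantifier may sidestep whatever is blowing up; the collapse_rtf name is mine:

```python
import re

def collapse_rtf(rtf):
    # Same effect as r'(\s*\n\s*)+' -> '': each greedy \s*\n\s*
    # match swallows a whole run of newlines plus surrounding spaces.
    rtf = re.sub(r'\s*\n\s*', '', rtf)
    # Put a newline before every '{', '}' or '\'.
    rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
    return rtf
```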
Python cannot successfully complete this operation: it eats up all of the
memory in the machine (>128MB) until the program crashes. I modified the
"re" module's subn routine to accumulate the result in a string rather than
a list. That helped: the program finished in about 15 minutes and used just
under 7MB of memory.
The same substitutions in Perl ran in just a few seconds and a few MB of
memory. That is a dramatic difference, and rather disturbing; I've never seen
Perl outdo Python by this margin. Is there any way to improve Python's
performance here? And what is the benefit of using a list rather than a
string to accumulate the result in the "re" module?
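(My guess at the rationale, for what it's worth: appending fragments to a list
and joining once at the end is linear in the output size, whereas repeatedly
concatenating onto a string can copy the whole result built so far on each
step, which is quadratic in the worst case. A minimal sketch of the two:)

```python
# Linear: collect fragments in a list, join once at the end.
parts = []
for chunk in ['{', '\\rtf1', '}']:
    parts.append(chunk)
result = ''.join(parts)

# Quadratic in the worst case: each concatenation may copy the
# entire result built so far before appending the new chunk.
result2 = ''
for chunk in ['{', '\\rtf1', '}']:
    result2 = result2 + chunk
```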
Kevin Smith
Kevin.Smith at sas.com