Terrible regular expression performance
Robert Roy
rjroy at takingcontrol.com
Sat Nov 4 09:49:03 EST 2000
On Wed, 01 Nov 2000 18:18:54 GMT, Kevin.Smith at sas.com (Kevin Smith)
wrote:
>I am trying to do a very simple substitution in an RTF file that is just under
>1MB. The substitution is as follows:
>
>rtf = re.sub(r'(\s*\n\s*)+', r'', rtf)
>rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
>
>All this does is initially remove all newlines, then replace any occurrence of
>'{', '}', '\\' with a newline followed by that character.
>
>Python cannot successfully complete this operation. It just eats up all of
>the memory in the machine (>128MB) until the program crashes. I modified the
>"re" module's subn routine to use a string rather than a list to store the
>results string. This helped. The program finished in about 15 minutes and
>used just under 7MB of memory.
>
>Implementing the same substitution in Perl ran in just a few seconds and a few
>MB of memory. This is a dramatic difference and is rather disturbing. I've
>never seen Perl outdo Python by this margin. Is there any way to improve
>Python's performance here? And, what is the benefit of using a list rather
>than a string to store the results string in the "re" module?
>
See the essay "An Optimization Anecdote" at
http://www.python.org/doc/essays/
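For what it's worth, a sketch of one way to sidestep the slowdown (untested against your actual RTF file; the `rtf` value below is just a small sample string I made up). Since `\s` already matches `\n`, the greedy `\s*` in the pattern absorbs any run of adjacent newlines on its own, so the outer `(...)+` group is redundant, and nested quantifiers like that are exactly what forces the engine to do so much extra backtracking work:

```python
import re

# Small sample input standing in for the ~1MB RTF file.
rtf = "line one  \n   line two\n{group}\\control\n"

# r'\s*\n\s*' (no outer group, no '+') matches the same whitespace
# runs as r'(\s*\n\s*)+' in one pass, without nested quantifiers.
rtf = re.sub(r'\s*\n\s*', '', rtf)

# Then prefix each '{', '}', or '\' with a newline, as before.
rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
```

As for the list-versus-string question: accumulating result pieces in a list and joining them once at the end is amortized linear in the output size, whereas repeatedly concatenating onto a string copies the whole result each time and goes quadratic — that trade-off is exactly what the essay above walks through.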
>Kevin Smith
>Kevin.Smith at sas.com