Terrible regular expression performance

Fredrik Lundh fredrik at effbot.org
Wed Nov 1 13:55:08 EST 2000


Kevin Smith wrote:
> I am trying to do a very simple substitution in an RTF file that is just under
> 1MB.  The substitution is as follows:
>
> rtf = re.sub(r'(\s*\n\s*)+', r'', rtf)
> rtf = re.sub(r'([{}\\])', r'\n\1', rtf)
>
> All this does is initially remove all newlines, then replace any occurrence of
> '{', '}', '\\' with a newline followed by that character.
>
> Python cannot successfully complete this operation.  It just eats up all of
> the memory in the machine (>128MB) until the program crashes.

I cannot repeat this under 2.0, but it's definitely not as
fast as it could be (optimizing sre's "sub" implementation
is high on my todo list for 2.1).  On a slow machine, doing
this on an 800k file takes about 25 seconds (PRE is slightly
slower).

anyway, the culprit seems to be the second substitution.
on my box, the following brute-force solution is about 10
times faster:

    rtf = re.sub(r'(\s*\n\s*)+', r'', rtf)
    for ch in "{}\\":
        rtf = string.replace(rtf, ch, "\n" + ch)

you could also replace the first re.sub with

    # not 100% equivalent, but probably good enough
    rtf = string.join(map(string.strip, string.split(rtf, "\n")), "")

and do the whole thing in just over 0.1 seconds...
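for what it's worth, here's a sketch of that same two-step approach
written with string methods instead of the string module functions
(the methods replaced string.split/strip/join/replace in later Python
versions; the function name is just illustrative, and this mirrors the
"not 100% equivalent" strip-and-join shortcut, not the original regex):

```python
def split_rtf_tokens(rtf):
    # step 1: strip whitespace around line breaks and glue
    # everything into one long line (like the strip/split/join trick)
    rtf = "".join(line.strip() for line in rtf.split("\n"))
    # step 2: put a newline in front of each RTF delimiter,
    # one plain substring replacement per delimiter character
    for ch in "{}\\":
        rtf = rtf.replace(ch, "\n" + ch)
    return rtf
```

same idea, same brute-force replacements, no regex engine involved.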

> And, what is the benefit of using a list rather than a string
> to store the results string in the "re" module?

you mean:

    L = []
    for x in y:
        L.append(x)
    s = string.join(L, "")

vs.

    s = ""
    for x in y:
        s = s + x

?

it's a performance thing, really...  Python strings are immutable,
so "+" has to create a new string, and copy the two parts into it.
if you have more than two fragments, you'll end up copying the
same data over and over and over again; the total copying grows
quadratically with the number of fragments, while append/join
copies each fragment only a couple of times no matter how many
there are.
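here's a minimal sketch of the two patterns side by side (the
function names are just illustrative; later Pythons spell
string.join(L, "") as "".join(L)):

```python
def build_with_concat(fragments):
    # each "+" allocates a new string and recopies everything
    # accumulated so far, so n fragments mean O(n**2) copying
    s = ""
    for frag in fragments:
        s = s + frag
    return s

def build_with_join(fragments):
    # append is cheap, and join copies each fragment just once
    # into a result buffer sized up front
    parts = []
    for frag in fragments:
        parts.append(frag)
    return "".join(parts)
```

both produce the same string; only the amount of copying differs.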

</F>

More information about the Python-list mailing list