doing hundreds of re.subs efficiently on large strings

Thu Mar 27 06:33:11 EST 2003

Anders J. Munch wrote:

> "nihilo" <exnihilo at NOmyrealCAPSbox.com> wrote:
>> How do I use finditer to find the unmatched parts also (the stuff
>> between the matches)? It looks to me like it just gives you access to
>> the stuff that is matched, but maybe I'm missing something.
> 
> Look at start() and end().
> 
>>>> s = 'hi nihilo'
>>>> for m in re.finditer('(n)|(o)', s):
> ...     print m.start(), m.end()
> ...
> 3 4
> 8 9
>>>> s[:3], s[3:4], s[4:8], s[8:9], s[9:]
> ('hi ', 'n', 'ihil', 'o', '')
>>>> 

Yeah, but having to keep track of "one after the latest .end" in
the loop to accumulate the fragments is fussy and may be a bit
error-prone.  I'm not sure performance advantages justify that.
Let's check -- performance should ALWAYS be measured, NEVER
guessed at...!!!  Here's a module ms.py with both approaches:

import re

# simulate a copy of dozen needed substitutions
from string import lowercase
uppercase = lowercase.upper()
# say allres is a list of RE's (without groups!), and
# allsub a parallel list of strings to substitute for
# each corresponding RE when it matches
allres = list(lowercase)
allsub = ['(%s!)'%c for c in allres]

# a biggish string, about 10K char, on which to do substitutions
bigstr = (lowercase+uppercase) * 200

# join the many REs (which have no groups) into one big RE
bigre = re.compile( '(' + ')|('.join(allres) + ')' )

def sub_1(bigre=bigre, bigstr=bigstr, allsub=allsub):
    "substitutions with delicate, tricky approach"
    pieces = []
    prev = 0
    for m in bigre.finditer(bigstr):
        pieces.append(bigstr[prev:m.start()])
        pieces.append(allsub[m.lastindex-1])
        prev = m.end()
    pieces.append(bigstr[prev:])
    return ''.join(pieces)

def sub_2(bigre=bigre, bigstr=bigstr, allsub=allsub):
    "substitutions with simple, straightforward approach"
    def onesub(m):
        return allsub[m.lastindex-1]
    return bigre.sub(onesub,bigstr)

assert sub_1() == sub_2()

And the winner is...:

[alex at lancelot alex]$ python timeit.py -s 'import ms' 'ms.sub_1()'
10 loops, best of 3: 1.28e+05 usec per loop
[alex at lancelot alex]$ python timeit.py -s 'import ms' 'ms.sub_2()'
10 loops, best of 3: 9.92e+04 usec per loop

...the simple, straightforward approach, with 99 milliseconds per
loop, versus 128 milliseconds per loop for the fussy, tricky one!

Don't guess -- MEASURE...!!!

Alex