doing hundreds of re.subs efficiently on large strings

Mon Apr 7 12:34:13 EDT 2003

On Mon, 07 Apr 2003 00:28:13 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:

>Bengt Richter wrote:
>> 
>> Thanks for testing, that is interesting. It would be interesting to see
>> the test code you ran, so we could see the reasons for .25 vs .43 (and
>> I could revise my reality-model as necessary ;-)
>> 
>> Regards,
>> Bengt Richter
>
>Hi Bengt,
>
>Here is some code that I created to test the two versions.
>
>Original:
>
[...]
>
>print time.clock() - start
>
>And the modified code is:
>
>import time, re
>start = time.clock()
>data = '' # same data
>dict = { 'US and Britain' : 'Oceania', 'US and British' : 'Oceanian', 
[...~ 135 items ...]

>pat = re.compile('(Canada|the U\.S\.A\.|air force|International Herald 
[... ditto ...]
>strings = pat.split(data)
>for i in xrange(1,len(strings),2):
>     strings[i] = dict[strings[i]]
>data = ''.join(strings)
>
>I just checked these again, and the results were .052 on average for the 
>first version, and .080 on average for the second. The second would 
>obviously be a lot faster if I pre-compiled the pattern and cPickled it, 
I had assumed pre-compiled regex, but I learned something anyway:
The split is a _lot_ slower than I would have thought. I had guessed that
it would have compiled to a fast state machine like flex might generate
for or-ed static string matches, but something else must be happening.
Maybe it was not worth special casing static ors.

The loop to substitute into the list plus joining the result is quite fast,
but the split is 2+ orders of magnitude slower than the latter, and the
regex compilation is somewhat faster than the latter, but not much, apparently.
Something like 175:150:1 for split:regexcomp:subst-in-list-using-dict-and-join
in a test I ran on a 63.5k html file with 100 matching substrings in the data.

>but I wanted to include compilation time in the test.  The fastest 
Why? Just curious -- would the patterns be changing as often as the data?

>version that I have tested is using Alex's replacement suggestion in 
What version of python are you running? Mine wouldn't compile the regex
in the form Alex suggested (with every '|' term separately parenthesized,
like '(x)|(y)' vs '(x|y)' ). Maybe I typoed when I typed the change, but
I got

      File "D:\python22\lib\sre_compile.py", line 442, in compile
        assert p.pattern.groups <= 100,\
    AssertionError: sorry, but this version only supports 100 named groups

I was hoping to compare regex speed using the two forms. Can't pursue it now though...

>combination with pre-compilation and cPickle, but the naive data = 
>data.replace is surprisingly fast.
>
Yes.
BTW, neither of the regex-based methods are equivalent in semantics
to the naive sequence of data.replace statements, because each data.replace
has the possibility (in general, even if maybe not in the actual example)
of matching something created by a previous data.replace.

Thanks for posting the code.

Regards,
Bengt Richter