Inexplicable timings

MRAB google at mrabarnett.plus.com
Sat Aug 2 06:54:58 EDT 2008


On Aug 2, 12:43 am, MRAB <goo... at mrabarnett.plus.com> wrote:
> I'm looking at the implementation of regular expressions in Python and
> wrote a script to test my changes. This is the script:
>
> import re
> import time
>
> base = "abc"
> final = "d"
>
> for n in [100, 1000, 10000]:
>     for f in [final, ""]:
>         for r in ["+", "+?"]:
>             pattern = "(?:%s)%s%s" % (base, r, final)
>             text_format = "(?:%s){%d}%s" % (base, n, f)
>             regex = re.compile(pattern)
>             text = base * n + f
>             start_time = time.clock()
>             found = regex.search(text)
>             finish_time = time.clock()
>             duration = finish_time - start_time
>             print "%s\t:: %s\t:: %.3f" % (text_format, pattern,
> duration)
>
> I've found some puzzling behaviour that I can't explain. If I delete
> sre_parse.pyc or sre_compile.pyc before running the script I get
> timings like this:
>
> (?:abc){100}d    :: (?:abc)+d    :: 0.000
> (?:abc){100}d    :: (?:abc)+?d    :: 0.000
> (?:abc){100}    :: (?:abc)+d    :: 0.000
> (?:abc){100}    :: (?:abc)+?d    :: 0.000
> (?:abc){1000}d    :: (?:abc)+d    :: 0.000
> (?:abc){1000}d    :: (?:abc)+?d    :: 0.000
> (?:abc){1000}    :: (?:abc)+d    :: 0.022
> (?:abc){1000}    :: (?:abc)+?d    :: 0.020
> (?:abc){10000}d    :: (?:abc)+d    :: 0.001
> (?:abc){10000}d    :: (?:abc)+?d    :: 0.000
> (?:abc){10000}    :: (?:abc)+d    :: 1.961
> (?:abc){10000}    :: (?:abc)+?d    :: 1.756
>
> Subsequent runs give timings like this:
>
> (?:abc){100}d    :: (?:abc)+d    :: 0.000
> (?:abc){100}d    :: (?:abc)+?d    :: 0.000
> (?:abc){100}    :: (?:abc)+d    :: 0.000
> (?:abc){100}    :: (?:abc)+?d    :: 0.000
> (?:abc){1000}d    :: (?:abc)+d    :: 0.000
> (?:abc){1000}d    :: (?:abc)+?d    :: 0.000
> (?:abc){1000}    :: (?:abc)+d    :: 0.020
> (?:abc){1000}    :: (?:abc)+?d    :: 0.020
> (?:abc){10000}d    :: (?:abc)+d    :: 0.001
> (?:abc){10000}d    :: (?:abc)+?d    :: 0.000
> (?:abc){10000}    :: (?:abc)+d    :: 3.953
> (?:abc){10000}    :: (?:abc)+?d    :: 1.763
>
> The second-from-last takes twice as long!
>
> This is repeatable.
>
> Python 2.5.2, on the other hand, gives timings like this:
>
> (?:abc){100}d    :: (?:abc)+d    :: 0.000
> (?:abc){100}d    :: (?:abc)+?d    :: 0.000
> (?:abc){100}    :: (?:abc)+d    :: 0.000
> (?:abc){100}    :: (?:abc)+?d    :: 0.000
> (?:abc){1000}d    :: (?:abc)+d    :: 0.000
> (?:abc){1000}d    :: (?:abc)+?d    :: 0.000
> (?:abc){1000}    :: (?:abc)+d    :: 0.044
> (?:abc){1000}    :: (?:abc)+?d    :: 0.027
> (?:abc){10000}d    :: (?:abc)+d    :: 0.002
> (?:abc){10000}d    :: (?:abc)+?d    :: 0.002
> (?:abc){10000}    :: (?:abc)+d    :: 4.547
> (?:abc){10000}    :: (?:abc)+?d    :: 2.858
>
> repeatably, irrespective of whether I delete the .pyc files first.
>
> Why is only one timing affected, and why by so much? Python quits
> after each run, so it can't be anything cumulative, surely. I've
> considered things like caching, for example compiling the .pyc files
> on the first run, but can't see why (or how) that would cause what I
> see.
>
> Can anyone suggest a factor that I might be missing? Any suggestions
> welcome!
>
I've found the cause. It's the memory allocation.

For future reference, I think what's happening is that when Python
generates the .pyc files it's using more memory (grabbing it from the
system) so that when my code needs to allocate Python can do it more
quickly; if Python didn't have to generate the .pyc files then it has
to grab more from the system, which takes longer. Or something like
that.

Anyway, it's fixed. :-)



More information about the Python-list mailing list