re.match() performance
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Thu Dec 18 20:48:59 EST 2008
On Thu, 18 Dec 2008 05:51:33 -0800, Emanuele D'Arrigo wrote:
> I've written the code below to test the differences in performance
...
> ## TIMED FUNCTIONS
> startTime = time.clock()
> for i in range(0, numberOfRuns):
>     re.match(pattern, longMessage)
> patternMatchingTime = time.clock() - startTime
...
You probably don't need to reinvent the wheel: see the timeit module. In
my opinion, the best idiom for timing small code snippets is:
from timeit import Timer
t = Timer("func(arg)", "from __main__ import func, arg")
time_taken = min(t.repeat(number=N))/N
where N will depend on how patient you are, but probably shouldn't be
less than 100. For small enough code snippets, the default of 1000000 is
recommended.
For testing re.match, I didn't have enough patience for one million
iterations, so I used ten thousand.
My results were:
>>> t1 = Timer("re.match(pattern, longMessage)",
... "from __main__ import pattern, re, compiledPattern, longMessage")
>>> t2 = Timer("compiledPattern.match(longMessage)",
... "from __main__ import pattern, re, compiledPattern, longMessage")
>>> t1.repeat(number=10000)
[3.8806509971618652, 3.4309241771697998, 4.2391560077667236]
>>> t2.repeat(number=10000)
[3.5613579750061035, 2.725193977355957, 2.936690092086792]
which were typical over a few runs. That suggests that even with no
effort made to defeat caching, using pre-compiled patterns is
approximately 20% faster than re.match(pattern).
However, over 100,000 iterations that advantage falls to about 10%. Given
that each run took about 30 seconds, I suspect that the results are being
contaminated with some other factor, e.g. networking events or other
processes running in the background. But whatever is going on, 10% or
20%, pre-compiled patterns are slightly faster even with caching --
assuming of course that you don't count the time taken to compile it in
the first place.
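The caching in question can be sketched as follows: re.match(pattern, s)
compiles pattern on first use and keeps the result in the module's
internal cache (which re.purge() empties), so later calls mostly pay a
cache lookup, while a pre-compiled pattern object skips even that. The
pattern and text below are invented for illustration:

```python
import re

pattern = r"(\w+) (\w+)"
text = "hello world"

compiled = re.compile(pattern)

# Both forms produce the same match; they differ only in where the
# compiled pattern object is stored and looked up.
m1 = re.match(pattern, text)   # goes through re's internal cache
m2 = compiled.match(text)      # uses the pattern object directly

print(m1.group(1), m2.group(2))  # -> hello world

# re.purge() clears the cache, so the next re.match() call must
# recompile the pattern; the pre-compiled object is unaffected.
re.purge()
```

This is why the gap between the two forms stays small in the timings
above: after the first iteration, re.match() is paying only the cache
lookup, not a full recompile.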
--
Steven