Regex Speed

Gabriel Genellina gagsl-py at yahoo.com.ar
Tue Feb 20 22:28:08 EST 2007


On Tue, 20 Feb 2007 21:40:40 -0300, <garrickp at gmail.com> wrote:

> My apologies. I don't have specifics right now, but it's something
> along the line of this:
>
> error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
>
> Yes, I know, these are not re expressions, but the requirements for
> the script specified that the error list be capable of accepting
> regular expressions, since these lists are configurable.

Can you relax that restriction? A regex is not always the best tool,
especially if you also want speed:

py> import timeit
py> line = "a sample line that will not match any condition, but long enough to be meaninful in the context of this problem, or at least I thik so. This has 174 characters, is it enough?"
py> timeit.Timer('if error_list.search(line): pass',
...   'import re;error_list=re.compile(r"error|miss|issing|inval|nvalid|math");from __main__ import line').repeat(number=10000)
[1.7704239587925394, 1.7289717746328725, 1.7057590543605246]
py> timeit.Timer('for token in tokens:\n\tif token in line: break\nelse: pass',
...   'from __main__ import line;tokens = "error|miss|issing|inval|nvalid|math".split("|")').repeat(number=10000)
[1.0268617863829661, 1.050040144755787, 1.0677314944409151]
py> timeit.Timer('if "error" in line or "miss" in line or "issing" in line or "inval" in line or "nvalid" in line or "math" in line: pass',
...   'from __main__ import line').repeat(number=10000)
[0.97102286155842066, 0.98341158348013913, 0.9651561957857222]
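
To re-run the comparison as a script instead of at the prompt, something
along these lines should work on a current Python 3. It is only a sketch:
the function names and run_benchmark() are made up for illustration, and
because each variant is wrapped in a function call, the absolute numbers
will not be comparable to the ones above.

```python
import re
import timeit

# Same 174-character, non-matching line as in the session above
# (typos kept on purpose so the length stays 174).
line = ("a sample line that will not match any condition, but long enough to "
        "be meaninful in the context of this problem, or at least I thik so. "
        "This has 174 characters, is it enough?")

pattern = "error|miss|issing|inval|nvalid|math"
error_list = re.compile(pattern)
tokens = pattern.split("|")


def with_regex(text):
    # One compiled alternation, searched once per line.
    return error_list.search(text) is not None


def with_loop(text):
    # Plain substring test for each configured token.
    for token in tokens:
        if token in text:
            return True
    return False


def with_hardcoded(text):
    # Tokens written out by hand; not configurable.
    return ("error" in text or "miss" in text or "issing" in text
            or "inval" in text or "nvalid" in text or "math" in text)


def run_benchmark(number=10000, repeat=3):
    # Hypothetical helper: returns {function name: list of timings}.
    results = {}
    for func in (with_regex, with_loop, with_hardcoded):
        timer = timeit.Timer(lambda: func(line))
        results[func.__name__] = timer.repeat(repeat=repeat, number=number)
    return results


if __name__ == "__main__":
    for name, timings in run_benchmark().items():
        print(name, min(timings))
```

Taking min() of the repeats is the usual way to read timeit output, since
the larger values mostly reflect other activity on the machine.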

The fastest way is hard-coding the tokens: if "error" in line or "miss" in
line or...
If that is not acceptable, iterate over a list of tokens: for token in
tokens: if token in line...
The regex is the slowest (about 1.7 s per 10000 searches here, against
about 1.0 s for the other two); a more carefully crafted regex is a bit
faster, but not enough:

py> timeit.Timer('if error_list.search(line): pass',
...   'import re;error_list=re.compile(r"error|m(?:iss(?:ing)|ath)|inval(?:id)");from __main__ import line').repeat(number=10000)
[1.3974029108719606, 1.4247005067123837, 1.4071600141470526]
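
One caveat on the factored pattern above: as written, (?:ing) and (?:id)
are required rather than optional, and the "issing" and "nvalid"
alternatives are gone, so it accepts fewer lines than the original list.
A quick check follows; the sample lines are made up, and the "equivalent"
pattern is only a sketch whose speed was not measured here.

```python
import re

original = re.compile(r"error|miss|issing|inval|nvalid|math")
# Pattern from the timing above: "ing" and "id" are mandatory here.
factored = re.compile(r"error|m(?:iss(?:ing)|ath)|inval(?:id)")
# Hypothetical variant that factors only the shared "m" prefix and keeps
# every original alternative, so search() succeeds on the same lines.
equivalent = re.compile(r"error|m(?:iss|ath)|issing|inval|nvalid")

# Made-up sample lines, not taken from any real log.
samples = [
    "1 packet missed",
    "Missing file",
    "inval arg",
    "Invalid option",
    "math error",
    "all good",
]


def hits(pattern, text):
    # True if the pattern matches anywhere in the text.
    return pattern.search(text) is not None


# Lines where the factored pattern disagrees with the original list.
disagreements = [s for s in samples if hits(original, s) != hits(factored, s)]

if __name__ == "__main__":
    for s in samples:
        print(repr(s), hits(original, s), hits(factored, s), hits(equivalent, s))
```

With these samples, the first four lines are found by the original list
but missed by the factored pattern.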

-- 
Gabriel Genellina



