Regular Expressions - Python vs Perl
Roy Smith
roy at panix.com
Fri Apr 22 08:37:51 EDT 2005
iny+news at iki.fi (Ilpo Nyyssönen) wrote:
> Of course it caches those when running. The point is that it needs to
> recompile every time you have restarted the program. With short lived
> command line programs this really can be a problem.
Are you speculating that it might be a problem, or saying that you have
seen it be a problem in a real-life program?
I just generated a bunch of moderately simple regexes from a dictionary
wordlist. Looks something like:
Roy-Smiths-Computer:play$ head exps
a.*a[0-9]{34}
a.*ah[0-9]{34}
a.*ahed[0-9]{34}
a.*ahing[0-9]{34}
a.*ahs[0-9]{34}
a.*al[0-9]{34}
a.*alii[0-9]{34}
a.*aliis[0-9]{34}
a.*als[0-9]{34}
a.*ardvark[0-9]{34}
Then I ran them through a little script that does:
import re
import sys
for exp in sys.stdin.readlines():
    regex = re.compile(exp)
and timed it for various numbers of lines. On my G4 Powerbook (1 GHz
PowerPC), I'm compiling about 1000 regexes per second:
Roy-Smiths-Computer:play$ time head -5000 < exps | ./regex.py
real 0m5.208s
user 0m4.690s
sys 0m0.090s
So, at roughly 1 ms per regex, my guess is that unless you're compiling
hundreds of regexes each time you start up, the one-time compilation costs
are probably not significant.
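The measurement above can be reproduced without a shell pipeline. What follows is only a minimal sketch: the helper name time_compiles and the ten-word list are hypothetical, and the way the patterns are built (first letter, ".*", rest of the word, then "[0-9]{34}") is a guess inferred from the sample output, not the original generator. re.purge() is called first so the re module's internal cache does not hide the compilation cost.

```python
import re
import time


def time_compiles(patterns):
    """Compile every pattern once; return elapsed wall-clock seconds."""
    re.purge()  # empty re's internal cache so each pattern is really compiled
    start = time.perf_counter()
    for pat in patterns:
        re.compile(pat)
    return time.perf_counter() - start


# Hypothetical stand-in for the dictionary wordlist.
words = ["aa", "aah", "aahed", "aahing", "aahs",
         "aal", "aalii", "aaliis", "aals", "aardvark"]

# Assumed construction: first letter, ".*", rest of the word, digit run.
patterns = [w[0] + ".*" + w[1:] + "[0-9]{34}" for w in words]

elapsed = time_compiles(patterns)
print(f"compiled {len(patterns)} patterns in {elapsed:.4f} s")
```

Timings will vary widely with hardware and Python version, so the absolute numbers are not comparable with the 2005 figures.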
> And yes, I have read the source of sre.py and I have made an ugly
> module that digs the compiled data and pickles it to a file and then
> in next startup it reads that file and puts the stuff back to the
> cache.
That's exactly what I would have done if I really needed to improve startup
speed. In fact, I did something like that many moons ago, in a previous
life. See R. Smith, "A finite state machine algorithm for finding
restriction sites and other pattern matching applications", CABIOS, Vol 4,
no. 4, 1988. In that case, I had about 1200 patterns I was searching for
(and doing it on hardware running about 1% of the speed of my current
laptop).
BTW, why did you have to dig out the compiled data before pickling it?
Could you not have just pickled whatever re.compile() returned?
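For reference, pickling the object returned by re.compile() looks like the sketch below. The pattern is taken from the sample list above; the variable names are illustrative only. As far as I can tell, CPython's re module registers a pickle reducer that stores just the pattern string and flags, so unpickling goes through the normal compile path again and this alone would probably not save any startup time, which may be why the compiled data had to be dug out by hand.

```python
import pickle
import re

# Compile one of the sample patterns and round-trip it through pickle.
compiled = re.compile(r"a.*ardvark[0-9]{34}")
blob = pickle.dumps(compiled)
restored = pickle.loads(blob)

# The restored object is an ordinary compiled pattern with the same
# source string and flags as the original.
same_pattern = restored.pattern == compiled.pattern
same_flags = restored.flags == compiled.flags
print(same_pattern, same_flags)
```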