Regular Expressions - Python vs Perl

Ilpo Nyyssönen iny+news at iki.fi
Fri Apr 22 11:16:21 EDT 2005


Roy Smith <roy at panix.com> writes:

> iny+news at iki.fi (Ilpo Nyyssönen) wrote:
>> Of course it caches those when running. The point is that it needs to
>> recompile every time you have restarted the program. With short lived
>> command line programs this really can be a problem.
>
> Are you speculating that it might be a problem, or saying that you have 
> seen it be a problem in a real-life program?

Well, it depends, but I would say yes. I have a calendar app with a
command-line user interface. Typical use is "view, add, view, edit,
view, ...", and each of those is a separate command invocation. In that
case a second of startup time can be a long time. I did run the
profiler, and it showed the sre compiling to be the slowest part.
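
For anyone who wants to check this on their own program, here is a
minimal sketch of profiling a cold start. It assumes Python 3; the
patterns and the names PATTERNS, startup and profile_startup are
hypothetical stand-ins, not taken from the calendar app. re.purge()
empties the in-process cache, so each call behaves like a freshly
started program.

```python
import cProfile
import io
import pstats
import re

# Hypothetical patterns standing in for whatever a real program compiles at startup.
PATTERNS = [r"(?:%s)+\d{1,3}[a-z]*" % word for word in ("view", "add", "edit", "delete")]


def startup():
    """Simulate a fresh process: empty the re cache, then compile every pattern."""
    re.purge()
    return [re.compile(p) for p in PATTERNS]


def profile_startup():
    """Run startup() under cProfile and return (compiled patterns, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    compiled = startup()
    profiler.disable()

    out = io.StringIO()
    # Sort by cumulative time so the regex compilation call chain is listed first.
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return compiled, out.getvalue()


if __name__ == "__main__":
    _, report = profile_startup()
    print(report)
```

With only a handful of patterns the absolute time is tiny; the point of
the report is to show what share of startup goes to compilation.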

Nowadays I use libxml2-python as the XML parser, so the problem is
not so acute anymore. (It is just harder to get running with a Python
compiled from source outside the rpm system, and it is not so easy to
use through a DOM interface.)

> I just generated a bunch of moderately simple regexes from a dictionary 
> wordlist.  Looks something like:

[...]

> So, my guess is that unless you're compiling 100's of regexes each time you 
> start up, the one-time compilation costs are probably not significant.

Well, as I said, it did come out as the worst in the profiler when
using PyXML/xmlproc.

>> And yes, I have read the source of sre.py and I have made an ugly
>> module that digs the compiled data and pickles it to a file and then
>> in next startup it reads that file and puts the stuff back to the
>> cache.
>
> That's exactly what I would have done if I really needed to improve startup 
> speed.  In fact, I did something like that many moons ago, in a previous 
> life.  See R. Smith, "A finite state machine algorithm for finding 
> restriction sites and other pattern matching applications", CABIOS, Vol 4, 
> no. 4, 1988.  In that case, I had about 1200 patterns I was searching for 
> (and doing it on hardware running about 1% of the speed of my current 
> laptop).

The problem is that it is not so easy to get ALL of the regexps dumped
that way.

> BTW, why did you have to dig out the compiled data before pickling it?  
> Could you not have just pickled whatever re.compile() returned?

Because pickling a compiled pattern dumps only the original regexp and
then compiles it again when loading.
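
This is easy to see for yourself. A minimal sketch, assuming Python 3
and CPython's standard re module; the sample pattern and the helper
name roundtrip are arbitrary and only for illustration. The pickled
bytes contain the pattern source text, and unpickling hands that text
back to the regex compiler rather than restoring compiled state.

```python
import pickle
import re


def roundtrip(pattern_text, flags=0):
    """Pickle a compiled pattern and load it back.

    Returns the raw pickle bytes and the restored pattern object.
    """
    compiled = re.compile(pattern_text, flags)
    payload = pickle.dumps(compiled)
    restored = pickle.loads(payload)
    return payload, restored


if __name__ == "__main__":
    source = r"GAATTC|GGATCC"  # arbitrary sample pattern
    payload, restored = roundtrip(source, re.IGNORECASE)

    # The pattern source appears verbatim inside the pickle.
    print(source.encode("utf-8") in payload)
    # The restored object carries the same source and still matches.
    print(restored.pattern == source)
    print(restored.search("xxggatccxx") is not None)
```

So pickling whatever re.compile() returned saves nothing at startup;
the compilation cost is simply paid again at load time.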

-- 
Ilpo Nyyssönen # biny # /* :-) */


