Bottleneck? More efficient regular expression?

Andrew Dalke adalke at mindspring.com
Thu Sep 25 22:20:43 EDT 2003


Tina Li:
> The lag is *perceivable* (this is what I meant; sorry) by a human user so
> it's slower.

Yup, that's what I meant.  Too many people make theoretical
arguments for why to choose one (complicated) approach
over a simpler one on the basis of performance, when it turns
out performance isn't the issue.  My appreciation goes out to you
for doing it the right way.

You may also want to look at pyRXP from ReportLab.  It's a
very fast XML parser for Python.  However, there seem to be
some drastic problems with their site right now -- links on
reportlab.com fail and reportlab.org goes to pair.com's
placeholder page.

> I in fact tried that before but the over-limit error still happened. So it's
> not just the non-greedy .*? that's causing the problem. Hmm.

No, I don't think the .*? is the culprit.  The stack space grows
by one for each ambiguity, and the .*? should only produce one.
Usually there's a stack problem only if you have an ambiguity
or an empty match inside a repeat, and I didn't see that in your
pattern.
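
To make the distinction concrete, here is a minimal sketch of the two
pattern shapes.  The tag name, the input text, and both patterns are
made up for illustration; they are not taken from your code.  As I
understand it, the second shape (a choice inside a repeat) is the kind
that could exhaust the stack on long input in a recursive engine.  A
non-recursive engine handles both, so the sketch only shows the shapes
and does not reproduce the error.

```python
import re

# Hypothetical input: one tag pair around a long run of characters.
body = "x" * 1000
text = "<a>" + body + "</a>"

# A single non-greedy wildcard that is not nested inside another repeat.
simple = re.compile(r"<a>(.*?)</a>", re.DOTALL)

# An alternation inside a repeat: every iteration is a choice point
# the engine may have to come back to.
nested = re.compile(r"<a>((?:x|y)*)</a>")

simple_body = simple.match(text).group(1)
nested_body = nested.match(text).group(1)
```

Both patterns capture the same text here; the difference is only in
how much bookkeeping the engine does along the way.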

If you get really interested in tracking this down, you might look
around for some of the GUI regexp debugging tools.  There's
one in ActiveState's product, as I recall.  Err, but it's based on
Perl's regexp parser and won't handle Python's (?P<name>...)
named groups.

(I do have an experimental pure-Python regexp engine that
I would offer for debugging, but it doesn't handle .*? yet and
needs a rewrite before it does.)

> It only handles tags without space because all tags are
> guaranteed to be generated without space.

Sure.  All I was saying was that if you're going to code for
a specific layout then you don't need to be as general.

You might even consider using "([^\n]*\n){5}" if you just
want to skip 5 lines.
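
A minimal sketch of that idea, using made-up sample text.  Note the
group around the line pattern, so that {5} repeats the whole line and
not just the newline; I've made it non-capturing, which is my choice
and not a requirement.

```python
import re

# Five repetitions of "anything up to a newline, then the newline".
skip_five = re.compile(r"(?:[^\n]*\n){5}")

# Hypothetical sample: seven short lines.
text = "a\nb\nc\nd\ne\nf\ng\n"

m = skip_five.match(text)
rest = text[m.end():]    # whatever follows the first five lines

# With fewer than five complete lines there is no match at all.
too_short = skip_five.match("a\nb\n")
```

Since every line is consumed by a negated character class, there is
only one way for the pattern to match, which keeps backtracking down.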

                    Andrew
                    dalke at dalkescientific.com
P.S.
  If you are doing anything open-sourceish, or using
open source in bioinformatics, structural biology, and
related fields, and will be at ISMB in Edinburgh next
year, you might consider attending the Bioinformatics
Open Source Conference.





