Regexp optimization question
Magnus Lie Hetland
mlh at furu.idi.ntnu.no
Thu Apr 22 16:27:15 EDT 2004
I'm working on a project (Atox) where I need to match quite a few
regular expressions (several hundred) in reasonably large text files.
I've found that this can easily get rather slow. (There are many
things that slow Atox down -- it hasn't been designed for speed, and
any optimizations will entail quite a bit of refactoring.)
I've tried to speed this up by using the same trick as SPARK, putting
all the regexps into a single or-group in a new regexp. That helped a
*lot* -- but now I have to find out which one of them matched at a
certain location. I haven't yet looked at the performance of the code
for checking this, because I ran into a problem before that: named
groups only work for about 100 patterns -- not a terrible problem,
since I can split them across several 100-group patterns -- but using
named groups slows down the matching *a lot*. As far as I could tell,
matching with named groups was actually slower than simply matching
the patterns one by one.
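For reference, the SPARK-style trick being described looks roughly like
this (a minimal sketch with toy patterns and names I made up -- Atox's
real pattern set is several hundred entries, and Python's regex engine
of that era capped a single pattern at 100 groups):

```python
import re

# Toy token patterns standing in for Atox's real ones (illustrative only).
patterns = {
    "number": r"\d+",
    "word":   r"[A-Za-z]+",
    "space":  r"\s+",
}

# Combine all alternatives into one or-group, each wrapped in a named
# group so the winning alternative can be identified after a match.
combined = re.compile(
    "|".join("(?P<%s>%s)" % (name, pat) for name, pat in patterns.items())
)

m = combined.match("42 apples")
print(m.lastgroup)  # name of the alternative that matched: "number"
```

With a real pattern set you would chunk the dictionary into batches of
under 100 groups and build one combined regexp per batch.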
So: what can I do? Is there any way to get more speed here, short of
implementing the matching code (i.e. the code right around the calls
to _re) in C or Pyrex? (I've tried using Psyco, but that didn't help;
I guess it might help if I implemented things differently...)
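One avenue worth trying before dropping to C: skip named groups
entirely and use plain capturing groups plus `m.lastindex`, which gives
the 1-based index of the alternative that matched. A minimal sketch
(toy patterns again, and assuming the individual patterns contain no
capturing groups of their own, since those would shift the numbering):

```python
import re

# Toy alternatives; a list preserves the index-to-pattern mapping.
patterns = [r"\d+", r"[A-Za-z]+", r"\s+"]

# Wrap each alternative in an anonymous capturing group. After a match,
# m.lastindex is the 1-based position of the alternative that fired,
# so no named groups are needed to identify it.
combined = re.compile("|".join("(%s)" % p for p in patterns))

m = combined.match("hello 42")
print(m.lastindex)  # 2 -- the word pattern matched first
```

Whether this avoids the named-group slowdown in practice would need
measuring; it at least sidesteps the group-name lookup machinery.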
Any ideas?
--
Magnus Lie Hetland "Oppression and harassment is a small price to pay
http://hetland.org to live in the land of the free." -- C. M. Burns