RegExp performance?
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Sun Feb 25 04:26:33 EST 2007
On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <FreakCERS at gmail.com>
wrote:
> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"
Why the .* at the start and end? You don't want to match those, and they
make your regexp slow.
You didn't say exactly what an ISBN-10 number looks like, but if you want
to match 10 digits, or 9 digits followed by an X:
reISBN10 = re.compile(r"\d{10}|\d{9}X")
That is, just the () group in your expression. But perhaps this other one
is better (I think it should be faster, but you should measure it):
reISBN10 = re.compile(r"\d{9}[\dX]")
("Nine digits followed by another digit or an X")
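A minimal sketch of how one might measure it. The sample text and the two ISBN-like strings in it are made up for illustration, and the timings will vary by machine and Python version, so treat the numbers as a rough comparison only:

```python
import re
import timeit

# Hypothetical sample text: two ISBN-like strings per line, repeated 200 times.
sample = "first 0306406152 then 080442957X and some filler text\n" * 200

alternation = re.compile(r"\d{10}|\d{9}X")   # ten digits, or nine digits and an X
char_class = re.compile(r"\d{9}[\dX]")       # nine digits, then a digit or an X

found_alt = alternation.findall(sample)
found_cls = char_class.findall(sample)

# On this input both patterns report the same candidates.
print(len(found_alt), len(found_cls))

# Time each pattern over the same text.
for name, pattern in (("alternation", alternation), ("char class", char_class)):
    seconds = timeit.timeit(lambda: pattern.findall(sample), number=50)
    print(f"{name}: {seconds:.4f} s for 50 runs")
```

Since neither pattern contains a group, findall() returns the matched strings themselves.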
> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> loosing results, but it runs in about 0.3 seconds
Using my suggested expressions you might match some garbage, but you won't
lose anything (except two ISBN numbers joined together without any separator
in between). This assumes you have stripped all the "-", as you said.
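To illustrate the "garbage" case: a longer run of digits yields a false candidate with the simple pattern. The stricter variant below uses lookarounds to refuse digits on either side; it is an assumption added here for comparison, not something proposed in the thread, and the sample strings are invented:

```python
import re

loose = re.compile(r"\d{9}[\dX]")
# Stricter variant (an assumption, not from the thread):
# no digit may come right before or right after the candidate.
strict = re.compile(r"(?<!\d)\d{9}[\dX](?!\d)")

garbage = "order 12345678901234 shipped"      # a 14-digit number, not an ISBN
genuine = "see ISBN 080442957X for details"   # a standalone candidate

print(loose.findall(garbage))    # the first ten digits of the long number
print(strict.findall(garbage))   # nothing: the run is longer than ten digits
print(strict.findall(genuine))   # the standalone candidate is still found
```

The trade-off is that the strict form rejects any candidate embedded in a longer digit run, which includes two numbers run together with no separator.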
> So what's the deal? - why would it take so long to run the correct one?
> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...
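Those .* make the expression match a LOT of things at first, only to
discard them in the next step: the leading .* grabs the rest of the line,
and the engine then has to backtrack one character at a time until the
digits fit.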
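A small sketch of that effect on a single line (the line and its two ISBN-like strings are made up). Besides the extra work, the greedy .* swallows the earlier candidate, so only the last one on the line is captured:

```python
import re

# Hypothetical one-line input with two ISBN-like strings.
line = "a 0306406152 b 080442957X c"

wrapped = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")   # the original expression
bare = re.compile(r"\d{10}|\d{9}X")                  # just the group

m = wrapped.search(line)
print(repr(m.group(0)))        # the match spans the whole line
print(wrapped.findall(line))   # ['080442957X'], the first candidate is lost
print(bare.findall(line))      # ['0306406152', '080442957X']
```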
--
Gabriel Genellina