RegExp performance?

Sun Feb 25 04:26:33 EST 2007

On Sun, 25 Feb 2007 05:21:49 -0300, Christian Sonne <FreakCERS at gmail.com>
wrote:

> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"

Why the .* at the start and end? You don't want to match those, and they
make your regexp slow.
You didn't say exactly what an ISBN-10 number looks like, but if you want
to match 10 digits, or 9 digits followed by an X:
reISBN10 = re.compile(r"\d{10}|\d{9}X")
That is, just the () group in your expression. But perhaps this other one  
is better (I think it should be faster, but you should measure it):
reISBN10 = re.compile(r"\d{9}[\dX]")
("Nine digits followed by another digit or an X")
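A minimal sketch of how the two expressions could be compared, assuming
Python 3 and a made-up sample text (the numbers are placeholders, not real
ISBNs from your data). It checks that both patterns find the same things and
then times them with timeit; the names re_alternation, re_char_class and
sample are only for illustration.

```python
import re
import timeit

# Two ways to say "ten digits, or nine digits followed by an X".
re_alternation = re.compile(r"\d{10}|\d{9}X")
re_char_class = re.compile(r"\d{9}[\dX]")

# Made-up sample text, repeated to get something of a measurable size.
sample = "first 0306406152 then\nsecond 123456789X end\nno match: 12345\n" * 200

found_a = re_alternation.findall(sample)
found_b = re_char_class.findall(sample)
assert found_a == found_b  # both patterns should agree on this input

# Time each pattern; the absolute numbers depend on the machine.
for name, rx in [("alternation", re_alternation), ("char class", re_char_class)]:
    seconds = timeit.timeit(lambda: rx.findall(sample), number=100)
    print(f"{name}: {seconds:.4f}s for 100 runs")
```

Which of the two is faster may vary with the input and the Python version, so
it is worth running this against the real text.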

> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> loosing results, but it runs in about 0.3 seconds

Using my suggested expressions you might match some garbage, but not lose
anything (except two ISBN numbers joined together without any separator in
between). This assumes you have stripped all the "-", as you said.
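As an illustration of the "garbage" case, here is a small sketch with an
invented input: a longer run of digits that is not an ISBN still produces a
hit, because the pattern only asks for ten characters in a row. The text and
the variable names are made up for the example.

```python
import re

re_isbn10 = re.compile(r"\d{9}[\dX]")

# Invented text: a 15-digit number that is clearly not an ISBN-10.
text = "call 004512345678901 now"

# The pattern still matches the first ten digits of the longer run.
hits = re_isbn10.findall(text)
print(hits)
```

Such hits have to be filtered afterwards, for example by validating the
ISBN-10 check digit.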

> So what's the deal? - why would it take so long to run the correct one?
> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...

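Those .* make the expression match a LOT of text at first, only to
discard it in the next step. Each .* is greedy: it grabs everything up to
the end of the line, and then the engine has to back off one character at a
time until the rest of the pattern can match.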
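A small sketch of what that means in practice, assuming Python 3 and an
invented one-line sample with two placeholder numbers. It compares how much
text the pattern from the question matches as a whole with what ends up in
its group, and then runs the bare pattern on the same line.

```python
import re

# The pattern from the question, and the trimmed-down suggestion.
re_wrapped = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
re_plain = re.compile(r"\d{9}[\dX]")

# Invented sample line with two placeholder numbers.
line = "a 0306406152 b 123456789X c"

m = re_wrapped.search(line)
whole, captured = m.group(0), m.group(1)

# The overall match covers far more text than the group we care about.
print(len(whole), len(captured))
print(repr(captured))

# The bare pattern only ever touches the candidates themselves.
print(re_plain.findall(line))
```

On this sample the wrapped pattern matches the entire line, and the leading
.* swallows the first number, so only the second one lands in the group. The
bare pattern reports both.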

-- 
Gabriel Genellina