RegExp performance?

Christian Sonne FreakCERS at gmail.com
Sun Feb 25 03:21:49 EST 2007


Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* pattern to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"

(Note that I've removed all '-'s from the string, because they tend to
be mixed into ISBNs.)
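
As a rough sketch, that preprocessing step could look like the
following. It is written in Python 3 syntax; `raw_text` is a made-up
name and the sample string is invented for illustration, not taken from
my data:

```python
# Hypothetical input; the real text cannot be shared.
raw_text = "ISBN 0-306-40615-2 and 0-8044-2957-X"

# Drop every hyphen so each ISBN-10 becomes one unbroken run of characters.
contents = raw_text.replace("-", "")
print(contents)
```

This removes every hyphen in the text, not only those inside ISBNs, so
hyphenated words get joined as well.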

However, on my AMD64 3200+ machine, running the following:

import re

reISBN10 = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
isbn10s = reISBN10.findall(contents)

(where contents is the string)

takes about 14 minutes, and there are only one or two matches.

If I change the pattern to ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
losing results, but it runs in about 0.3 seconds.

So what's the deal? Why would the correct pattern take so long to run,
especially when a slight modification makes it run as fast as I'd
expected from the beginning?


I'm sorry I cannot supply test data; in my case, it comes from
copyrighted material. However, if it proves necessary, I can probably
construct dummy data to illustrate the problem.
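
Something along these lines might serve as dummy data. It is only a
sketch in Python 3 syntax: `make_filler` and `time_findall` are names
invented here, the filler is random digit-free text, and it is an
assumption that such text triggers the same slowdown as the real
material. The sizes are deliberately tiny so that the first pattern
still finishes quickly; the point is to watch how its time grows as
lines are added.

```python
import random
import re
import string
import time

# The two patterns quoted above.
SLOW = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
FAST = re.compile(r".*[ ]*(\d{10}|\d{9}X)[ ]*.*")


def make_filler(n_lines, line_len=40, seed=0):
    """Return n_lines of random lowercase letters and spaces (no digits).

    Hypothetical stand-in for the real text, which cannot be shared.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    return "\n".join(
        "".join(rng.choice(alphabet) for _ in range(line_len))
        for _ in range(n_lines)
    )


def time_findall(pattern, text):
    """Return (matches, elapsed seconds) for pattern.findall(text)."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start


# Sanity check on a tiny sample holding one ISBN-like token.
sample = "title page\nISBN 0306406152\nmore text\n"
print(SLOW.findall(sample), FAST.findall(sample))

# Time both patterns on growing amounts of digit-free filler.
for n_lines in (4, 8, 16):
    text = make_filler(n_lines)
    _, t_slow = time_findall(SLOW, text)
    _, t_fast = time_findall(FAST, text)
    print(f"{n_lines:3d} lines: slow {t_slow:.4f}s, fast {t_fast:.4f}s")
```

On the small sample both patterns return the single planted token. The
timings will differ from machine to machine, so only the trend across
the three sizes is meaningful.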


Any and all guidance would be greatly appreciated,
kind regards
Christian Sonne

PS: be gentle - it's my first post here :-)


