RegExp performance?
Christian Sonne
FreakCERS at gmail.com
Sun Feb 25 03:21:49 EST 2007
Long story short, I'm trying to find all ISBN-10 numbers in a multiline
string (approximately 10 pages of a normal book), and as far as I can
tell, the *correct* thing to match would be this:
".*\D*(\d{10}|\d{9}X)\D*.*"
(Note that I've removed all '-' characters from the string, because
they tend to be mixed into ISBNs.)
However, on my AMD64 3200+, running the following:
import re
reISBN10 = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
isbn10s = reISBN10.findall(contents)
(where contents is the string)
This takes about 14 minutes, and there are only one or two matches.
If I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
losing results, but it runs in about 0.3 seconds.
So what's the deal? Why would the correct one take so long to run,
especially when a slight modification makes it run as fast as I'd
expected in the first place?
I'm sorry I cannot supply test data; in my case it comes from
copyrighted material. If it proves necessary, though, I can probably
construct dummy data to illustrate the problem.
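For illustration, dummy data could be built along these lines. This is
only a sketch: the filler text, the two ISBN-like numbers, and the helper
names make_dummy and timed_findall are all made up, and the sizes are kept
small so that both patterns finish quickly.

```python
import re
import time

# Hypothetical filler: digit-free lines standing in for the book text.
FILLER_LINE = "the quick brown fox jumps over the lazy dog\n"

# The two patterns from the post, written as raw strings.
SLOW = r".*\D*(\d{10}|\d{9}X)\D*.*"
FAST = r".*[ ]*(\d{10}|\d{9}X)[ ]*.*"


def make_dummy(blocks=3, lines_per_block=5):
    """Build a small multiline string containing two made-up ISBN-like numbers.

    Each block is a run of digit-free lines followed by a short number
    (a fake page marker), so the text contains digits that are not ISBNs.
    """
    block = FILLER_LINE * lines_per_block + "page 12\n"
    return (
        "ISBN 0123456789\n"
        + block * blocks
        + "see also 012345678X\n"
        + block * blocks
    )


def timed_findall(pattern, text):
    """Return (matches, elapsed_seconds) for findall with the given pattern."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    matches = compiled.findall(text)
    return matches, time.perf_counter() - start


if __name__ == "__main__":
    contents = make_dummy()
    for name, pattern in (("slow", SLOW), ("fast", FAST)):
        matches, seconds = timed_findall(pattern, contents)
        print(name, matches, "%.4f s" % seconds)
```

Raising lines_per_block lengthens the digit-free stretches between the
short numbers, which is one way to probe how each pattern's running time
grows with the size of the input.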
Any and all guidance would be greatly appreciated,
kind regards
Christian Sonne
PS: be gentle - it's my first post here :-)