RegExp performance?

John Machin sjmachin at lexicon.net
Sun Feb 25 05:47:59 EST 2007


On Feb 25, 7:21 pm, Christian Sonne <FreakC... at gmail.com> wrote:
> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"

All of those *s are making it work too hard.

Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the
r"..." not just "....", don't you?]

Step 1: Lose the .* off each end -- this is meaningless in the context
of a search() or findall() and would slow the re engine down if it
doesn't optimise it away.

r"\D*(\d{10}|\d{9}X)\D*"
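A quick check that nothing is lost by dropping them (a minimal sketch
-- the sample text and ISBN are made up):

    import re

    text = "see page 42, ISBN 0596002815, for details"
    padded = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
    trimmed = re.compile(r"\D*(\d{10}|\d{9}X)\D*")
    # Both searches capture the same ISBN; the .* at each end adds nothing.
    assert padded.search(text).group(1) == trimmed.search(text).group(1)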

Step 2: I presume that the \D* at each (remaining) end is meant to
ensure that you don't pick up a number with 11 or more digits --
although, since \D* can match zero characters, it doesn't actually
manage even that. You only need to test that your presumed ISBN is not
preceded/followed by ONE suspect character. Is ABC1234567890DEF OK?
I think not; I'd use \b instead of \D.

r"\b(\d{10}|\d{9}X)\b"
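To see the zero-width hole mentioned above in action (made-up data;
the X branch is omitted for brevity):

    import re

    eleven = "12345678901"   # 11 digits -- not an ISBN
    print(re.findall(r"\D*(\d{10})\D*", eleven))  # ['1234567890'] -- false hit
    print(re.findall(r"\b(\d{10})\b", eleven))    # [] -- correctly rejected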

Step 3: Now that we have only \b (which matches 0 characters) at each
end of the ISBN, we don't need the capturing ( ). Don't just delete
the parentheses, though -- the | would then split the whole pattern
into \b\d{10} and \d{9}X\b, each anchored at only one end (see the
demo below). Make the group non-capturing instead:

r"\b(?:\d{10}|\d{9}X)\b"
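For instance, the half-anchored version reports a bogus hit (made-up
string):

    import re

    s = "1234567890X"   # 11 characters -- not a valid ISBN-10
    print(re.findall(r"\b\d{10}|\d{9}X\b", s))      # ['1234567890'] -- bogus
    print(re.findall(r"\b(?:\d{10}|\d{9}X)\b", s))  # [] -- rejected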

Step 4: In the case of 123456789X, the first alternative (\d{10})
fails on the X, and the engine then rescans the same 123456789 for the
second alternative -- a tiny waste compared to all the * stuff, but
still worth fixing with a character class:

r"\b\d{9}[0-9X]\b"

Give that a whirl and let us know how correct and how fast it is.
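For example (sample text and ISBN-ish strings made up; I haven't
checksum-verified them):

    import re

    reISBN10 = re.compile(r"\b\d{9}[0-9X]\b")
    contents = "First edition 0131103628, second edition 020161622X."
    print(reISBN10.findall(contents))
    # ['0131103628', '020161622X']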

>
> (it should be noted that I've removed all '-'s in the string, because
> they have a tendency to be mixed into ISBNs)
>
> however, on my 3200+ amd64, running the following:
>
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")

You should really get into the habit of using raw strings with re.

> isbn10s = reISBN10.findall(contents)
>
> (where contents is the string)
>
> this takes about 14 minutes - and there are only one or two matches...

How many actual matches and how many expected matches?

Note on "and there are only one or two matches": if your data
consisted only of valid ISBNs separated by a single space or comma, it
would run much faster. It is all the quadratic/exponential mucking
about with the in-between bits that slows it down. To demonstrate
this, try timing dummy data like "1234567890 " * 1000000 and
"123456789X " * 1000000 with your various regexes, and with the Step 1,
Step 2, etc. regexes above.
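Something like this rough sketch (scaled down from 1000000 repetitions
so it finishes quickly):

    import re, time

    data = "1234567890 " * 100000
    for pat in (r".*\D*(\d{10}|\d{9}X)\D*.*",   # original
                r"\b(?:\d{10}|\d{9}X)\b",       # Step 3
                r"\b\d{9}[0-9X]\b"):            # Step 4
        t0 = time.time()
        n = len(re.compile(pat).findall(data))
        print("%-30s %7d matches in %.3fs" % (pat, n, time.time() - t0))

Note that on clean data like this even the original finishes quickly --
though it reports only ONE match, because the .*s swallow the whole
string in a single gulp.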

>
> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
> losing results, but it runs in about 0.3 seconds
>
> So what's the deal? - why would it take so long to run the correct one?

Because you have .*\D*, in other words 0 or more occurrences of almost
anything followed by 0 or more occurrences of almost anything. Even
assuming it ignores the .*, the engine will find the longest possible
sequence of non-digits, then try to match the ISBN stuff. If that
fails, it will shorten the sequence of non-digits, try the ISBN stuff
again, and so on, until it either matches the ISBN stuff or the
sequence of non-digits is down to zero length. It will do that for
each character position in the file contents. Why is it faster when
you change \D to [ ]? Presumably because in your data, sequences of
non-digits are longer than sequences of spaces, IOW there is less
stuff to back-track over.
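You can put rough numbers on the back-tracking with junk that contains
no ISBN at all (made-up data, kept short on purpose -- watch how the
original's time grows much faster than the input does):

    import re, time

    for n in (100, 200, 400):
        junk = "x" * n   # pure non-digits: worst case for .*\D*
        t0 = time.time()
        re.findall(r".*\D*(\d{10}|\d{9}X)\D*.*", junk)
        t1 = time.time()
        re.findall(r"\b\d{9}[0-9X]\b", junk)
        t2 = time.time()
        print("%4d chars: original %.4fs, Step 4 %.6fs"
              % (n, t1 - t0, t2 - t1))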


> - especially when a slight modification makes it run as fast as I'd
> expect from the beginning...
>
> I'm sorry I cannot supply test data, in my case, it comes from
> copyrighted material - however if it proves needed, I can probably
> construct dummy data to illustrate the problem

What you call "dummy data", I'd call "test data". You should make a
sample of "dummy data" and test that your regex is (a) finding all the
ISBNs that it should and (b) not reporting incorrect matches, *BEFORE*
being concerned with speed.
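For example (a minimal sketch; the strings are made up and not
checksum-verified):

    import re

    reISBN10 = re.compile(r"\b\d{9}[0-9X]\b")
    dummy = ("good: 0672317249 and 043942089X; "
             "bad: 98765432109 and ABC1234567890DEF")
    print(reISBN10.findall(dummy))
    # ['0672317249', '043942089X'] -- the bad ones are correctly skipped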

>
> Any and all guidance would be greatly appreciated,

For some light reading :-) borrow a copy of Friedl's book (mentioned
in the Python re docs) and read the parts on how backtracking regex
engines work.

HTH,
John



