RegExp performance?

Sun Feb 25 13:29:06 EST 2007

John Machin wrote:
> On Feb 25, 7:21 pm, Christian Sonne <FreakC... at gmail.com> wrote:
>> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
>> string (approximately 10 pages of a normal book), and as far as I can
>> tell, the *correct* thing to match would be this:
>> ".*\D*(\d{10}|\d{9}X)\D*.*"
> 
> All of those *s are making it work too hard.
> 
> Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the
> r"..." not just "....", don't you?]
> 
> Step 1: Lose the .* off each end -- this is meaningless in the context
> of a search() or findall() and would slow the re engine down if it
> doesn't optimise it away.
> 
> r"\D*(\d{10}|\d{9}X)\D*"
> 
> Step 2: I presume that the \D* at each (remaining) end is to ensure
> that you don't pick up a number with 11 or more digits. You only need
> to test that your presumed ISBN is not preceded/followed by ONE
> suspect character. Is ABC1234567890DEF OK? I think not; I'd use \b
> instead of \D
> 
> r"\b(\d{10}|\d{9}X)\b"
> 
> Step 3: Now that we have only \b (which matches 0 characters) at each
> end of the ISBN, we can lose the capturing ()
> 
> r"\b\d{10}|\d{9}X\b"
> 
> Step 4: In the case of 123456789X, it fails on the X and then scans
> the 123456789 again -- a tiny waste compared to all the * stuff, but
> still worth fixing.
> 
> r"\b\d{9}[0-9X]\b"
> 
> Give that a whirl and let us know how correct and how fast it is.
> 
>> (it should be noted that I've removed all '-'s in the string, because
>> they have a tendency to be mixed into ISBN's)
>>
>> however, on my 3200+ amd64, running the following:
>>
>> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
> 
> You should really get into the habit of using raw strings with re.
> 
>> isbn10s = reISBN10.findall(contents)
>>
>> (where contents is the string)
>>
>> this takes about 14 minutes - and there are only one or two matches...
> 
> How many actual matches and how many expected matches?
> 
> Note on "and there are only one or two matches": if your data
> consisted only of valid ISBNs separated by a single space or comma, it
> would run much faster. It is all the quadratic/exponential mucking
> about with the in-between bits that slows it down. To demonstrate
> this, try timing dummy data like "1234567890 " * 1000000 and
> "123456789X " * 1000000 with your various regexes, and with the step1,
> step2 etc regexes above.
> 
>> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk
>> loosing results, but it runs in about 0.3 seconds
>>
>> So what's the deal? - why would it take so long to run the correct one?
> 
> Because you have .*\D* in other words 0 or more occurrences of almost
> anything followed by 0 or more occurrences of almost anything. Even
> assuming it ignores the .*, it will find the longest possible sequence
> of non-digits, then try to match the ISBN stuff. If it fails, it will
> shorten that sequence of non-digits, try the ISBN stuff again, etc etc
> until it matches the ISBN stuff or that sequence of non-digits is down
> to zero length. It will do that for each character position in the
> file contents. Why is it faster when you change \D to []? Presumably
> because in your data, sequences of non-digits are longer than
> sequences of spaces IOW there is less stuff to back-track over.
> 
> 
>> - especially when a slight modification makes it run as fast as I'd
>> expect from the beginning...
>>
>> I'm sorry I cannot supply test data, in my case, it comes from
>> copyrighted material - however if it proves needed, I can probably
>> construct dummy data to illustrate the problem
> 
> What you call "dummy data", I'd call "test data". You should make a
> sample of "dummy data" and test that your regex is (a) finding all
> ISBNs that it should (b) not reporting incorrect matches, *BEFORE*
> being concerned with speed.
> 
>> Any and all guidance would be greatly appreciated,
> 
> For some light reading :-) borrow a copy of Friedl's book (mentioned
> in the Python re docs) and read the parts on how backtracking regex
> engines work.
> 
> HTH,
> John
> 

Thanks to all of you for your replies - they have been most helpful, and 
my program is now running at a reasonable pace...

I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it 
turns out to misbehave in further testing, I'll know where to turn :-P