RegExp performance?
Kirk Sluder
kirk at nospam.jobsluder.net
Sun Feb 25 22:01:29 EST 2007
In article <45e1d367$0$90273$14726298 at news.sunsite.dk>,
Christian Sonne <FreakCERS at gmail.com> wrote:
> Thanks to all of you for your replies - they have been most helpful, and
> my program is now running at a reasonable pace...
>
>
> I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> turns out to misbehave in further testing, I'll know where to turn :-P
Anything with variable-length wildcard matching (*, +, ?) is going to
drag your performance down. There was an earlier thread on this very
topic. A possibly stupid question: how are you planning to handle
ISBNs formatted with hyphens for readability?
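One simple approach (a sketch, not from the thread; the findIsbn helper
name is mine) is to strip hyphens before matching, so the same pattern
covers both forms:

```python
import re

# the ISBN-10 pattern from the thread
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

def findIsbn(text):
    # hypothetical helper: drop hyphens so "0-672-32897-6"
    # collapses to "0672328976" before the regex runs
    stripped = text.replace("-", "")
    match = isbn10Re.search(stripped)
    return match.group(0) if match else None
```

Note that stripping hyphens globally can merge unrelated numbers across
a hyphen, so treat this as a rough filter rather than a precise match.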
In general I've found the following factors to be critical in
getting good performance from re:
1: Reducing the number of times you call re.match or re.search.
2: Reducing the number of bytes that re has to search through.
3: Minimizing the use of wildcards in the expression.
If you can pre-filter your input with string.find before running
re.match, you will improve performance quite a bit compared to
running re expressions over all 10 pages. I played around a bit and
attached some example code below searching over 21K of text for the
ISBN number. testPrefilter() runs in about 1/5th the execution time of
line-by-line re calls or a single re call over a 21K string.
Interestingly, this ratio holds even for something as big as Moby
Dick.
The searchLabels() function below beats all the other functions by
searching for "ISBN" or "International Book" and then using RE on
the surrounding 500 bytes. You might also try searching for
"Copyright" or "Library of Congress" since most modern books will
have it all on the same page.
A caveat here is that this only works if you can find a reasonably
unique string at or near what you want to find with re. If you need
to run re.search on every byte of the file anyway, this isn't going
to help.
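Since r"\b\d{9}[0-9X]\b" will also match any bare ten-digit number, a
check-digit test can weed out false positives before you accept a hit.
A sketch (the isValidIsbn10 name is mine, not from the thread): ISBN-10
requires the weighted digit sum, with weights 10 down to 1 and a final
'X' counting as 10, to be divisible by 11.

```python
def isValidIsbn10(candidate):
    # ISBN-10 check digit: sum of digit * weight over the ten
    # characters (weights 10, 9, ..., 1; a trailing 'X' counts
    # as 10) must be divisible by 11
    if len(candidate) != 10:
        return False
    total = 0
    for weight, char in zip(range(10, 0, -1), candidate):
        value = 10 if char == 'X' else int(char)
        total += weight * value
    return total % 11 == 0
```

Running this on each regex hit costs almost nothing compared to the
search itself.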
---------------
timing test code
---------------
#!/usr/bin/env python
from timeit import Timer
import re
textString = """The text of a sample page with an ISBN 10
number ISBN 0672328976 and some more text to compare."""
#add the full text of Moby Dick to make the search functions
#work for their bread.
fileHandle = open("./mobey.txt")
junkText = fileHandle.readlines()
fileHandle.close()
junkText.append(textString)
textString = ''.join(junkText)
#print textString
#compile the regex
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")
def testPrefilter():
    """Work through a pre-loaded array running re only on lines
    containing ISBN"""
    for line in junkText:
        #search for 'ISBN'
        if line.find('ISBN') > -1:
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)
def testNofilter():
    """Run re.search on every line."""
    for line in junkText:
        #searching using RE
        thisMatch = isbn10Re.search(line)
        if thisMatch:
            return thisMatch.group(0)
def testFullre():
    """Run re.search on a long text string."""
    thisMatch = isbn10Re.search(textString)
    if thisMatch:
        return thisMatch.group(0)
def searchLabels():
    #identify some text that might be near an ISBN number.
    isbnLabels = ["ISBN",
                  "International Book"]
    #use the fast string.find method to locate those
    #labels in the text
    isbnIndexes = [textString.find(x) for x in isbnLabels]
    #exclude labels not found in the text.
    isbnIndexes = [y for y in isbnIndexes if y > -1]
    #run re.search on a 500 character window around the text label.
    for x in isbnIndexes:
        thisMatch = isbn10Re.search(textString[max(0, x - 250):x + 250])
        if thisMatch:
            return thisMatch.group(0)
#print searchLabels()
#print testPrefilter()
#print testNofilter()
t = Timer("testNofilter()","from __main__ import testNofilter")
print t.timeit(100)
u = Timer("testPrefilter()","from __main__ import testPrefilter")
print u.timeit(100)
v = Timer("testFullre()","from __main__ import testFullre")
print v.timeit(100)
w = Timer("searchLabels()", "from __main__ import searchLabels")
print w.timeit(100)