RegExp performance?
Kirk Sluder
kirk at nospam.jobsluder.net
Sun Feb 25 22:01:29 EST 2007
In article <45e1d367$0$90273$14726298 at news.sunsite.dk>,
Christian Sonne <FreakCERS at gmail.com> wrote:
> Thanks to all of you for your replies - they have been most helpful, and
> my program is now running at a reasonable pace...
>
>
> I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it
> turns out to misbehave in further testing, I'll know where to turn :-P
Anything with variable-length wildcard matching (*, +, ?) is going to
drag your performance down. There was an earlier thread on this very
topic. A possibly stupid question: how are you planning to handle
ISBNs formatted with hyphens for readability?
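One simple approach (a sketch, not from the thread; the findIsbn helper
name is mine) is to strip hyphens before matching, so the same pattern
covers both forms:

```python
import re

# the ISBN-10 pattern from the thread
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")

def findIsbn(text):
    # hypothetical helper: drop hyphens so "0-672-32897-6"
    # collapses to "0672328976" before the regex runs
    stripped = text.replace("-", "")
    match = isbn10Re.search(stripped)
    return match.group(0) if match else None
```

Note that stripping hyphens globally can merge unrelated numbers across
a hyphen, so treat this as a rough filter rather than a precise match.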
In general I've found the following factors to be critical in
getting good performance from re:
1: Reducing the number of times you call re.match or re.search.
2: Reducing the number of bytes that re has to search through.
3: Minimizing the use of wildcards in the expression.
If you can pre-filter your input with string.find before running
re.match, you will improve performance quite a bit compared to
running re expressions over all 10 pages. I played around a bit and
attached some example code below searching over 21K of text for the
ISBN number. testPrefilter() runs in about 1/5th the execution time of
line-by-line re calls or a single re call over a 21K string.
Interestingly, this ratio holds even for something as big as Moby
Dick.
The searchLabels() function below beats all the other functions by
searching for "ISBN" or "International Book" and then using RE on
the surrounding 500 bytes. You might also try searching for
"Copyright" or "Library of Congress" since most modern books will
have it all on the same page.
A caveat here is that this only works if you can find a reasonably
unique string at or near what you want to find with re. If you need
to run re.search on every byte of the file anyway, this isn't going
to help.
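Since r"\b\d{9}[0-9X]\b" will also match any bare ten-digit number, a
check-digit test can weed out false positives before you accept a hit.
A sketch (the isValidIsbn10 name is mine, not from the thread): ISBN-10
requires the weighted digit sum, with weights 10 down to 1 and a final
'X' counting as 10, to be divisible by 11.

```python
def isValidIsbn10(candidate):
    # ISBN-10 check digit: sum of digit * weight over the ten
    # characters (weights 10, 9, ..., 1; a trailing 'X' counts
    # as 10) must be divisible by 11
    if len(candidate) != 10:
        return False
    total = 0
    for weight, char in zip(range(10, 0, -1), candidate):
        value = 10 if char == 'X' else int(char)
        total += weight * value
    return total % 11 == 0
```

Running this on each regex hit costs almost nothing compared to the
search itself.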
---------------
timing test code
---------------
#!/usr/bin/env python
from timeit import Timer
import re
textString = """The text of a sample page with an ISBN 10
number ISBN 0672328976 and some more text to compare."""
#add the full text of Moby Dick to make the search functions
#work for their bread.
fileHandle = open("./mobey.txt")
junkText = fileHandle.readlines()
fileHandle.close()
junkText.append(textString)
textString = ''.join(junkText)
#print textString
#compile the regex
isbn10Re = re.compile(r"\b\d{9}[0-9X]\b")
def testPrefilter():
    """Work through a pre-loaded array running re only on lines
    containing ISBN"""
    for line in junkText:
        #search for 'ISBN'
        if line.find('ISBN') > -1:
            thisMatch = isbn10Re.search(line)
            if thisMatch:
                return thisMatch.group(0)
def testNofilter():
    """Run re.search on every line."""
    for line in junkText:
        #searching using RE
        thisMatch = isbn10Re.search(line)
        if thisMatch:
            return thisMatch.group(0)
def testFullre():
    """Run re.search on a long text string."""
    thisMatch = isbn10Re.search(textString)
    if thisMatch:
        return thisMatch.group(0)
def searchLabels():
    #identify some text that might be near an ISBN number.
    isbnLabels = ["ISBN",
                  "International Book"]
    #use the fast string.find method to locate those
    #labels in the text
    isbnIndexes = [textString.find(x) for x in isbnLabels]
    #exclude labels not found in the text.
    isbnIndexes = [y for y in isbnIndexes if y > -1]
    #run re.search on a 500 character window around the text label.
    for x in isbnIndexes:
        thisMatch = isbn10Re.search(textString[max(0, x - 250):x + 250])
        if thisMatch:
            return thisMatch.group(0)
#print searchLabels()
#print testPrefilter()
#print testNofilter()
t = Timer("testNofilter()","from __main__ import testNofilter")
print t.timeit(100)
u = Timer("testPrefilter()","from __main__ import testPrefilter")
print u.timeit(100)
v = Timer("testFullre()","from __main__ import testFullre")
print v.timeit(100)
w = Timer("searchLabels()", "from __main__ import searchLabels")
print w.timeit(100)