String search vs regexp search
Anand Pillai
pythonguy at Hotpop.com
Mon Oct 13 12:40:26 EDT 2003
Sorry for being too brief!
I was talking about a function which 'counts' the number
of occurrences using string and regexp methods.
I wrote the code for the regexp search as well as the string
search and tested it on a rather large file (800 KB) for
occurrences of a certain word. I find that the string search
is at least two times faster than the regexp version, even
excluding the time for the re.compile() call. This is particularly
noticeable when the file is quite large and the word is
spread out.
I had thought the regexp would beat the string search hands down,
and I am surprised that the result is the other way around.
Here is the code. Note that I am using the 'count' style methods that
count the number of occurrences rather than the 'find' methods.
# Test to find out whether string search in a body of data
# is faster than regexp search.
# Results: String search is much faster when there are
# many occurrences of the substring.

import time
import re

def strsearch1(s, substr):
    t1 = time.time()
    print 'Count 1 =>', s.count(substr)
    t2 = time.time()
    print 'Searching using string, Time taken => ', t2 - t1

def strsearch2(s, substr):
    # Note: re.IGNORECASE makes this case-insensitive, unlike
    # str.count() above, so the two are not strictly equivalent.
    r = re.compile(substr, re.IGNORECASE)
    t1 = time.time()
    print 'Count 2 =>', len(r.findall(s))
    t2 = time.time()
    print 'Searching using regexp, Time taken => ', t2 - t1

data = open("test.html", "r").read()
strsearch1(data, "Miriam")
strsearch2(data, "Miriam")
# Output here...
D:\Programming\python>python strsearch.py
Count 1 => 45
Searching using string, Time taken => 0.0599999427795
Count 2 => 45
Searching using regexp, Time taken => 0.110000014305
Test was done on a Windows 98 machine using Python 2.3, running
with 248 MB RAM on an Intel 1.7 GHz processor.
I was thinking of using regexp searches in my code, but this convinces
me to stick with the good old string search.
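As a cross-check, the timeit module gives a steadier measurement by repeating the search many times and keeping setup such as re.compile() out of the timed loop. This is a minimal sketch in Python 3 syntax, with made-up sample data standing in for test.html:

```python
import re
import timeit

# Hypothetical stand-in for the original 800 KB test.html data.
data = "Some filler text mentioning Miriam now and then. " * 5000
substr = "Miriam"

# Compiled once, outside the timed loop, so compile cost is excluded.
pattern = re.compile(re.escape(substr))

t_str = timeit.timeit(lambda: data.count(substr), number=100)
t_re = timeit.timeit(lambda: len(pattern.findall(data)), number=100)

print("str.count :", t_str)
print("re.findall:", t_re)
```

With both searches case-sensitive and the compile excluded, any remaining gap reflects the search loops themselves rather than setup overhead.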
Thanks for the replies.
-Anand
Duncan Booth <duncan at NOSPAMrcp.co.uk> wrote in message news:<Xns941360C9B9445duncanrcpcouk at 127.0.0.1>...
> pythonguy at Hotpop.com (Anand Pillai) wrote in
> news:84fc4588.0310120655.ea95af6 at posting.google.com:
>
> > To search a word in a group of words, say a paragraph or a web page,
> > would a string search or a regexp search be faster?
> >
> > The string search would of course be,
> >
> > if str.find(substr) != -1:
> >     domything()
> >
> > And the regexp search assuming no case restriction would be,
> >
> > strre = re.compile(substr, re.IGNORECASE)
> >
> > m = strre.search(str)
> > if m:
> >     domything()
> >
> > I was about to do a test, then I thought someone here might have
> > some data on this already.
> >
> Yes. The answer is 'it all depends'.
>
> Things it depends on include:
>
> Your two bits of code do different things: one is case-sensitive, the
> other ignores case. Which did you need?
>
> How long is the string you are searching? How long is the substring?
>
> Is the substring the same every time, or are you always searching for
> different strings? Can the substring contain characters with special
> meanings for regular expressions?
>
> The regular expression code has a startup penalty since it has to compile
> the regular expression at least once, however the actual searching may be
> faster than the naive str.find. If the time spent doing the search is
> sufficiently long compared with the time doing the compile, the regular
> expression may win out.
>
> Bottom line: write the code so it is as clean and maintainable as possible.
> Only worry about optimising this if you have timed it and know that your
> searches are a bottleneck.