String search vs regexp search

Anand Pillai pythonguy at Hotpop.com
Mon Oct 13 12:40:26 EDT 2003


Sorry for being too brief!

I was talking about a function which 'counts' the number
of occurrences, once using a plain string method and once using a regexp.

I wrote the code for both the regexp search and the plain string
search and tested it on a rather large file (800 KB), counting the
occurrences of a certain word. I find that the string search is at
least twice as fast as the regexp version, excluding the time taken
by re.compile(). This is particularly noticeable when the file is
large and the word is spread throughout it.

I had thought the regexp would beat the string search hands down,
and I am surprised that the result is the other way around.

Here is the code. Note that I am using the 'count' style of methods,
which count the number of occurrences, rather than the 'find' methods.

# Test to find out whether string search in a data
# is faster than regexp search.

# Results: String search is much faster when there are
# many occurrences of the substring.

import re
import time

def strsearch1(s, substr):
    # Count occurrences using the plain string method str.count.
    t1 = time.time()
    print 'Count 1 =>', s.count(substr)
    t2 = time.time()
    print 'Searching using string, Time taken => ', t2 - t1

def strsearch2(s, substr):
    # Count occurrences using a compiled regexp; the compile step is
    # deliberately kept outside the timed section.
    r = re.compile(substr, re.IGNORECASE)
    t1 = time.time()
    print 'Count 2 =>', len(r.findall(s))
    t2 = time.time()
    print 'Searching using regexp, Time taken => ', t2 - t1


data = open("test.html", "r").read()
strsearch1(data, "Miriam")
strsearch2(data, "Miriam")

# Output here...

D:\Programming\python>python strsearch.py
Count 1 => 45
Searching using string, Time taken =>  0.0599999427795
Count 2 => 45
Searching using regexp, Time taken =>  0.110000014305

The test was done on a Windows 98 machine running Python 2.3, with
248 MB RAM and a 1.7 GHz Intel processor.
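
One caveat with my own test, which Duncan also notes below: the regexp is
compiled with re.IGNORECASE while s.count() is case-sensitive, so the two
searches are not doing exactly the same thing. A rough sketch of a
case-insensitive plain-string version (the strsearch1_nocase name is just
for illustration) would lowercase both sides before counting:

def strsearch1_nocase(s, substr):
    # Case-insensitive plain-string count: lowercase both the data and
    # the pattern, then count as before. Note that s.lower() copies the
    # whole string, and that copy is included in the timed section here.
    t1 = time.time()
    print 'Count 1 (no case) =>', s.lower().count(substr.lower())
    t2 = time.time()
    print 'Searching using lowered string, Time taken => ', t2 - t1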
 
I was thinking of using regexp searches in my code, but this convinces
me to stick to the good old string search.

Thanks for the replies.

-Anand

Duncan Booth <duncan at NOSPAMrcp.co.uk> wrote in message news:<Xns941360C9B9445duncanrcpcouk at 127.0.0.1>...
> pythonguy at Hotpop.com (Anand Pillai) wrote in 
> news:84fc4588.0310120655.ea95af6 at posting.google.com:
> 
> > To search a word in a group of words, say a paragraph or a web page,
> > would a string search or a regexp search be faster?
> > 
> > The string search would of course be,
> > 
> > if str.find(substr) != -1:
> >     domything()
> > 
> > And the regexp search assuming no case restriction would be,
> > 
> > strre=re.compile(substr, re.IGNORECASE)
> > 
> > m=strre.search(str)
> > if m:
> >    domything() 
> > 
> > I was about to do a test, then I thought someone here might have
> > some data on this already.
> > 
> Yes. The answer is 'it all depends'.
> 
> Things it depends on include:
> 
> Your two bits of code do different things, one is case sensitive, one 
> ignores case. Which did you need?
> 
> How long is the string you are searching? How long is the substring?
> 
> Is the substring the same every time, or are you always searching for 
> different strings. Can the substring contain characters with special 
> meanings for regular expressions?
> 
> The regular expression code has a startup penalty since it has to compile 
> the regular expression at least once, however the actual searching may be 
> faster than the naive str.find. If the time spent doing the search is 
> sufficiently long compared with the time doing the compile, the regular 
> expression may win out.
> 
> Bottom line: write the code so it is as clean and maintainable as possible. 
> Only worry about optimising this if you have timed it and know that your 
> searches are a bottleneck.
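
For anyone who wants to repeat the timing, the timeit module that ships
with 2.3 gives steadier numbers than one-shot time.time() calls. A minimal
sketch, assuming the same test.html and search word as above (the setup,
pat, t_str and t_re names are just for illustration):

import timeit

setup = """
import re
data = open('test.html', 'r').read()
pat = re.compile('Miriam', re.IGNORECASE)
"""

# Run each statement 100 times; timeit reports the total time in seconds,
# and the file read and regexp compile stay outside the timed statement.
t_str = timeit.Timer("data.count('Miriam')", setup).timeit(100)
t_re = timeit.Timer("len(pat.findall(data))", setup).timeit(100)

print 'str.count  :', t_str
print 're.findall :', t_re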



