string search function

Christopher T King squirrel at WPI.EDU
Fri Jul 23 11:17:31 EDT 2004


On 23 Jul 2004, gyromagnetic wrote:

> Are there improvements that can be made to the code below? Are there
> better alternatives?

There are two ways I can think of:

1)

You can replace the 'or' search with a regular expression:

 import re

 def bsearch(fterms, stext, btype='AND'):
     if btype == 'AND':         # boolean 'and' search
         ...
     else:                       # boolean 'or' search
         found = bool(re.search(fterms.join('|'),stext))

     return found

Assuming Python's regexps are somewhat optimized, this should be about the 
fastest search you can get with arbitrary strings.  Unfortunately, there's 
no direct 'and' equivalent with regexps (it can be done, but it's helluva 
ugly).  If you want to optimize 'and', you'll need to use whatever 
advanced algorithms our friends at Google use ;)

2)

Assuming your search terms don't contain whitespace, and you want to
compare only entire words (not substrings), replace the string search with
a list search, simply by adding the line 'stext = stext.split()' to the
top of the search function.  This will greatly speed up the search by
avoiding checking partial words.  The same effect can be achieved with the 
regexp above by appending '\s' to the beginning and end of it.

Continuing along this line (whole words, no whitespace), you can replace 
the lists with sets for an even greater speedup.  The resulting code would 
look like this (with a bit of code stolen from sets.py):

 from sets import Set

 def ispartialsubset(self, other):
     """Report whether another set contains at least one member of this set."""
     self._binary_sanity_check(other)
     for elt in ifilter(other._data.has_key, self):
         return True
     return False

 Set.ispartialsubset = ispartialsubset

 def bsearch(fterms, stext, btype='AND'):
     stext = stext.split()
     if btype == 'AND':         # boolean 'and' search
         found = Set(fterms).issubset(Set(stext))
     else:                       # boolean 'or' search
         found = Set(fterms).ispartialsubset(Set(stext))
     return found

If the same stext or fterms is used multiple times, you could move the 
.split() and/or the Set() transformations outside of the function, so they 
only have to be done once.

Again, this method (using sets) only works if your words contain no 
whitespace, and you're not interested in matching parts of words.  It's 
/much/ faster than a string search, though.

Hope this helps!




More information about the Python-list mailing list