Most efficient method to search text?

Wed Oct 16 09:35:09 EDT 2002

robin.siebler at corp.palm.com (Robin Siebler) writes:

> I wrote a script to search a slew of files for certain words and
> names.  However, I am sure that there has to be a faster/better way to
> do it.

I should point out that I'm not really replying in order to help you
-- if I do, that's a bonus.  I've actually wanted to sit down and
write this code out for a while, and this seemed like a good excuse.

Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
one of a bunch of words is contained in a chunk of text, assuming the
words are known beforehand:

import string

wordchars = string.letters
allchars = map(chr, range(256))
hittoken = 'hit'

def _compile(wordlist, d, skipalong):
    t = {}
    for word in wordlist:
        t.setdefault(word[0], []).append(word[1:])
    for k, v in t.iteritems():
        d[k] = {}
        if '' in v:
            for c in allchars:
                d[k][c] = hittoken
            for c in wordchars:
                d[k][c] = skipalong
        else:
            for c in allchars:
                d[k][c] = skipalong
        _compile([y for y in v if y <> ''], d[k], skipalong)

def compile(wordlist):
    root = {}
    skipalong = {}
    for c in allchars:
        skipalong[c] = root
    for c in wordchars:
        skipalong[c] = skipalong
    for c in allchars:
        root[c] = root
    _compile(wordlist, root, skipalong)
    return root

def match(set, text):
    _hittoken = hittoken
    for c in text + '\000':
        set = set[c]
        if set is _hittoken:
            return 1
    else:
        return 0

You probably don't want to try and print the kind of things compile()
returns...

I don't really feel like explaining it, either, but it shouldn't be
too hard to grasp the principles if you've heard of a dfa.

You use it like this:

>>> set = robin.compile(['Tim', 'Guido', 'Barry'])
>>> files = glob.glob(os.path.expanduser("~/src/python/dist/src/Lib/*.py"))
>>> for file in files:
...  if robin.match(s, open(file).read()):
...      print os.path.basename(file)
... 
cgi.py
doctest.py
getpass.py
gettext.py
imputil.py
profile.py
pstats.py
pty.py
pydoc.py
random.py
rfc822.py
site.py
smtpd.py
tabnanny.py
telnetlib.py
threading.py
tokenize.py
whrandom.py
xmlrpclib.py
sets.py
heapq.py
>>> 

There are surely cleverer Boyer-Moore tricks you can pull, but I
personally think this code is really really neat.  You should be able
to write lexers very like this (though not full-on regexps --
backtracking ain't gonna fit, here).

Cheers,
M.

-- 
  ROOSTA:  Ever since you arrived on this planet last night you've
           been going round telling people that you're Zaphod
           Beeblebrox, but that they're not to tell anyone else.
                    -- The Hitch-Hikers Guide to the Galaxy, Episode 7