Most efficient method to search text?
Michael Hudson
mwh at python.net
Wed Oct 16 09:35:09 EDT 2002
robin.siebler at corp.palm.com (Robin Siebler) writes:
> I wrote a script to search a slew of files for certain words and
> names. However, I am sure that there has to be a faster/better way to
> do it.
I should point out that I'm not really replying in order to help you
-- if I do, that's a bonus. I've actually wanted to sit down and
write this code out for a while, and this seemed like a good excuse.
Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
one of a bunch of words is contained in a chunk of text, assuming the
words are known beforehand:
import string
wordchars = string.letters
allchars = map(chr, range(256))
hittoken = 'hit'
def _compile(wordlist, d, skipalong):
t = {}
for word in wordlist:
t.setdefault(word[0], []).append(word[1:])
for k, v in t.iteritems():
d[k] = {}
if '' in v:
for c in allchars:
d[k][c] = hittoken
for c in wordchars:
d[k][c] = skipalong
else:
for c in allchars:
d[k][c] = skipalong
_compile([y for y in v if y <> ''], d[k], skipalong)
def compile(wordlist):
root = {}
skipalong = {}
for c in allchars:
skipalong[c] = root
for c in wordchars:
skipalong[c] = skipalong
for c in allchars:
root[c] = root
_compile(wordlist, root, skipalong)
return root
def match(set, text):
_hittoken = hittoken
for c in text + '\000':
set = set[c]
if set is _hittoken:
return 1
else:
return 0
You probably don't want to try and print the kind of things compile()
returns...
I don't really feel like explaining it, either, but it shouldn't be
too hard to grasp the principles if you've heard of a dfa.
You use it like this:
>>> set = robin.compile(['Tim', 'Guido', 'Barry'])
>>> files = glob.glob(os.path.expanduser("~/src/python/dist/src/Lib/*.py"))
>>> for file in files:
... if robin.match(s, open(file).read()):
... print os.path.basename(file)
...
cgi.py
doctest.py
getpass.py
gettext.py
imputil.py
profile.py
pstats.py
pty.py
pydoc.py
random.py
rfc822.py
site.py
smtpd.py
tabnanny.py
telnetlib.py
threading.py
tokenize.py
whrandom.py
xmlrpclib.py
sets.py
heapq.py
>>>
There are surely cleverer Boyer-Moore tricks you can pull, but I
personally think this code is really really neat. You should be able
to write lexers very like this (though not full-on regexps --
backtracking ain't gonna fit, here).
Cheers,
M.
--
ROOSTA: Ever since you arrived on this planet last night you've
been going round telling people that you're Zaphod
Beeblebrox, but that they're not to tell anyone else.
-- The Hitch-Hikers Guide to the Galaxy, Episode 7
More information about the Python-list
mailing list