Text indexing and searching (was "Re: Python stuff")

Skip Montanaro skip at mojam.com
Fri Jan 7 08:44:51 EST 2000


    keagan> Thanks for replying to my post on the newsgroup, but as I said,
    keagan> I am a newbie. I didn't understand all of your post. I was
    keagan> wondering if I could create a dictionary that would search
    keagan> through a file to find matches:

Well, indirectly.  I'm not much of a text search person, but here's one
approach, assuming you're only interested in whole words.  Suppose your file
was

The quick brown fox ...
My country 'tis of thee ....
Mavis told the commissioner the brown fox was too slow.

You could create a dictionary whose keys are the words in the file and whose 
values are lists of offsets into the file.  Let's also assume
case-insensitive searching, so you might have a dictionary that looks like

{
    'the': [0, 64, 81],
    'quick': [4],
    'brown': [10, 85],
    etc.

}

Then, if someone searched for "brown", you would access the dictionary using
"brown" as the key.  This would return the list [10, 85].  You'd then
highlight the words at offsets 10 and 85.
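
In code, that lookup might look something like this (just a sketch, not
part of the script below; it assumes the file's contents are sitting in a
string named text and the index is a dictionary named index):

    word = 'brown'
    for off in index.get(word, []):
        # print each match plus a little trailing context
        print text[off:off+30]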

Obviously, for large files, that dictionary could get very big, and most of
the time you wouldn't use much of it.  The natural way to store it is in a
dbm file of some sort, which acts pretty much like a dictionary but is
stored on disk.

Here's a simpleminded script that takes a single text file and creates an
index such as the one I described.

    #!/usr/bin/env python

    """given a file in sys.argv[1], create a keyword-in-context index in the
    file in sys.argv[2], e.g.:

	build-offset.py my-story.txt my-story-index.db

    """

    import re, shelve, string, sys

    # note the conspicuous lack of any error checking!
    f = open(sys.argv[1])
    db = shelve.open(sys.argv[2])
    wordpat = re.compile(r'(\W+)')

    print "indexing", sys.argv[1], "into", sys.argv[2]

    offset = 0
    wordchars = string.letters + string.digits
    while 1:
        line = f.readline()
        if not line: break
        # split into alternating words and non-word separators (the
        # parentheses in the pattern keep the separators in the list)
        pieces = wordpat.split(string.lower(line))
        # walk the pieces, recording an offset for each word and advancing
        # the offset past words and separators alike...
        for piece in pieces:
            if piece and piece[0] in wordchars:
                lst = db.get(piece, [])
                lst.append(offset)
                db[piece] = lst
            offset = offset + len(piece)

    f.close()

    # demonstrate that we have a valid db...
    keys = db.keys()
    keys.sort()
    for k in keys:
	print "%s: %s" % (k, db[k])
    db.close()

When run on the three-line file above, it displays:

    indexing fox.txt into fox.db
    brown: [10, 85]
    commissioner: [68]
    country: [27]
    fox: [16, 91]
    mavis: [53]
    my: [24]
    of: [40]
    quick: [4]
    slow: [103]
    the: [0, 64, 81]
    thee: [43]
    tis: [36]
    told: [59]
    too: [99]
    was: [95]

(These offsets all assume Unix line endings (LF only).  On a PC you will see
slightly different values because of the CRLF line endings.)
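
And to close the loop, a search using the shelve file might look roughly
like this (an untested sketch, assuming the fox.txt/fox.db pair from the run
above and a query word on the command line; add error checking to taste):

    import shelve, string, sys

    text = open('fox.txt').read()
    db = shelve.open('fox.db')

    word = string.lower(sys.argv[1])        # e.g. "brown"
    for off in db.get(word, []):
        # print each hit with its offset and a little trailing context
        print "%d: %s" % (off, text[off:off+30])
    db.close()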

One thing that's also worth noting...  When you're looking for help with
something, it's best to follow up to the list to which you originally posted
your question.  There are plenty of people out there who can help.  By
limiting your followups to the first person who happens to respond, you lose
the collective benefit of the community as a whole.  In particular, I would
be quite surprised if someone on the list didn't have a better idea about
text searching or even pointers to modules that solve the problem in a much
better way than I've outlined.  Accordingly, I've cc'd
python-list at python.org to shed some more light on the subject.  (The tutor
list (python-tutor at python.org) is also an excellent source of help.)

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



