Example search-engine/indexer

Thu Apr 27 16:14:18 EDT 2000

Hello,

I was wondering if anybody has an example of how to index lots of
files and/or documents for fast searching. I'm doing sort of a
Yahoo/Altavista-thing and need an example that's fast and doesn't rely
on an external database system. Berkeley DB is OK, as is anything that
comes with a standard Python distro.

So far I've got a function that extracts words from files/folders and
words from documents. It works something like this:

1. Recursively process all files and folders in a given folder and its
subfolders. Documents found are opened and their words extracted. Each
word ends up in a dictionary, with the word acting as the key, pointing
to a list of ids, e.g. (1, 23), where 1 is the folder id and 23 is the
file id. Words from file and folder names are handled separately but in
a similar manner.

2. What I end up with is a dictionary like this:

>>> print(words['python'])
[(1, 23), (21, 45), (432, 21)]
>>> print(words['linux'])
[(12, 32), (30, 90), (4, 42)]
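
The two steps above could be sketched roughly as follows. This is only an illustration written for a current Python 3, not the poster's actual code: the names build_index and WORD_RE are made up, a "word" is assumed to be a lowercase run of letters and digits, and ids are assumed to be simple counters assigned in traversal order.

```python
import os
import re
from collections import defaultdict

# Assumption: a "word" is a run of ASCII letters/digits, compared in lowercase.
WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(root):
    """Walk root and map each word to a list of (folder_id, file_id) tuples."""
    words = defaultdict(list)
    folders = []  # folder_id -> folder path
    files = []    # file_id -> file path
    for folder_id, (dirpath, dirnames, filenames) in enumerate(os.walk(root)):
        dirnames.sort()  # sort in place so the traversal order is repeatable
        folders.append(dirpath)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            file_id = len(files)
            files.append(path)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    text = handle.read().lower()
            except OSError:
                continue  # unreadable file: keep its id but index nothing
            # set() records each word once per document
            for word in set(WORD_RE.findall(text)):
                words[word].append((folder_id, file_id))
    return words, folders, files
```

Here words['python'] would give a list of (folder_id, file_id) tuples like the one shown above, and the folders and files lists turn those ids back into paths.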

Using shelve and db_hash to store the keys and lists of ids works fine
for small amounts of data, but when the lists of ids (actually tuples
like (32, 43)) get huge, running into several thousand entries, the
whole thing takes up lots of space and gets slower.
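
One thing that might help with the space problem, as a rough sketch only: pack each list of ids into a flat byte string before handing it to shelve, so every (folder_id, file_id) pair costs a fixed 8 bytes. The helper names below are hypothetical, the code targets a current Python 3, and it assumes every id fits in an unsigned 32-bit integer. Whether this is actually faster would have to be measured on real data.

```python
import shelve
import struct


def pack_postings(postings):
    """Flatten [(folder_id, file_id), ...] into 8 bytes per pair."""
    flat = [n for pair in postings for n in pair]
    # "<%dI": little-endian, len(flat) unsigned 32-bit integers
    return struct.pack("<%dI" % len(flat), *flat)


def unpack_postings(blob):
    """Inverse of pack_postings: bytes back to a list of id tuples."""
    flat = struct.unpack("<%dI" % (len(blob) // 4), blob)
    return list(zip(flat[0::2], flat[1::2]))


def save_index(words, filename):
    """Store one packed byte string per word in a shelve file."""
    with shelve.open(filename) as db:
        for word, postings in words.items():
            db[word] = pack_postings(postings)


def lookup(filename, word):
    """Return the id tuples for word, or an empty list if it is not indexed."""
    with shelve.open(filename) as db:
        return unpack_postings(db.get(word, b""))
```

With this layout a lookup reads and decodes only the one entry it needs, e.g. lookup('index', 'python') for a shelve file named 'index'.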

Has anybody done anything similar? Dr. Dobb's Journal #295, Jan 1999,
has an example using Perl. Could anybody translate the code from the
article? It is short and sweet, like most Perl code, but not that easy
to understand, like most Perl code.

Any hints on how to do something like this, or a different method or
whatever, would be appreciated, example code most of all.

Tudelidu!!

Thanks,
Thomas