Any module or library for full-text indexing?
Russell Turpin
noone at do.not.use
Tue May 9 14:11:37 EDT 2000
I'm looking for a Python module that does full-text
indexing, i.e., that extracts a set of significant words
from a text document, and searches for a candidate word
in a list of words so extracted. The module should
solve the following problems:
COMMON WORD MANAGEMENT. No one wants to index on common
words such as "the," "of," and "what." Ideally, a module
that does full-text indexing would have some tool for
managing the set of words that are defined as "common."
Words not commonly found in a dictionary, such as "Noam" and
"Chomsky," are significant and should be indexed.
COGNATES. The module should have some way of identifying
variations of the same word when searching the index,
i.e., "goose" would also match on "geese," "mouse" on
"mice," and "456" on "four-hundred fifty-six." This
requires the module to have or make use of a language
dictionary in some form. (I would be more than happy
with noun cognates. Yeah, the number example is hard,
and not required.)
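To make the noun case concrete, here is a minimal sketch of the kind of matching meant. It is only an illustration, not an existing module: the IRREGULAR_NOUNS table and the base_form helper are hypothetical names, the table is a tiny hand-written stand-in for a real language dictionary, and the regular-plural rule is deliberately crude (it will mangle words such as "this" or "series"). The number example is not attempted.

```python
# Hypothetical table mapping irregular noun forms to a base form.
# A real solution would draw on a proper language dictionary.
IRREGULAR_NOUNS = {
    "geese": "goose",
    "mice": "mouse",
    "men": "man",
    "children": "child",
}


def base_form(word, irregular=IRREGULAR_NOUNS):
    """Reduce a word to a rough base form (assumption: English nouns only)."""
    word = word.lower()
    if word in irregular:
        return irregular[word]
    # Crude regular-plural handling: "ponies" -> "pony", "worms" -> "worm".
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def find(word, word_list):
    """Return the set of entries in word_list sharing a base form with word."""
    target = base_form(word)
    return {w for w in word_list if base_form(w) == target}


print(find("goose", ["geese", "mouse", "glacier"]))  # {'geese'}
```

Both the query and the indexed words are reduced to a base form before comparison, so "mice" finds "mouse" and vice versa. The hard part, of course, is the table and the rules, not the lookup.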
The package does not need to implement a persistence
mechanism, nor manage the indices and their referents. In
other words, the core functions I am looking for are:
    extract_significant: text -> word_list
    find: word, word_list -> set of hits
These would be trivial functions if not for the
linguistic aspects as described above, and it is
precisely these problems for which I'm hoping to find a
solution. Of course, if the module goes further, that
is great.
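For the common-word side, extract_significant might look roughly like the sketch below. The assumptions are mine: DEFAULT_COMMON_WORDS is a hypothetical and far too short starter list, the tokenizer only recognizes ASCII letters and digits, and "managing" the common words amounts to passing in a different set.

```python
import re

# Hypothetical starter set; a real list would be longer and user-editable.
DEFAULT_COMMON_WORDS = frozenset(
    {"the", "of", "and", "what", "a", "an", "in", "to", "is"}
)

# Assumption: words are runs of ASCII letters/digits, optionally joined
# by an internal apostrophe or hyphen ("don't", "full-text").
_WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def extract_significant(text, common_words=DEFAULT_COMMON_WORDS):
    """Return the distinct non-common words of text, lowercased,
    in order of first appearance."""
    seen = set()
    result = []
    for match in _WORD_RE.finditer(text):
        word = match.group().lower()
        if word in common_words or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result


# Managing the common set is just set arithmetic on the default.
my_common = DEFAULT_COMMON_WORDS | {"did"}
print(extract_significant("What did Noam Chomsky think of the mind?", my_common))
# ['noam', 'chomsky', 'think', 'mind']
```

Anything not in the common set is kept, so names like "Noam" and "Chomsky" are indexed without needing a dictionary entry.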
If there is no existing Python module for this, I would
be interested in any C package that could be adapted
toward this end. In this case, I would try to wrap the
C package as a Python module, and make it available for
other Python programmers.
If there is no C package, I'll consider anything that
can run on a Linux box.
If there is no package that does this, I'll go out
on the glacier and eat ice worms.
Thanks!
Russell