regex-strategy for finding *similar* words?

Thomas Guettler guettli at thomas-guettler.de
Thu Nov 18 10:35:17 EST 2004


On Thu, 18 Nov 2004 13:20:08 +0100, Christoph Pingel wrote:

> Hi all,
> 
> an interesting problem for regex nerds.
> I've got a thesaurus of some hundred words and a moderately large 
> dataset of about 1 million words in some thousand small texts. Words 
> from the thesaurus appear at many places in my texts, but they are 
> often misspelled, just slightly different from the thesaurus.

Hi,

You can write a function that takes a single word
and returns a normalized version.

normalize("...ies") --> "...y"

normalize("running") --> "run"
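
A minimal sketch of such a function, assuming a small hand-written
list of suffix rules. The rule list and the length cutoff are
illustrative assumptions, not a complete stemmer, and they will
over-strip some words:

```python
import re

# Hypothetical suffix rules, checked in order; the first match wins.
_SUFFIX_RULES = [
    (re.compile(r"ies$"), "y"),                          # "parties" -> "party"
    (re.compile(r"([b-df-hj-np-tv-z])\1ing$"), r"\1"),   # "running" -> "run"
    (re.compile(r"ing$"), ""),                           # "walking" -> "walk"
]


def normalize(word):
    """Return a crude normalized form of a single word."""
    word = word.lower()
    # Leave short words alone so that "sing" does not become "s".
    if len(word) <= 4:
        return word
    for pattern, replacement in _SUFFIX_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word
```

Rules are tried in order, so the doubled-consonant rule has to come
before the plain "ing" rule.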

Build a big dictionary which maps each word
to the list of files in which it occurs. Only
add normalized words to the dictionary (or database).
To look up a thesaurus word, normalize it the same way first.

bigdict={"foo": ["file1.txt", "file2.txt", ...]}
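
One way the dictionary could be built, as a rough sketch. It assumes
the texts are already loaded into a dict of {filename: text}; the
names build_index and texts are made up for this example, and
str.lower stands in for a real normalize function:

```python
import re
from collections import defaultdict


def build_index(texts, normalize=str.lower):
    """Map each normalized word to the sorted list of files containing it.

    texts     -- dict of {filename: text} (assumed to be loaded already)
    normalize -- function taking one word and returning its normalized form
    """
    index = defaultdict(set)
    for filename, text in texts.items():
        # Runs of letters only; digits, underscores and punctuation split words.
        for word in re.findall(r"[^\W\d_]+", text):
            index[normalize(word)].add(filename)
    # Convert the sets to sorted lists to match the bigdict layout above.
    return {word: sorted(files) for word, files in index.items()}


texts = {"file1.txt": "Foo bar", "file2.txt": "foo baz"}
bigdict = build_index(texts)
print(bigdict["foo"])
```

Collecting filenames in a set first keeps each file from being listed
more than once per word.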

HTH,
 Thomas




