regex-strategy for finding *similar* words?

John Machin sjmachin at lexicon.net
Thu Nov 18 21:13:15 EST 2004


Peter Maas <peter at somewhere.com> wrote in message news:<cni5mo$d12$1 at swifty.westend.com>...
> Christoph Pingel schrieb:
> > Hi all,
> > 
> > an interesting problem for regex nerds.
> > I've got a thesaurus of some hundred words and a moderately large 
> > dataset of about 1 million words in some thousand small texts. Words 
> > from the thesaurus appear at many places in my texts, but they are often 
> > misspelled, just slightly different from the thesaurus.
> 
> You could set up a list of misspelling cases, scan a word for it e.g.
> citti and turn it into a regex by applying suitable misspelling cases
> But this is cumbersome. It is probably better to use a string distance
> defined by the least number of operations (add,delete, replace, exchange)
> to map one string onto another.
> 
> Search for '"Levenshtein distance" python' and find e.g.
> 
> http://trific.ath.cx/resources/python/levenshtein/

Forget regexes. A string distance function is a helpful start. Firstly
you need a method of parsing your input text into words or phrases.
Secondly you need a fuzzy dictionary into which you load your
thesaurus. This fuzzy dictionary will have a method like
fuzzy_dict.approx_lookup('mispeled', 2) which would return you a list
of words in the dictionary that had a (say) Levenshtein distance of 2
or less from your input word. Algorithms for fuzzy dictionaries:
google for "Burkhard Keller tree" and "permuted lexicon".
Alternatively you could look for a spelling checker with publicly
available source and grab some ideas or even code. You might be
interested in other string distance measures than Levenshtein distance
e.g. the so-called Damerau distance which counts transpositions. There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.

HTH,
John



More information about the Python-list mailing list