regex-strategy for finding *similar* words?

Peter Maas peter at somewhere.com
Thu Nov 18 07:46:21 EST 2004


Christoph Pingel wrote:
> Hi all,
> 
> an interesting problem for regex nerds.
> I've got a thesaurus of some hundred words and a moderately large 
> dataset of about 1 million words in some thousand small texts. Words 
> from the thesaurus appear at many places in my texts, but they are often 
> misspelled, just slightly different from the thesaurus.

You could set up a list of misspelling cases, scan a word (e.g. citti)
for them, and turn it into a regex by applying the suitable cases.
But this is cumbersome. It is probably better to use a string distance,
defined as the least number of operations (add, delete, replace, exchange)
needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/
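
To make the string-distance idea concrete, here is a minimal pure-Python
sketch. It is not the code from the page linked above. The names
edit_distance and closest_entry, the max_distance cutoff of 2, and the
sample word list are all assumptions made for illustration. The distance
counts add, delete, replace and exchange of two adjacent characters (the
restricted "optimal string alignment" variant).

```python
def edit_distance(a, b):
    """Least number of add/delete/replace/exchange steps turning a into b.

    Restricted (optimal string alignment) variant: an exchange swaps two
    adjacent characters, and no substring is edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j          # add all j characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # add
                          d[i - 1][j - 1] + cost)  # replace (or match)
            # exchange of two adjacent characters, e.g. "form" <-> "from"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_entry(word, thesaurus, max_distance=2):
    """Return the thesaurus entry nearest to word, or None if none is close.

    max_distance is an assumed cutoff; tune it to the data.
    """
    if not thesaurus:
        return None
    best = min(thesaurus, key=lambda entry: edit_distance(word, entry))
    if edit_distance(word, best) <= max_distance:
        return best
    return None
```

For example, closest_entry("citti", ["city", "village", "town"]) picks
"city". This compares every text word against every thesaurus entry, so
for about a million words it may be worth caching results per distinct
word or switching to a C implementation.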

-- 
-------------------------------------------------------------------
Peter Maas,  M+R Infosysteme,  D-52070 Aachen,  Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
-------------------------------------------------------------------


