regex-strategy for finding *similar* words?

Christoph Pingel ch.pingel at web.de
Thu Nov 18 07:20:08 EST 2004


Hi all,

Here's an interesting problem for regex nerds.
I've got a thesaurus of a few hundred words and a moderately large 
dataset of about 1 million words in a few thousand small texts. Words 
from the thesaurus appear in many places in my texts, but they are 
often misspelled, differing slightly from the thesaurus spelling.

Now I'm looking for the best strategy to match the occurrences of my 
thesaurus items in the texts. Do I have to build patterns for the 
expected misspellings of every thesaurus item, or is a more 
general approach possible? I tried appending '?.?' to each letter of a 
thesaurus item, but this is much too weak: I get a lot of false 
positives because the resulting expression matches almost any 
substring. Adding word boundaries helps a little, but sometimes a 
concept spans two words, and these can be separated by a space 
*or* a dash. In that case, I'm back to matching almost everything. 
Any ideas?
BTW, efficiency is not absolutely required, it's not meant to work 
for realtime requests.
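
The '?.?' attempt described above can be sketched roughly as follows. 
This is only an illustration under assumptions: the helper name 
loose_pattern and the sample words are made up, and the pattern 
actually used may have differed in detail.

```python
import re

def loose_pattern(term):
    # Hypothetical helper: follow every letter with '?.?', so the letter
    # becomes optional and may be followed by one arbitrary character.
    return "".join(re.escape(ch) + "?.?" for ch in term)

pattern = loose_pattern("house")
print(pattern)  # h?.?o?.?u?.?s?.?e?.?

# Intended hit: a slightly misspelled form of the thesaurus item.
print(bool(re.fullmatch(pattern, "hause")))  # True

# False positive: an unrelated word matches as well, because every
# '.?' can swallow one arbitrary character.
print(bool(re.fullmatch(pattern, "table")))  # True

# Every part of the pattern is optional, so even the empty string matches.
print(bool(re.fullmatch(pattern, "")))       # True
```

Since each letter is optional and each '.?' accepts any character, 
a pattern built from an n-letter item accepts every string of up to 
n characters, which is why word boundaries alone do not help much.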

TIA,
best regards,
Christoph


