trying to strip out non ascii.. or rather convert non ascii

Tim Chase python.list at tim.thechases.com
Thu Oct 31 07:46:24 EDT 2013


On 2013-10-30 19:28, Roy Smith wrote:
> For example, it's reasonable to consider any vowel (or string of
> vowels, for that matter) to be closer to another vowel than to a
> consonant.  A great example is the word, "bureaucrat".  As far as
> I'm concerned, it's spelled {b, vowels, r, vowels, c, r, a, t}.  It
> usually takes me three or four tries to get auto-correct to even
> recognize what I'm trying to type and fix it for me.

[glad I'm not the only one who has trouble spelling "bureaucrat"]

Steven D'Aprano wisely mentioned elsewhere in the thread that "The
right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors
and alternative spellings for any letter, not just those with
diacritics."

Often the Levenshtein distance is used for calculating closeness, and
the off-the-shelf algorithm assigns a cost of one per difference
(insertion, substitution, or deletion).  It doesn't sound like it
would be that hard[1] to assign varying costs based on which character
was inserted/substituted/deleted.  A diacritic-only difference might
have a cost of N, a shift to a similar character (vowel->vowel,
consonant->consonant, or consonant-cluster shift) a cost of 2N, and a
totally arbitrary character shift a cost of 3N (or higher).
Unfortunately, the Levenshtein algorithm is already O(M*N) slow and
can't be meaningfully precalculated without knowing both strings, so
this just ends up heaping additional lookups/comparisons atop
already-slow code.
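
A minimal sketch of what I mean (the particular cost schedule, the
vowel set, and the diacritic test via Unicode NFD decomposition are
all arbitrary choices here, just to illustrate the idea):

```python
import unicodedata

VOWELS = set("aeiou")

def strip_diacritics(c):
    # NFD decomposition splits "e + acute" off of an accented "e",
    # so the first code point is the base character.
    return unicodedata.normalize("NFD", c)[0]

def sub_cost(a, b, n=1):
    # Illustrative schedule: N for a diacritic-only difference,
    # 2N for vowel->vowel, 3N for an arbitrary change.
    if a == b:
        return 0
    if strip_diacritics(a) == strip_diacritics(b):
        return n
    if a in VOWELS and b in VOWELS:
        return 2 * n
    return 3 * n

def weighted_levenshtein(s, t, n=1):
    # The standard O(len(s)*len(t)) dynamic program, but using
    # sub_cost() instead of a flat cost of 1 for substitutions;
    # insertions and deletions get the "arbitrary" cost of 3N.
    prev = [j * 3 * n for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * 3 * n]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 3 * n,                  # delete a
                cur[j - 1] + 3 * n,               # insert b
                prev[j - 1] + sub_cost(a, b, n),  # substitute a->b
            ))
        prev = cur
    return prev[-1]
```

With this, "resume" ends up much closer to the accented spelling
(two diacritic-only differences, cost 2) than an equal number of
arbitrary character changes would be (cost 6), which is the sort of
ranking a fuzzy search wants.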

-tkc

[1]
http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications
