trying to strip out non ascii.. or rather convert non ascii

Roy Smith roy at panix.com
Wed Oct 30 19:28:43 EDT 2013


In article <mailman.1821.1383156703.18130.python-list at python.org>,
 Michael Torrie <torriem at gmail.com> wrote:

> On 10/30/2013 10:08 AM, wxjmfauth at gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; a "ï" and an "i" are "as
> > different" as an "a" from a "z". A diacritical mark
> > is more than a simple ornamentation.
> 
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï" and "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter.  A fuzzy search treats them all the same.

That's one definition of fuzzy.  But there's nothing that says you 
can't build a fuzzy matching algorithm which considers some mismatches 
to be worse than others.

For example, it's reasonable to consider any vowel (or string of vowels, 
for that matter) to be closer to another vowel than to a consonant.  A 
great example is the word "bureaucrat".  As far as I'm concerned, it's 
spelled {b, vowels, r, vowels, c, r, a, t}.  It usually takes me three 
or four tries to get auto-correct to even recognize what I'm trying to 
type and fix it for me.
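
To make the "vowels" idea concrete, here is a minimal sketch that collapses 
every run of vowels into a single placeholder, so that words differing only 
in their vowel runs get the same key.  The function name vowel_skeleton and 
the placeholder "V" are made up for illustration, and treating only a, e, i, 
o, u as vowels is a simplifying assumption.

```python
import re

def vowel_skeleton(word):
    """Replace each run of ASCII vowels with a single 'V' placeholder.

    Hypothetical helper: only a, e, i, o, u count as vowels here.
    """
    return re.sub(r"[aeiou]+", "V", word.lower())

# Misspellings that only scramble the vowel runs share one key.
print(vowel_skeleton("bureaucrat"))   # bVrVcrVt
print(vowel_skeleton("buerocrat"))    # bVrVcrVt
```

A spelling corrector could look up candidates by this key first, and only 
then rank them with a finer-grained distance.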

Likewise for pairs like {c, s}, {j, g}, {v, w}, and so on.

In that spirit, I would think that a, á, and â would all be considered 
more conservative replacements for each other than they would be for k, 
x, or z.
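
One way to express "some mismatches are worse than others" is an edit 
distance whose substitution cost depends on the two letters.  The sketch 
below is only an illustration under stated assumptions: the cost values 
(0.1, 0.5, 1.0), the names base_letter, substitution_cost and 
weighted_distance, and the list of similar pairs are all invented here.  
The only library calls used are unicodedata.normalize and 
unicodedata.combining from the standard library.

```python
import unicodedata

VOWELS = set("aeiou")
# Assumed "close" consonant pairs, taken from the examples above.
SIMILAR_PAIRS = [{"c", "s"}, {"j", "g"}, {"v", "w"}]

def base_letter(ch):
    """Lowercase ch and drop any combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower() or ch.lower()

def substitution_cost(a, b):
    """Cost of replacing a with b; the numbers are arbitrary assumptions."""
    if a == b:
        return 0.0
    base_a, base_b = base_letter(a), base_letter(b)
    if base_a == base_b:
        return 0.1          # differ only by diacritic or case
    if base_a in VOWELS and base_b in VOWELS:
        return 0.5          # one vowel for another
    if {base_a, base_b} in SIMILAR_PAIRS:
        return 0.5          # commonly confused consonants
    return 1.0              # unrelated letters

def weighted_distance(s, t):
    """Levenshtein-style distance using substitution_cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute
            ))
        prev = curr
    return prev[-1]
```

With these weights, weighted_distance("naive", "naïve") comes out far 
smaller than weighted_distance("cat", "kat"), which is the ordering argued 
for above.  The exact numbers would need tuning for any real search.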


