trying to strip out non ascii.. or rather convert non ascii

Roy Smith roy at panix.com
Sat Oct 26 21:54:27 EDT 2013


In article <mailman.1628.1382838024.18130.python-list at python.org>,
 Tim Chase <python.list at tim.thechases.com> wrote:

> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

The problem with putting fuzzy matching in the core language is that 
there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching.  One 
popular one is jellyfish (https://pypi.python.org/pypi/jellyfish/0.1.2).  
Don't expect to just download it and use it right out of the box, 
however. You'll need to think a little about which of its several 
algorithms makes sense for your application.
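
To make the "which algorithm?" point concrete, here is a rough sketch that 
uses only the standard library, not jellyfish itself. The helper names 
levenshtein() and ratio() are made up for illustration; the first is a 
plain edit distance, the second wraps difflib.SequenceMatcher. The two 
measures do not agree on how close a pair of spellings is:

```python
import difflib

def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the number of
    # single-character insertions, deletions, or substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute
        prev = cur
    return prev[-1]

def ratio(a, b):
    # difflib's similarity score, from 0.0 to 1.0, based on matching blocks.
    return difflib.SequenceMatcher(None, a, b).ratio()

for other in ("Smyth", "Smit"):
    print(other, levenshtein("Smith", other), ratio("Smith", other))
```

Edit distance calls "Smyth" and "Smit" equally far from "Smith" (one edit 
each), while the difflib ratio scores "Smit" as the closer of the two.  
Neither answer is wrong; they just measure different things.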

So, for example, you probably expect U+004E (Latin Capital Letter N) to 
match U+006E (Latin Small Letter N).  But what about these (all cribbed 
from Wikipedia):

U+00D1   Ñ   Latin Capital Letter N with tilde
U+00F1   ñ   Latin Small Letter N with tilde
U+0143   Ń   Latin Capital Letter N with acute
U+0144   ń   Latin Small Letter N with acute
U+0145   Ņ   Latin Capital Letter N with cedilla
U+0146   ņ   Latin Small Letter N with cedilla
U+0147   Ň   Latin Capital Letter N with caron
U+0148   ň   Latin Small Letter N with caron
U+0149   ŉ   Latin Small Letter N preceded by apostrophe
U+014A   Ŋ   Latin Capital Letter Eng
U+014B   ŋ   Latin Small Letter Eng
U+019D   Ɲ   Latin Capital Letter N with left hook
U+019E   ƞ   Latin Small Letter N with long right leg
U+01CA   Ǌ   Latin Capital Letter NJ
U+01CB   ǋ   Latin Capital Letter N with Small Letter J
U+01CC   ǌ   Latin Small Letter NJ
U+0235   ȵ   Latin Small Letter N with curl

I can't even begin to guess if they should match for your application.
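
For what it's worth, here is one possible "sloppy" key, sketched with the 
standard unicodedata module.  The name sloppy_key() is made up for this 
example, and the recipe (NFKD decomposition, drop combining marks, 
casefold) is just one choice among many, not the right answer.  It 
assumes Python 3.3 or later for str.casefold():

```python
import unicodedata

def sloppy_key(s):
    # Hypothetical helper: decompose, drop combining marks, fold case.
    decomposed = unicodedata.normalize("NFKD", s)
    base = "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))
    return base.casefold()

# The code points from the list above.
code_points = [0x00D1, 0x00F1, 0x0143, 0x0144, 0x0145, 0x0146,
               0x0147, 0x0148, 0x0149, 0x014A, 0x014B, 0x019D,
               0x019E, 0x01CA, 0x01CB, 0x01CC, 0x0235]

for cp in code_points:
    key = sloppy_key(chr(cp))
    print("U+%04X  key=%r  matches 'n': %s" % (cp, key, key == "n"))
```

With this recipe the tilde, acute, cedilla and caron variants all collapse 
to "n".  U+0149 keeps its apostrophe (U+02BC is a modifier letter, not a 
combining mark), the NJ digraphs come out as "nj", and Eng, left hook, 
long right leg and curl have no decomposition at all, so they never match.  
Whether any of that is what you want is still your call.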


