trying to strip out non ascii.. or rather convert non ascii

Steven D'Aprano steve at pearwood.info
Tue Oct 29 01:24:50 EDT 2013


On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:

> On 2013-10-28 07:01, wxjmfauth at gmail.com wrote:
>>> Simply ignoring diacritics won't get you very far.
>> 
>> Right. As an example, these four French words: cote, côte, coté,
>> côté.
> 
> Distinct words with distinct meanings, sure.
> 
> But when a naïve (naive? ☺) person or one without the easy ability to
> enter characters with diacritics searches for "cote", I want to return
> possible matches containing any of your 4 examples.  It's slightly
> fuzzier if they search for "coté", in which case they may mean "coté" or
> they might be unable to figure out how to add a hat and want to
> type "côté". Though I'd rather get more results, even if they include
> some that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy 
searches. A good search engine should be tolerant of spelling errors and 
alternative spellings for any letter, not just those with diacritics. 
Ideally, a good search engine would successfully match all three of 
"naïve", "naive" and "niave", and it shouldn't rely on special handling 
of diacritics.
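
As a rough illustration of that idea, here is a minimal sketch using the
standard library's difflib. The function name fuzzy_search, the word list
and the 0.6 cutoff are assumptions chosen for the example, not anything a
real search engine is known to use. The point is only that a generic
similarity measure treats "ï" versus "i" like any other one-character
difference, with no diacritic-specific code.

```python
import difflib


def fuzzy_search(query, candidates, cutoff=0.6):
    """Return the candidates whose similarity to query is at least cutoff.

    Hypothetical helper for illustration: it relies on difflib's generic
    sequence similarity and does nothing special for diacritics.
    """
    # n=len(candidates) lifts difflib's default limit of three results.
    return difflib.get_close_matches(
        query, candidates, n=len(candidates), cutoff=cutoff
    )


# Made-up sample data: three spellings of one word plus an unrelated word.
words = ["naïve", "naive", "niave", "python"]
results = fuzzy_search("naive", words)
print(results)
```

With this sample data all three spellings come back and "python" does not.
A looser cutoff returns more (and fuzzier) matches, which is the trade-off
Tim describes; a production search engine would more likely use an index
built for approximate matching than a linear scan like this one.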



-- 
Steven


