trying to strip out non ascii.. or rather convert non ascii

Ned Batchelder ned at nedbatchelder.com
Wed Oct 30 08:44:47 EDT 2013


On 10/30/13 4:49 AM, wxjmfauth at gmail.com wrote:
> Le mardi 29 octobre 2013 06:24:50 UTC+1, Steven D'Aprano a écrit :
>> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>>
>>> On 2013-10-28 07:01, wxjmfauth at gmail.com wrote:
>>>>> Simply ignoring diacritics won't get you very far.
>>>> Right. As an example, these four French words: cote, côte, coté, côté.
>>> Distinct words with distinct meanings, sure.
>>> But when a naïve (naive? ☺) person or one without the easy ability to
>>> enter characters with diacritics searches for "cote", I want to return
>>> possible matches containing any of your 4 examples.  It's slightly
>>> fuzzier if they search for "coté", in which case they may mean "coté" or
>>> they might be unable to figure out how to add a hat and want to
>>> type "côté". Though I'd rather get more results, even if it has some
>>> that only match fuzzily.
>>
>> The right solution to that is to treat it no differently from other fuzzy
>> searches. A good search engine should be tolerant of spelling errors and
>> alternative spellings for any letter, not just those with diacritics.
>> Ideally, a good search engine would successfully match all three of
>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>> of diacritics.
>>
> ------
>
> This is nonsense. The purpose of a diacritical mark is to
> make a letter a different letter. If a tool is supposed to
> match an ô, there is absolutely no reason to match something
> else.
>
> jmf
>

jmf, Tim Chase described his use case, and it seems reasonable to me.  
I'm not sure why you would describe it as nonsense.
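
For what it's worth, the kind of matching Tim describes can be sketched
in a few lines with the standard unicodedata module. This is only a
rough illustration, assuming Python 3.3+ (for str.casefold); the names
fold_diacritics and search are made up for this example, and it only
handles marks that decompose under NFD, so letters like "ø" or "ß" are
left alone.

```python
import unicodedata

def fold_diacritics(text):
    """Return text with combining marks removed, e.g. 'côté' -> 'cote'."""
    # NFD splits 'ô' into 'o' followed by a combining circumflex.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so those are dropped.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def search(query, words):
    """Return the words that equal the query once diacritics and case are ignored."""
    key = fold_diacritics(query).casefold()
    return [w for w in words if fold_diacritics(w).casefold() == key]

words = ["cote", "côte", "coté", "côté", "chat"]
print(search("cote", words))
```

Here a search for "cote" returns all four of jmf's words. The folding
is applied only to the comparison key, so the stored words keep their
diacritics and an exact match is still possible when the user wants one.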

--Ned.


