trying to strip out non ascii.. or rather convert non ascii

wxjmfauth at gmail.com
Fri Nov 1 05:00:52 EDT 2013


On Friday, 1 November 2013 08:16:36 UTC+1, Steven D'Aprano wrote:
> On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
>
> > On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
>
> >> I'm glad that you know so much better than Google, Bing, Yahoo, and
> >> other search engines. When I search for "mispealled" Google gives me:
> [...]
> > As far as I know, I recognized my mistake. I had more text processing
> > systems in mind, than search engines.
>
> Yes, you have, I acknowledge that now. I see now that at the time I made
> my response to you, you had already replied recognising your error.
> Unfortunately I had not seen that. So in that case, I withdraw my
> comments and apologize.
>
> > I can even tell you, I am really stupid. I wrote pure Unicode software
> > to sort French or German strings.
> >
> > Pure unicode == independent from any locale.
>
> Unfortunately it is not that simple. The same code point can have
> different meanings in different languages, and should be treated
> differently when sorting. The natural Unicode sort order (by code point)
> suits very few European languages, and not even English. A few examples:
>
> * Swedish ä is a distinct letter of the alphabet, appearing
>   after z: "a b c z ä" is sorted according to Swedish rules.
>   But in German ä is considered to be the letter 'a' plus an
>   umlaut, and is collated after 'a': "a ä b c z" is sorted
>   according to German rules.
>
> * In German ö is considered to be a variant of o, equivalent
>   to 'oe', while in Finnish ö is a distinct letter which
>   cannot be expanded to 'oe', and which appears at the end
>   of the alphabet.
>
> * Similarly, in modern English æ is a ligature of ae, while in
>   Danish and Norwegian it is a distinct letter of the alphabet
>   appearing after z: in English dictionaries, "Æsir" will be
>   found with other "A" words, often expanded to "Aesir", while
>   in Norwegian it will be found after "Z" words.
>
> * Most European languages convert uppercase I to lowercase i,
>   but Turkish has distinct letters for dotted and dotless I.
>   According to Turkish rules, lowercase(I) is ı and uppercase(i)
>   is İ.
>
> While it is true that the Unicode character set is independent of locale,
> for natural processing of characters, it isn't enough to just use Unicode.
>
> -- 
> Steven


I'm aware of all the points you gave. That's why
I wrote "French or German strings".

The hard part is not Unicode or the sorting itself;
it is building the key(s) used for sorting.

E.g. cote, côte, coté, côté. French publishers do not all
sort these words in the same way (diacritics).
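
To make the point about keys concrete, here is a minimal sketch of two
possible key functions for those four words. It is not the Unicode
Collation Algorithm and not any publisher's actual rule set; the names
split_marks, key_forward and key_backward are hypothetical, made up for
this illustration. The only assumption is that base letters are compared
first and diacritics only break ties, either left to right or right to left.

```python
import unicodedata


def split_marks(word):
    """Split a word into its base letters and the combining marks on each letter.

    Hypothetical helper: returns (base, marks), where marks[i] is a tuple of
    the code points of the combining marks attached to the i-th base character.
    """
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:  # ignore a stray combining mark at the very start
                marks[-1] += (ord(ch),)
        else:
            base.append(ch.casefold())
            marks.append(())
    return "".join(base), marks


def key_forward(word):
    """Base letters first, then diacritics compared left to right."""
    base, marks = split_marks(word)
    return (base, marks)


def key_backward(word):
    """Base letters first, then diacritics compared right to left."""
    base, marks = split_marks(word)
    return (base, marks[::-1])


# coté, côté, côte, cote (escapes keep the literals independent of the
# encoding of this message)
words = ["cot\u00e9", "c\u00f4t\u00e9", "c\u00f4te", "cote"]

print(sorted(words, key=key_forward))
print(sorted(words, key=key_backward))
```

With the forward key the result is cote, coté, côte, côté; with the
backward key it is cote, côte, coté, côté. Same strings, same Unicode,
two defensible orders: the whole difference lives in the key.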

jmf

PS: A *real* case to test the FSR (flexible string representation).



