trying to strip out non ascii.. or rather convert non ascii

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Nov 1 03:16:36 EDT 2013


On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:

> Le jeudi 31 octobre 2013 08:10:18 UTC+1, Steven D'Aprano a écrit :

>> I'm glad that you know so much better than Google, Bing, Yahoo, and
>> other search engines. When I search for "mispealled" Google gives me:
[...]
> As far as I know, I recognized my mistake. I had more text processing
> systems in mind, than search engines.

Yes, you have, I acknowledge that now. I see now that at the time I made 
my response to you, you had already replied recognising your error. 
Unfortunately I had not seen that. So in that case, I withdraw my 
comments and apologize.


> I can even tell you, I am really stupid. I wrote pure Unicode software
> to sort French or German strings.
> 
> Pure unicode == independent from any locale.

Unfortunately it is not that simple. The same code point can have 
different meanings in different languages, and should be treated 
differently when sorting. Sorting by raw code point matches the rules of 
very few European languages, and not even those of English. A few examples:

* Swedish ä is a distinct letter of the alphabet, appearing 
  after z: "a b c z ä" is sorted according to Swedish rules.
  But in German ä is considered to be the letter 'a' plus an
  umlaut, and is collated after 'a': "a ä b c z" is sorted 
  according to German rules.

* In German ö is considered to be a variant of o, equivalent
  to 'oe', while in Finnish ö is a distinct letter which 
  cannot be expanded to 'oe', and which appears at the end
  of the alphabet.

* Similarly, in modern English æ is a ligature of ae, while in
  Danish and Norwegian it is a distinct letter of the alphabet
  appearing after z: in English dictionaries, "Æsir" will be 
  found with other "A" words, often expanded to "Aesir", while
  in Norwegian it will be found after "Z" words.

* Most European languages convert uppercase I to lowercase i, 
  but Turkish has distinct letters for dotted and dotless I. 
  According to Turkish rules, lowercase(I) is ı and uppercase(i)
  is İ.
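
To make the sorting examples concrete, here is a minimal sketch. The helper
german_like_key is a hypothetical name of my own, and it only approximates
the German rule by stripping combining marks; it is not a real collation
algorithm. Plain sorted() compares code points, which happens to give the
Swedish-looking order for these five letters.

```python
import unicodedata

letters = ["z", "\u00e4", "c", "a", "b"]  # "\u00e4" is ä

# Plain sorted() compares code points, so ä (U+00E4) lands after z (U+007A).
by_code_point = sorted(letters)


def german_like_key(s):
    """Hypothetical helper: treat ä as 'a plus an umlaut' for sorting.

    Decompose to NFD, drop combining marks to get the base letters, and
    use the original string as a tie-breaker so 'a' sorts before 'ä'.
    This is a crude approximation, not real German collation.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, s)


german_like = sorted(letters, key=german_like_key)

print(" ".join(by_code_point))  # a b c z ä
print(" ".join(german_like))    # a ä b c z
```

For real work you would more likely reach for locale.strxfrm with a suitable
locale installed, or a library that implements the Unicode Collation
Algorithm, rather than a hand-written key like this one.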


While it is true that the Unicode character set is independent of locale, 
Unicode alone is not enough to process text the way speakers of a given 
language expect.
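
The Turkish I shows this in Python itself. The str methods apply Unicode's
default, locale-independent case mappings, so they follow the majority rule
rather than the Turkish one. The two turkish_* functions below are
hypothetical helpers of my own, a sketch that handles only the four I
letters and nothing else about Turkish.

```python
# Default Unicode case mappings, as used by str.lower() and str.upper().
default_lower_I = "I".lower()       # 'i', not dotless ı
default_upper_i = "i".upper()       # 'I', not dotted İ
dotless_upper = "\u0131".upper()    # ı -> 'I'
dotted_lower = "\u0130".lower()     # İ -> 'i' followed by U+0307 (combining dot)


def turkish_lower(s):
    """Hypothetical helper: lowercase using the Turkish I rules."""
    # Map the two capital I letters first, then lowercase everything else.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


def turkish_upper(s):
    """Hypothetical helper: uppercase using the Turkish I rules."""
    # Map the two small i letters first, then uppercase everything else.
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


print(turkish_lower("I"))  # ı
print(turkish_upper("i"))  # İ
```

Note that nothing in the string itself says which rule applies; the caller
has to know the language of the text.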


-- 
Steven


