wide strings vs. Unicode point of view (was Re: [I18n-sig] Unicode st.... alternative)

Peter Funk pf@artcom-gmbh.de
Fri, 5 May 2000 15:13:05 +0200 (MEST)


Just van Rossum:
> Exactly. By saying "(wide) strings are not tied to Unicode" the question
> whether wide strings should or should not be sorted according to the
> Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's
> too hard anyway"...

I personally like the idea of speaking of "wide strings" containing wide
character codes instead of Unicode objects.

Unfortunately, many methods need to interpret the content of strings
according to some knowledge of the encoding: for example, 'upper()',
'lower()', 'swapcase()', 'lstrip()' and so on need to know which
class certain characters belong to.
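
To make "class of a character" concrete, here is a minimal sketch in
present-day Python 3 (not the 1.6 code base discussed here).  It uses the
standard 'unicodedata' module; the helper name 'char_class' and the sample
list are only illustrative assumptions.

```python
import unicodedata

def char_class(ch):
    """Return the Unicode general category of one character, e.g. 'Lu' or 'Ll'."""
    return unicodedata.category(ch)

# a, A, a-umlaut, A-umlaut, German sharp s, space, one Han ideograph
samples = ["a", "A", "\u00e4", "\u00c4", "\u00df", " ", "\u4e2d"]

classes = {ch: char_class(ch) for ch in samples}
for ch, cat in classes.items():
    # Print code point, category and official name (ASCII-only output).
    print("U+%04X" % ord(ch), cat, unicodedata.name(ch))
```

Methods like 'upper()' or 'lstrip()' effectively consult this kind of table:
'Lu'/'Ll' mark cased letters, 'Zs' marks a space separator, and 'Lo' marks
letters that have no case at all.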

This problem was already somewhat visible in 1.5.2, since these methods
were available as library functions from the string module and relied
on global state maintained by the 'setlocale()' C library function.
Quoting from the C library man pages:

"""    The details of what constitutes an uppercase or  lowercase
       letter  depend  on  the  current locale.  For example, the
       default "C" locale does not know about umlauts, so no
       conversion is done for them.

       In some non-English locales, there are lowercase letters
       with no corresponding  uppercase  equivalent;  the  German
       sharp s is one example.
"""

I guess applying 'upper' to a Chinese character will not make much sense.
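
Both corner cases can be tried out directly.  The following is only a
sketch against a present-day Python 3 interpreter, whose 'str.upper()'
follows the Unicode case mappings rather than the C locale; it is not
meant to describe the behaviour of the 1.5.2 string module.

```python
sharp_s = "\u00df"   # German sharp s, lowercase only
han = "\u4e2d"       # a Han ideograph, which has no case at all

# The sharp s has no single-character uppercase form; Python 3 maps it to "SS".
sharp_s_upper = sharp_s.upper()
print(sharp_s_upper, len(sharp_s_upper))

# Case conversion leaves the ideograph untouched ...
han_unchanged = han.upper() == han and han.lower() == han
# ... and it counts as neither upper nor lower case.
han_is_cased = han.isupper() or han.islower()
print(han_unchanged, han_is_cased)
```

So 'upper' on such a character is harmless but, as guessed, not very
meaningful.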

Now these former string module functions have been moved into the Python
object core, so the current Python string and Unicode object API is
somewhat "western centric".  ;-) At least Marc's implementation in
'unicodectype.c' contains the hard-coded assumption that wide strings
really contain Unicode characters.  For example,
    print u"äöü".upper().encode("latin1")
shows "ÄÖÜ" independent of the locale setting.  This makes sense.
The output from  print u"äöü".upper().encode()  however looks ugly
here on my screen... UTF-8 ... blech:Ã ÃÃ
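
The ugly output is what you get when UTF-8 bytes are shown on a Latin-1
terminal.  A small sketch in present-day Python 3 syntax (where 'encode()'
without arguments defaults to UTF-8) makes the byte-level difference
visible; the variable names are just for illustration.

```python
text = "\u00e4\u00f6\u00fc"        # a-umlaut, o-umlaut, u-umlaut
upper = text.upper()               # uppercased via Unicode tables, no locale involved

latin1_bytes = upper.encode("latin-1")   # one byte per character
utf8_bytes = upper.encode()              # default UTF-8: two bytes per character

# Simulate a Latin-1 terminal interpreting the UTF-8 bytes one by one.
misread = utf8_bytes.decode("latin-1")

print(latin1_bytes)      # b'\xc4\xd6\xdc'
print(utf8_bytes)        # b'\xc3\x84\xc3\x96\xc3\x9c'
print(ascii(misread))    # '\xc3\x84\xc3\x96\xc3\x9c'
```

Each umlaut becomes a 0xC3 byte (displayed as "Ã" in Latin-1) followed by a
byte from the C1 control range, which most terminals do not render.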

Regards and have a nice weekend, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)