[Python-Dev] Internationalization Toolkit

M.-A. Lemburg mal@lemburg.com
Fri, 12 Nov 1999 14:01:28 +0100


Fredrik Lundh wrote:
> 
> Mike wrote:
> > Surely using a different type on different platforms means that we throw
> > away the concept of a platform independent Unicode string?
> > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
> 
> so?  the interchange format doesn't have to be
> the same as the internal format, does it?

The interchange format (marshal + pickle) is defined as UTF-8,
so there's no problem with endianness or missing bits when
shipping Unicode data from one platform to another.
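
A minimal sketch of why UTF-8 sidesteps the endianness question. It uses
today's Python str/bytes API rather than anything from the proposal, and
the sample string and the helper name roundtrip_utf8 are made up for
illustration:

```python
# Sample text with non-ASCII characters (illustrative only).
TEXT = "Gr\u00fc\u00dfe \u20ac"


def roundtrip_utf8(text):
    """Encode to UTF-8 and decode again, as an interchange step would."""
    return text.encode("utf-8").decode("utf-8")


# UTF-16 has two byte-order variants; the same text yields different bytes.
utf16_le = TEXT.encode("utf-16-le")
utf16_be = TEXT.encode("utf-16-be")

# UTF-8 is a byte-oriented encoding, so there is only one serialization
# and nothing for the receiving platform to swap.
utf8 = TEXT.encode("utf-8")

print(utf16_le != utf16_be)          # the two UTF-16 forms differ
print(roundtrip_utf8(TEXT) == TEXT)  # the UTF-8 round trip is lossless
```

The internal storage width (16 or 32 bits) never shows up in the UTF-8
bytes, which is the point being made about marshal and pickle.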
 
> > Does this mean that to transfer a file between a Windows box and Solaris, an
> > implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> > versa)?  What about byte ordering issues?
> 
> no problem at all: unicode has special byte order
> marks for this purpose (and utf-8 doesn't care, of
> course).

Access to this mark will go into sys: sys.bom.
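
A rough sketch of how a byte order mark gets used on the reading side.
sys.bom is only the name proposed here; the example instead relies on
codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE from today's standard
library. The helper names and the big-endian default for unmarked data
are assumptions made for illustration:

```python
import codecs


def detect_utf16_order(data):
    """Return 'little' or 'big' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return None


def decode_utf16(data, default_order="big"):
    """Decode UTF-16 bytes, honouring a leading BOM when one is present."""
    order = detect_utf16_order(data)
    if order is None:
        # No mark: fall back to the caller's assumption about byte order.
        order = default_order
        payload = data
    else:
        # Strip the two BOM bytes before decoding the payload.
        payload = data[2:]
    codec = "utf-16-le" if order == "little" else "utf-16-be"
    return payload.decode(codec)


# A writer on either kind of machine can emit its native order plus the mark.
from_little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
from_big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(decode_utf16(from_little) == decode_utf16(from_big))
```

The reader never needs to know which platform wrote the data, only to
look at the first two bytes.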
 
> > Or do you mean whatever 16 bit data type is available on the platform, with
> > a standard (platform independent) byte ordering maintained?
> 
> well, my preference is a 16-bit data type in the plat-
> form's native byte order (exactly how it's done in the
> unicode module -- for the moment, it can use the
> platform's wchar_t, but only if it happens to be a
> 16-bit unsigned type).  gives you good performance,
> compact storage, and cleanest possible code.

The 0.4 proposal fixes this to a 16-bit unsigned short
holding UTF-16 data, with checks for surrogates. This covers
all defined standard Unicode code points, is fast, and so on.
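
To make "checks for surrogates" concrete, here is a small sketch of the
standard UTF-16 surrogate arithmetic over lists of 16-bit code units. It
is not the proposal's C implementation; the function names are
hypothetical and the error handling is one possible choice:

```python
def to_utf16_units(cp):
    """Split one Unicode code point into one or two 16-bit code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if cp < 0x10000:
        return [cp]
    # Above the 16-bit range: spread the 20 remaining bits over a pair.
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def from_utf16_units(units):
    """Combine 16-bit code units into code points, rejecting bad pairs."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: a low surrogate must follow immediately.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            out.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(unit)
            i += 1
    return out


print([hex(u) for u in to_utf16_units(0x10000)])  # ['0xd800', '0xdc00']
```

Anything up to U+FFFF stays a single unit, so the common case costs one
range comparison per unit.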

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/