[Python-Dev] Internationalization Toolkit
M.-A. Lemburg
mal@lemburg.com
Fri, 12 Nov 1999 14:01:28 +0100
Fredrik Lundh wrote:
>
> Mike wrote:
> > Surely using a different type on different platforms means that we throw
> > away the concept of a platform independent Unicode string?
> > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
>
> so? the interchange format doesn't have to be
> the same as the internal format, does it?
The interchange format (marshal + pickle) is defined as UTF-8,
so there's no problem with endianness or missing bits with respect
to shipping Unicode data from one platform to another.
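
A minimal sketch of why UTF-8 sidesteps the byte-order question. It
uses the str/bytes codecs of present-day Python, which are an assumption
here and not part of the proposal itself; the sample string and the
helper swap_pairs are purely illustrative.

```python
# UTF-8 is byte-oriented: its serialized form does not depend on the
# byte order of the machine that produced it.
text = "Gr\u00fc\u00dfe \u20ac"   # illustrative sample, all 16-bit code points

utf8 = text.encode("utf-8")

# UTF-16, by contrast, has two serializations that differ in byte order.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")


def swap_pairs(data):
    """Reverse the byte order of every 16-bit unit in data."""
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


# Decoding the UTF-8 bytes needs no byte-order information at all.
round_trip = utf8.decode("utf-8")
```

The two UTF-16 forms are byte-swapped copies of each other, so a reader
must know (or detect) the order, while the UTF-8 bytes decode the same
way everywhere.
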
> > Does this mean that to transfer a file between a Windows box and Solaris, an
> > implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> > versa)? What about byte ordering issues?
>
> no problem at all: unicode has special byte order
> marks for this purpose (and utf-8 doesn't care, of
> course).
Access to this mark will go into the sys module, as sys.bom.
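
A rough sketch of how a byte order mark might be used when reading
UTF-16 data. sys.bom is only the name given in the proposal; the sketch
relies on codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE and sys.byteorder from
today's standard library instead, and the function names are
hypothetical. It handles UTF-16 only (a UTF-32 mark would be misread).

```python
import codecs
import sys

# The mark as this platform would write it in its native byte order.
native_bom = (codecs.BOM_UTF16_LE if sys.byteorder == "little"
              else codecs.BOM_UTF16_BE)


def detect_utf16_order(data):
    """Return "little", "big", or None if data carries no UTF-16 mark."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return None


def decode_utf16_with_bom(data):
    """Decode UTF-16 bytes, using the leading mark to pick the byte order."""
    order = detect_utf16_order(data)
    if order is None:
        raise ValueError("no UTF-16 byte order mark found")
    codec = "utf-16-le" if order == "little" else "utf-16-be"
    # Skip the two-byte mark before decoding the payload.
    return data[2:].decode(codec)
```
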
> > Or do you mean whatever 16 bit data type is available on the platform, with
> > a standard (platform independent) byte ordering maintained?
>
> well, my preference is a 16-bit data type in the plat-
> form's native byte order (exactly how it's done in the
> unicode module -- for the moment, it can use the
> platform's wchar_t, but only if it happens to be a
> 16-bit unsigned type). gives you good performance,
> compact storage, and cleanest possible code.
The 0.4 proposal fixes this to a 16-bit unsigned short
using the UTF-16 encoding, with checks for surrogates. This covers
all defined standard Unicode code points, is fast, and so on.
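
To illustrate what "checks for surrogates" could mean in practice, here
is a small sketch that walks a sequence of 16-bit code units and
combines surrogate pairs into code points. The function names are
hypothetical and this is not the implementation from the proposal; only
the surrogate ranges and the pairing arithmetic come from the UTF-16
definition.

```python
def is_high_surrogate(unit):
    """True for the first unit of a surrogate pair (0xD800..0xDBFF)."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True for the second unit of a surrogate pair (0xDC00..0xDFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


def decode_units(units):
    """Turn a list of 16-bit code units into a list of code points."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_high_surrogate(unit):
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and is_low_surrogate(units[i + 1]):
                high = unit - 0xD800
                low = units[i + 1] - 0xDC00
                out.append(0x10000 + (high << 10) + low)
                i += 2
                continue
            raise ValueError("unpaired high surrogate at index %d" % i)
        if is_low_surrogate(unit):
            raise ValueError("unpaired low surrogate at index %d" % i)
        out.append(unit)
        i += 1
    return out
```

For example, the units 0xD834 0xDD1E combine into the single code point
0x1D11E, while everything outside the surrogate ranges maps to itself.
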
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/