Internal Format (Re: [Python-Dev] Internationalization Toolkit)

M.-A. Lemburg mal@lemburg.com
Wed, 10 Nov 1999 11:03:36 +0100


Fredrik Lundh wrote:
> 
> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
> > http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format <PythonUnicode> (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent.
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...
>
> (besides, this is exactly how it's already done in
> unicode.c and what 'sre' prefers...)

OK, a fixed byte order can cause a speed penalty, so it might be
worthwhile to introduce sys.bom (or sys.endianness) for this
reason and to stick with 16-bit integers, as you have already done
in unicode.h.
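
To make the byte-order point concrete, here is a minimal sketch in
present-day Python. It assumes sys.byteorder, which is the name later
Python versions provide (sys.bom and sys.endianness above are only
proposals), and the helper pack_units is a hypothetical name used for
illustration. It handles 16-bit code units only.

```python
import struct
import sys

def pack_units(units, byteorder):
    """Pack 16-bit code units into bytes using the given byte order.

    byteorder is "little" or "big", the same values sys.byteorder takes.
    """
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack("%s%dH" % (prefix, len(units)), *units)

# Code units for "Py"; both characters fit in 16 bits.
units = [ord(ch) for ch in "Py"]

little = pack_units(units, "little")       # fixed little-endian dump
big = pack_units(units, "big")             # fixed big-endian dump
native = pack_units(units, sys.byteorder)  # whatever this machine uses
```

A dump in native order needs no swapping on the machine that wrote it,
but a reader has to know which order that was, which is what a
sys.bom-style attribute would report.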

What I don't like is using wchar_t if available (and then addressing
it as if it were defined as an unsigned integer). IMO, it's better
to define a Python Unicode representation which then gets converted
to whatever wchar_t represents on the target machine.
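
As a rough sketch of that conversion step, written in present-day
Python: the ctypes module is used here only as an assumed way to ask
how wide wchar_t is on the target machine, and to_wchar_bytes is a
hypothetical helper name, not part of the proposal.

```python
import ctypes
import sys

# Width of the platform's wchar_t as seen by ctypes: 2 bytes on Windows,
# usually 4 bytes on Linux and macOS.
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)

def to_wchar_bytes(text):
    """Convert a Python string to bytes laid out like a wchar_t array.

    Native byte order is used and no terminating NUL is appended.
    """
    suffix = "le" if sys.byteorder == "little" else "be"
    if WCHAR_SIZE == 2:
        return text.encode("utf-16-" + suffix)
    if WCHAR_SIZE == 4:
        return text.encode("utf-32-" + suffix)
    raise ValueError("unsupported wchar_t size: %d" % WCHAR_SIZE)

buf = to_wchar_bytes("abc")
```

The internal representation stays the same everywhere; only this
boundary function depends on the platform.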

Another issue is whether to use UCS-2 (as you have done) or UTF-16
(which is what Unicode 3.0 requires)... see my other post
for a discussion.
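
The practical difference shows up for code points above U+FFFF. A
small sketch in present-day Python (utf16_units and fits_ucs2 are
hypothetical helper names): UCS-2 has exactly one 16-bit unit per
character and cannot represent such code points, while UTF-16 encodes
them as a surrogate pair of two 16-bit units.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

def fits_ucs2(text):
    """True if every code point fits in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)

euro = "\u20ac"          # inside the 16-bit range
linear_b = "\U00010000"  # first code point beyond it

euro_units = utf16_units(euro)          # one unit
linear_b_units = utf16_units(linear_b)  # two units: a surrogate pair
```

With UTF-16, the number of 16-bit units no longer equals the number of
characters, so indexing and slicing have to take surrogate pairs into
account.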

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/