python-unicode doesn't support >65535 symbols?

gabor gabor at z10n.net
Sun Nov 30 17:30:40 EST 2003


On Thu, 2003-11-27 at 18:46, Andrew Clover wrote:
> gabor <gabor at z10n.net> wrote:
> 
> > so text[3] (which should be \U00010330),
> > was split to 2 16bit values (text[3] and text[4]).
> 
> The default encoding for native Unicode strings in Python is UTF-16, which
> cannot hold the extended planes beyond 0xFFFF in a single character. Instead,
> it uses two 'surrogate' characters. Bit of a nasty hack, but that's what
> Unicode does and there's nothing that can be done about it now.
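
for reference, the surrogate split described above can be sketched like this. it is only an illustration of the utf-16 arithmetic, not python's internal code, and to_surrogate_pair is a made-up helper name:

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside plane 0 need surrogates")
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


# the character from my example, \U00010330
high, low = to_surrogate_pair(0x10330)
print(hex(high), hex(low))
```

so one character outside plane 0 occupies two 16-bit slots, which is why it shows up as text[3] and text[4].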

does that mean that python, when compiled in utf-16 mode, uses
surrogates?

then it should also correctly tell me that the length is 9, not 10,
don't you think?
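
to make the 9-versus-10 difference concrete, here is a small sketch. it assumes an interpreter where strings are indexed by code point (a wide build, or python 3.3 and later); the sample string is made up for illustration, and utf16_units is a hypothetical helper name:

```python
def utf16_units(s):
    """Return how many 16-bit code units s needs when stored as UTF-16."""
    # utf-16-le writes no byte-order mark, so every 2 bytes is one code unit
    return len(s.encode("utf-16-le")) // 2


# hypothetical 9-character sample with one non-plane-0 character at index 3
text = "abc\U00010330defgh"

code_points = len(text)          # the length i would expect to be reported
code_units = utf16_units(text)   # the length a utf-16 representation counts

print(code_points, code_units)
```

the two numbers differ by exactly one for each character outside plane 0.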

as i see it, there are 2 possibilities:


1. python, when compiled for narrow-unicode, uses surrogates to encode
non-plane0 characters in utf16. if that's true, python has a bug,
because in my example text[3] should be the character i wrote
(\U00010330), and the length should also count it as one character.

or

2. python, when compiled for narrow-unicode, doesn't work with
characters outside plane0. if that's true, i would expect python to at
least tell me (by throwing an exception, for example) when i try to
decode a utf8 string that contains non-plane-0 characters.
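
the behaviour i would expect in each case can be sketched as follows. this is only my illustration of the two possibilities, not what python actually does; both function names are made up, and the utf-16 code units are given as plain integers so that the sketch doesn't depend on how the interpreter was compiled:

```python
def code_points_from_utf16(units):
    """Possibility 1: treat each surrogate pair as a single character."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # merge the high and low surrogate back into one code point
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


def decode_utf8_plane0_only(data):
    """Possibility 2: refuse to decode anything outside plane 0."""
    text = data.decode("utf-8")
    for ch in text:
        # on a narrow build a non-plane-0 character appears as two surrogates
        if 0xD800 <= ord(ch) <= 0xDFFF or ord(ch) > 0xFFFF:
            raise ValueError("input contains a character outside plane 0")
    return text


# "abc", then \U00010330 as a surrogate pair, then "d"
units = [0x61, 0x62, 0x63, 0xD800, 0xDF30, 0x64]
chars = code_points_from_utf16(units)
print(len(units), len(chars), hex(chars[3]))
```

with the first function, index 3 is the whole character and the length drops by one; with the second, the same input as utf8 bytes would be rejected instead of being silently split.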


what do you think?
gabor