python-unicode doesn't support >65535 symbols?

Andrew Clover and-google at doxdesk.com
Thu Nov 27 12:46:51 EST 2003


gabor <gabor at z10n.net> wrote:

> so text[3] (which should be \U00010330),
> was split to 2 16bit values (text[3] and text[4]).

The default encoding for native Unicode strings in Python is UTF-16, which
cannot hold characters from the extended planes beyond 0xFFFF in a single
16-bit unit. Instead, it uses two 'surrogate' units for each such character.
Bit of a nasty hack, but that's what Unicode does and there's nothing that
can be done about it now.
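
For what it's worth, here is a rough sketch of the surrogate arithmetic that
UTF-16 defines. The function name to_surrogates is made up for illustration;
it is not part of Python's library.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair.

    to_surrogates is a hypothetical helper, not a standard library function.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x10330)
print(hex(high), hex(low))  # 0xd800 0xdf30
```

So the single character \U00010330 ends up as the two 16-bit values 0xD800
and 0xDF30, which is why it occupies text[3] and text[4].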

Python can be compiled to use UCS-4 for native Unicode strings if you prefer.
Then every conceptual 'character' from the Unicode repertoire will be one
item long. It'll eat more memory too of course.
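
If you want to check which kind of build you are running, sys.maxunicode
should tell you. A minimal sketch follows; build_kind is just a name I picked.
As far as I know the wide build is selected at configure time with
--enable-unicode=ucs4, and later Python versions (3.3 onwards) always report
0x10FFFF here.

```python
import sys


def build_kind():
    """Report whether this interpreter uses 16-bit or full-range string items.

    build_kind is a hypothetical helper name, chosen for this example.
    """
    # sys.maxunicode is the largest code point a single string item can hold:
    # 0xFFFF on a narrow (UTF-16) build, 0x10FFFF on a wide (UCS-4) build.
    if sys.maxunicode == 0xFFFF:
        return "narrow"
    return "wide"


print(build_kind())
```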

> if the representation of 'text' is correct, why is the length wrong?

The representation of 'text' you are seeing is just the nicely-readable
version output by Python 2.2+. Despite the \U sequence, it is actually still
stored internally as two UTF-16 surrogates. You'll see this if you enter
'\U00012345' into Python 2.0 or 2.1, which don't use the \U form to output
strings.
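
One way to see the 16-bit units regardless of how your Python was built is to
encode the string as UTF-16 explicitly and count two bytes per unit. This is
only a sketch; utf16_length is a name invented here, and the sample string is
an assumption standing in for the original 'text'.

```python
def utf16_length(text):
    """Count the 16-bit units needed to store text as UTF-16.

    utf16_length is a hypothetical helper, not a built-in.
    """
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2


# Assumed sample: three ASCII characters followed by U+10330.
text = u"abc\U00010330"
print(utf16_length(text))  # 5
```

On a UTF-16 build, len(text) gives that same count of units, which is where
the extra item in the length comes from.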

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/



