[Tutor] how to struct.pack a unicode string?

eryksun eryksun at gmail.com
Mon Dec 3 06:56:56 CET 2012


On Sun, Dec 2, 2012 at 8:34 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> As I emailed earlier today to Peter Otten, I thought unicode_internal means
> UCS-2 or UCS-4, depending on the size of sys.maxunicode? How is this related
> to UTF-16 and UTF-32?

UCS is the universal character set. Some highlights of the Basic
Multilingual Plane (BMP): U+0000-U+00FF is Latin-1 (including the C0
and C1 control codes). U+D800-U+DFFF is reserved for UTF-16 surrogate
pairs. U+E000-U+F8FF is reserved for private use. Most of
U+F900-U+FFFF is assigned. Notably U+FEFF (zero width no-break space)
doubles as the BOM/signature in the transformation formats.
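
A quick way to check a few of these ranges, sketched here assuming
Python 3 and only the standard unicodedata and codecs modules (the
variable names are just for illustration):

```python
import codecs
import unicodedata

# General categories: Cc = control, Cs = surrogate, Co = private use,
# Cf = format.
categories = {
    "U+0080 (C1 control)": unicodedata.category("\x80"),
    "U+D800 (surrogate)": unicodedata.category(chr(0xD800)),
    "U+E000 (private use)": unicodedata.category("\ue000"),
    "U+FEFF (BOM)": unicodedata.category("\ufeff"),
}
for label, cat in categories.items():
    print(label, cat)

# U+00E9 maps to the single Latin-1 byte 0xE9.
latin1_byte = "\xe9".encode("latin-1")

# The BOM constants in codecs are just U+FEFF in each encoding form.
bom_name = unicodedata.name("\ufeff")
bom_utf8 = "\ufeff".encode("utf-8")
bom_utf16_be = "\ufeff".encode("utf-16-be")
bom_utf16_le = "\ufeff".encode("utf-16-le")
print(bom_name, bom_utf8, bom_utf16_be, bom_utf16_le)
```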

UTF-16 encodes the supplementary planes by using 2 codes as a
surrogate pair. This uses a reserved 11-bit block (U+D800-U+DFFF),
which is split into two 10-bit ranges: U+D800-U+DBFF for the lead
surrogate and U+DC00-U+DFFF for the trail surrogate. Together that's
the required 20 bits for the 16 supplementary planes. Including the
BMP, this scheme covers the complete UCS range of 17 * 2**16 ==
1114112 codes (on a wide build, that's sys.maxunicode + 1).
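
The surrogate arithmetic can be sketched as follows. This assumes
Python 3; to_surrogates and from_surrogates are hypothetical helper
names for illustration, not library functions. The last step packs the
pair with struct and compares it to what the UTF-16 codec produces:

```python
import struct

def to_surrogates(code_point):
    """Split a supplementary-plane code point into (lead, trail)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20-bit value
    lead = 0xD800 + (offset >> 10)       # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)    # low 10 bits
    return lead, trail

def from_surrogates(lead, trail):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

lead, trail = to_surrogates(0x1F600)
print(hex(lead), hex(trail))             # 0xd83d 0xde00

# Two big-endian unsigned shorts should match the codec's output.
packed = struct.pack(">HH", lead, trail)
encoded = "\U0001F600".encode("utf-16-be")
print(packed == encoded)
```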

For encoding text, use one of the transformation formats such as
UTF-8, UTF-16, or UTF-32. Unless you have a requirement to use UTF-16
or UTF-32, it's best to stick to UTF-8. It's the default encoding for
source files and for str.encode() in 3.x. It's also generally the most
compact representation (especially if there's a lot of ASCII) and is
compatible with null-terminated byte strings (i.e. a C array of char,
terminated by NUL), since it never emits a zero byte except for U+0000
itself. Regardless of narrow vs wide build, you can always encode to
one of these formats. On a narrow build, the encoders for UTF-8 and
UTF-32 first recombine any surrogate pairs in the internal
representation.
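
To compare the three formats on a small sample, here is a minimal
sketch assuming Python 3. The sample string and variable names are made
up for illustration. The explicit "-le" codec names are used so that no
BOM is counted in the sizes:

```python
import codecs

# Five ASCII letters/space, one Latin-1 letter, one supplementary code.
text = "na\u00efve \U0001F600"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    sizes[encoding] = len(data)
    print(encoding, len(data), data)

# UTF-8 output has no zero bytes here, so it survives C string handling;
# UTF-16 output for the same text does contain them.
utf8_has_nul = b"\x00" in text.encode("utf-8")
utf16_has_nul = b"\x00" in text.encode("utf-16-le")
print(utf8_has_nul, utf16_has_nul)

# The plain "utf-16" codec prepends a BOM in the platform's byte order.
bom = text.encode("utf-16")[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```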

CPython 3.3 has a new implementation that angles for the best of all
worlds, opting for a 1-byte, 2-byte, or 4-byte representation
depending on the largest code point in the string. The internal
representation doesn't use surrogates, so there's no more narrow vs
wide build distinction.

