How to turn a string into a list of integers?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Sep 6 14:19:52 EDT 2014


Kurt Mueller wrote:

[...]
> Now the part of the two Python builds is still somewhat unclear to me.
[...]
> In Python 2.7:
> 
> As I learned from the ord() manual:
> If a unicode argument is given and Python was built with UCS2 Unicode,

Where does the manual mention UCS-2? As far as I know, no version of Python
uses that.


> (I suppose this is the narrow build in your terms),

Mostly right, but not quite. "Narrow build" means that Python uses UTF-16,
not UCS-2, although the two are very similar. See below for further
details. But to make it more confusing, *parts* of Python (like the unichr
function) assume UCS-2, and refuse to accept values over 0xFFFF.


> then the character’s code point must be in the range [0..65535] inclusive;

Half-right. Unicode code points are always in the range U+0000 to U+10FFFF,
or in decimal, [0...1114111]. But, Python "narrow builds" don't quite
handle that correctly, and only half-support code points from
[65536...1114111]. The reasons are complicated, but see below.

UCS-2 is an encoding from an early, now obsolete, version of Unicode which
is limited to just 65536 characters (technically: "code points") instead of
the full range of 1114112 code points supported by Unicode today.

UCS-2 is very similar to UTF-16. Both use a 16-bit "code unit" to represent
characters. In UCS-2, each character is represented by precisely 1 code
unit, numbered between 0 and 65535 (0x0000 and 0xFFFF in hex). In UTF-16,
the most common characters (the Basic Multilingual Plane) are likewise
represented by 1 code unit, between 0 and 65535, but there is a range
of "characters" (actually code points) which is reserved for use in
so-called "surrogate pairs". Using hex:

Code points U+0000 to U+D7FF:
    - represent the same character in UCS-2 and UTF-16;

Code points U+D800 to U+DFFF:
    - represent reserved but undefined characters in UCS-2;
    - represent surrogates in UTF-16 (see below);

Code points U+E000 to U+FFFF:
    - represent the same character in UCS-2 and UTF-16;

Code points U+10000 to U+10FFFF:
    - impossible to represent in UCS-2;
    - represented by TWO surrogates in UTF-16.

For example, the Unicode code point U+1D11E (MUSICAL SYMBOL G CLEF) cannot
be represented at all in UCS-2, because it is past U+FFFF. In UTF-16, it
cannot be represented as a single 16-bit code unit; instead it is
represented as two code units, 0xD834 0xDD1E. That is called a "surrogate
pair".
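
If you want to see where 0xD834 0xDD1E comes from, here is a minimal sketch
of the standard UTF-16 arithmetic. The function name to_surrogate_pair is my
own invention for illustration, not something in the standard library.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print("%04X %04X" % (high, low))  # D834 DD1E
```

The high surrogate always lands in 0xD800..0xDBFF and the low surrogate in
0xDC00..0xDFFF, which is why that whole block is reserved.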

The problem with Python's narrow builds is that, although characters are
variable width (the most common are 1 code unit, 16 bits, the rest are 2
code units), the Python implementation assumes that all characters are a
fixed 16 bits. So if your string is a single character like U+1D11E,
instead of treating it as a string of length one with ordinal value
0x1D11E, Python will treat it as a string of length *two* with ordinal
values 0xD834 and 0xDD1E.

(In other words, Python narrow builds fail to deal with surrogate pairs
correctly.)
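
You can see the code units a narrow build works with, on any build, by
encoding the string to UTF-16 and unpacking the result 16 bits at a time.
This is only a sketch: the helper name utf16_code_units is made up for this
example, and it shows the UTF-16 encoding rather than peeking at Python's
actual internal storage.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")      # big-endian, no byte order mark
    count = len(data) // 2               # two bytes per code unit
    return list(struct.unpack(">%dH" % count, data))


units = utf16_code_units(u"\U0001D11E")
print([hex(u) for u in units])  # ['0xd834', '0xdd1e']
```

One code point, two code units. A narrow build reports the second number as
the length of the string.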

Although you cannot create that string using unichr on a narrow build, you
can create it using the \U notation:

py> unichr(0x1D11E)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
py> u'\U0001D11E'
u'\U0001d11e'
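
If you are not sure which kind of build you have, sys.maxunicode will tell
you: it is 0xFFFF on a narrow build and 0x10FFFF on a wide build. The
following is a small sketch (is_narrow_build is a name I made up). Note that
from Python 3.3 onwards the narrow/wide distinction is gone and
sys.maxunicode is always 0x10FFFF.

```python
import sys


def is_narrow_build():
    """True if this interpreter stores Unicode strings as 16-bit code units."""
    return sys.maxunicode == 0xFFFF


# On a narrow build the G clef is seen as two "characters", otherwise as one.
clef_length = len(u"\U0001D11E")
print(is_narrow_build())
print(clef_length)
```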


> I understand: In a UCS2 build each character of a Unicode string uses
> 16 Bits and can represent code points from U-0000..U-FFFF.

That is correct. So UCS-2 can only represent a small subset of Unicode.


> From the unichr(i) manual I learn:
> The valid range for the argument depends how Python was configured
> – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
> I understand: narrow build is UCS2, wide build is UCS4

UCS-4 is exactly the same as UTF-32, and wide builds use a fixed 32 bits for
every code point, so that's correct.


> - In a UCS2 build each character of an Unicode string uses 16 Bits and has
>   code points from U-0000..U-FFFF (0..65535)

As I said, that's not strictly correct. Python is actually using UTF-16, but
it's a buggy or incomplete UTF-16, with parts of the system assuming UCS-2.


> - In a UCS4 build each character of an Unicode string uses 32 Bits and has
>   code points from U-00000000..U-0010FFFF (0..1114111)

Correct. Remember that UCS-4 and UTF-32 are exactly the same: every code
point from U+0000 to U+10FFFF is represented by a single 32-bit value. So
our earlier example, U+1D11E (MUSICAL SYMBOL G CLEF) would be represented
as 0x0001D11E in UTF-32 and UCS-4.
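
To compare the two encodings side by side, you can encode the same
one-character string both ways and look at the bytes. This is just an
illustration using the standard "utf-32-be" and "utf-16-be" codecs
(big-endian, so the bytes read in the same order as the hex numbers above);
the variable names are arbitrary.

```python
clef = u"\U0001D11E"

utf32 = clef.encode("utf-32-be")   # one 32-bit code unit: 00 01 D1 1E
utf16 = clef.encode("utf-16-be")   # two 16-bit code units: D8 34 DD 1E

print(len(utf32))  # 4
print(len(utf16))  # 4
```

Both come to four bytes for this character, but UTF-32 spends four bytes on
every character, while UTF-16 only needs two for anything in the Basic
Multilingual Plane.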

Remember, though, these internal representations are (nearly) irrelevant to
Python code. In Python code, you just consider that a Unicode string is an
array of ordinal values from 0x0 to 0x10FFFF, each representing a single
code point U+0000 to U+10FFFF. The only reason I say "nearly" is that
narrow builds don't *quite* work right if the string contains surrogate
pairs.
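
Coming back to the original question, if you need a list of integers that is
the same on narrow and wide builds, one approach is to join surrogate pairs
yourself. What follows is a hedged sketch, not a tested library routine: the
name code_points is hypothetical, and I have chosen to pass lone surrogates
through unchanged rather than raise an error.

```python
def code_points(text):
    """Return the code points of text as ints, joining surrogate pairs."""
    result = []
    pending_high = None                  # high surrogate waiting for its partner
    for ch in text:
        value = ord(ch)
        if pending_high is not None:
            if 0xDC00 <= value <= 0xDFFF:
                # Valid pair: undo the UTF-16 arithmetic.
                joined = 0x10000 + ((pending_high - 0xD800) << 10) + (value - 0xDC00)
                result.append(joined)
                pending_high = None
                continue
            result.append(pending_high)  # lone high surrogate, keep as-is
            pending_high = None
        if 0xD800 <= value <= 0xDBFF:
            pending_high = value
        else:
            result.append(value)
    if pending_high is not None:
        result.append(pending_high)      # string ended on a high surrogate
    return result


print(code_points(u"A\U0001D11EB") == [0x41, 0x1D11E, 0x42])  # True
```

On a wide build (or Python 3.3 and later) the loop never sees a surrogate for
a well-formed string, so this reduces to a plain [ord(c) for c in text].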



-- 
Steven



