How to turn a string into a list of integers?

Sat Sep 6 08:15:26 EDT 2014

On 06.09.2014 at 07:47, Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> Kurt Mueller wrote:
>> Could someone please explain the following behavior to me:
>> Python 2.7.7, MacOS 10.9 Mavericks

[snip]
Thanks for the detailed explanation. I think I understand a bit better now.

The part about the two Python builds is still somewhat unclear to me, though.

> If you could peer under the hood, and see what implementation Python uses to
> store that string, you would see something version dependent. In Python
> 2.7, you would see an object more or less something vaguely like this:
> 
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x0041 0x00C4]
> 
> 
> That's for a so-called "narrow build" of Python. If you have a "wide build",
> it will be something like this:
> 
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x00000041 0x000000C4]
> 
> In Python 3.3, "narrow builds" and "wide builds" are gone, and you'll have
> something conceptually like this:
> 
> [object header containing various fields]
> [length = 2]
> [tag = one byte per character]
> [array of bytes = 0x41 0xC4]
> 
> Some other implementations of Python could use UTF-8 internally:
> 
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x41 0xC3 0x84]
> 
> 
> or even something more complex. But the important thing is, regardless of
> the internal implementation, Python guarantees that a Unicode string is
> treated as a fixed array of code points. Each code point has a value
> between 0 and, not 127, not 255, not 65535, but 1114111.
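
To make this concrete for myself, here is a minimal sketch of turning a
Unicode string into a list of integers (its code points). The sample string
and the helper name code_points are my own choices for illustration. The
encode() calls only mimic the storage layouts quoted above; they are not a
view into the interpreter's actual memory.

```python
# Sketch only: should run on Python 2.7 and on Python 3.3+.
s = u"A\xc4"  # 'A' followed by 'A with diaeresis' (assumed sample string)


def code_points(text):
    """Return the code points of a Unicode string as a list of integers."""
    # code_points is a hypothetical helper name, not a standard function.
    return [ord(ch) for ch in text]


points = code_points(s)  # [65, 196]

# Encodings that resemble the quoted layouts (big-endian, no BOM).
utf16 = s.encode("utf-16-be")  # 2 bytes per character for this string
utf32 = s.encode("utf-32-be")  # 4 bytes per character
utf8 = s.encode("utf-8")       # 1 byte for 'A', 2 bytes for the umlaut
```

As far as I understand, on a narrow build a character above U+FFFF would
show up here as two surrogate code units rather than one integer.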

In Python 2.7:

As I learned from the ord() manual:
If a unicode argument is given and Python was built with UCS2 Unicode,
(I suppose this is the narrow build in your terms),
then the character’s code point must be in the range [0..65535] inclusive;

I understand: In a UCS2 build each character of a Unicode string uses
16 bits and can represent code points from U+0000..U+FFFF.

From the unichr(i) manual I learn:
The valid range for the argument depends how Python was configured
– it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].

I understand: narrow build is UCS2, wide build is UCS4.
- In a UCS2 build each character of a Unicode string uses 16 bits and has
  code points from U+0000..U+FFFF (0..65535)
- In a UCS4 build each character of a Unicode string uses 32 bits and has
  code points from U+0000..U+10FFFF (0..1114111)
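
To check which build I am actually running, I believe sys.maxunicode can be
used: it should be 65535 (0xFFFF) on a narrow build and 1114111 (0x10FFFF)
on a wide build. A small sketch, where describe_build is just a name I made
up for illustration:

```python
import sys


def describe_build():
    """Return 'narrow' or 'wide' depending on sys.maxunicode."""
    # describe_build is a hypothetical helper, not part of the stdlib.
    if sys.maxunicode == 0xFFFF:
        return "narrow"  # UCS2 build (only possible before Python 3.3)
    return "wide"        # UCS4 build, or any Python 3.3+


build = describe_build()
largest = sys.maxunicode  # largest code point this interpreter supports
```

On Python 3.3 and later this should always report "wide", since the
narrow/wide distinction is gone there.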

Am I right?
-- 
Kurt Mueller, kurt.alfred.mueller at gmail.com