How to turn a string into a list of integers?

Sun Sep 7 10:52:28 EDT 2014

On 2014-09-07 02:47, Steven D'Aprano wrote:
> Kurt Mueller wrote:
>
>> Processing any Unicode string will work with small and wide
>> python 2.7 builds and also with python >3.3?
>> ( parts of small build python will not work with values over 0xFFFF )
>> ( strings with surrogate pairs will not work correctly on small build
>> python )
>
>
> If you limit yourself to code points in the Basic Multilingual Plane, U+0000
> to U+FFFF, then Python's Unicode handling works fine no matter what version
> or implementation is used. Since most people use only the BMP, you may not
> notice any problems.
>
> (Of course, there are performance and memory-usage differences from one
> version to the next, but the functionality works correctly.)
>
> If you use characters from the supplementary planes ("astral characters"),
> then:
>
> * wide builds will behave correctly;
> * narrow builds will wrongly treat astral characters as two
>    independent characters, which means functions like len()
>    and string slicing will do the wrong thing;
> * Python 3.3 doesn't use narrow and wide builds any more,
>    and also behaves correctly with astral characters.
>
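For the original question (turning a string into a list of integers), the narrow-build behaviour matters directly. Here is a minimal sketch, not a definitive implementation. The name code_points is made up for illustration, and it assumes well-formed input. On a narrow build it merges each surrogate pair back into one code point; on a wide build or Python 3.3+ it is just ord() applied to each character.

```python
import sys

# True only on narrow builds (Python 2.x and 3.0-3.2 compiled with UCS-2).
NARROW_BUILD = sys.maxunicode == 0xFFFF


def code_points(text):
    """Return the Unicode code points of text as a list of integers."""
    units = [ord(ch) for ch in text]
    if not NARROW_BUILD:
        # Each character is already a full code point.
        return units

    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        # A high surrogate followed by a low surrogate encodes one
        # astral code point; combine the pair into a single integer.
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            unit = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        result.append(unit)
        i += 1
    return result


sample = u"a\u20ac\U0001F600"
print(code_points(sample))  # [97, 8364, 128512]
```

On a narrow build, len(sample) is 4 rather than 3, which is the "two independent characters" problem described above; the helper hides that difference for this one task only.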
>
> So there are three strategies for correct Unicode support in Python:
>
> * avoid astral characters (and trust your users will also avoid them);
>
> * use a wide build;
>
> * use Python 3.3 or higher.
>
>
> In case you are wondering what Python 3.3 does differently, when it builds a
> string, it works out the largest code point in the string. If the largest
> code point is no greater than U+00FF, it stores the string in Latin 1 using
> 8 bits per character; if the largest code point is no greater than U+FFFF,
> then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
> same); if the string contains any astral characters, then it uses UTF-32.
> So regardless of the string, each character uses a single code unit. Only
> the size of the code unit varies.
>
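I don't think you should be saying that it stores the string in Latin-1
or UTF-16, because that might suggest that the strings are encoded. They
aren't: each string is stored as an array of code points, and only the
width of the array elements varies.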
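
A rough way to see the element width on CPython 3.3 or later is to measure how much a string grows per added character. This is only a sketch: bytes_per_character is a hypothetical helper, and it relies on sys.getsizeof() reporting CPython's internal layout, which is an implementation detail and not a language guarantee.

```python
import sys


def bytes_per_character(ch, n=100, extra=100):
    """Estimate the storage used per character for strings made of ch.

    Compares two strings that differ only in length, so the fixed
    object header cancels out of the difference.
    """
    short = ch * n
    longer = ch * (n + extra)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // extra


# One sample character from each range mentioned in the quoted text:
# ASCII, Latin-1 (up to U+00FF), BMP (up to U+FFFF), and astral.
for ch in ("a", "\xe9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_character(ch))
```

On CPython 3.3+ I would expect this to report 1, 1, 2 and 4 bytes respectively. In every case indexing returns a whole code point and never half of a surrogate pair.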