How to turn a string into a list of integers?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Sep 6 21:47:07 EDT 2014


Kurt Mueller wrote:

> Processing any Unicode string will work with small and wide
> python 2.7 builds and also with python >3.3?
> ( parts of small build python will not work with values over 0xFFFF )
> ( strings with surrogate pairs will not work correctly on small build
> python )


If you limit yourself to code points in the Basic Multilingual Plane, U+0000
to U+FFFF, then Python's Unicode handling works fine no matter what version
or implementation is used. Since most people use only the BMP, you may not
notice any problems.

(Of course, there are performance and memory-usage differences from one
version to the next, but the functionality works correctly.)

If you use characters from the supplementary planes ("astral characters"),
then:

* wide builds will behave correctly;
* narrow builds will wrongly treat each astral character as two
  independent characters (a surrogate pair), which means functions
  like len() and string slicing will do the wrong thing;
* Python 3.3 and later don't use narrow and wide builds any more,
  and also behave correctly with astral characters.
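
To make the difference concrete, here is a minimal sketch, assuming Python
3.3 or later. It turns a string into a list of integers two ways: by code
point, which is what wide builds and 3.3+ see, and by UTF-16 code unit, which
is roughly what a narrow build sees. The helper name utf16_code_units is my
own, not something from the standard library.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers.

    Hypothetical helper: it approximates how a narrow build stores
    the string internally.
    """
    data = text.encode("utf-16-le")
    # Each UTF-16 code unit is one unsigned 16-bit little-endian integer.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, an astral character.
s = "a\U0001D11E"

code_points = [ord(c) for c in s]   # one integer per character
code_units = utf16_code_units(s)    # the astral character becomes two units

print(len(s), code_points)
print(len(code_units), code_units)
```

The second list is one item longer than the first because the astral
character is split into a high and a low surrogate. That extra item is why
len() and slicing go wrong on narrow builds.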


So there are three strategies for correct Unicode support in Python:

* avoid astral characters (and trust your users will also avoid them);

* use a wide build;

* use Python 3.3 or higher.
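
If you want to check at runtime which situation you are in, something along
these lines may help. It is only a sketch: sys.maxunicode is real, but the
function names build_handles_astral and has_astral are made up for this
example.

```python
import sys

def build_handles_astral():
    """True on wide builds and on Python 3.3+, False on narrow builds."""
    # Narrow builds report 0xFFFF here; everything else reports 0x10FFFF.
    return sys.maxunicode > 0xFFFF

def has_astral(text):
    """Report whether text goes beyond the Basic Multilingual Plane.

    Surrogate code points are counted too, because that is how an
    astral character shows up on a narrow build.
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            return True
    return False

print(build_handles_astral())
print(has_astral("hello"), has_astral("\u20ac"), has_astral("\U0001F600"))
```

A program that must run on narrow builds could use has_astral() to reject or
flag input it cannot handle correctly, rather than silently miscounting
characters.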


In case you are wondering what Python 3.3 does differently: when it builds a
string, it works out the largest code point in the string. If the largest
code point is no greater than U+00FF, it stores the string as Latin-1, using
8 bits per character. If the largest code point is no greater than U+FFFF,
it uses UCS-2 (or UTF-16; within the BMP they are functionally the same),
at 16 bits per character. If the string contains any astral characters, it
uses UTF-32, at 32 bits per character. So regardless of the string, each
character uses a single code unit. Only the size of the code unit varies.
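
You can see this for yourself with sys.getsizeof. The following is a rough
sketch that assumes CPython 3.3 or later; other implementations are free to
store strings differently, and bytes_per_character is a name I made up. It
compares two strings of different lengths so that the fixed object overhead
cancels out.

```python
import sys

def bytes_per_character(ch, n=100):
    """Estimate the storage used per character for strings made of ch.

    Hypothetical helper, CPython-specific: the fixed header size cancels
    out when the sizes of two strings are subtracted.
    """
    small = ch * n
    large = ch * (2 * n)
    return (sys.getsizeof(large) - sys.getsizeof(small)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_character(ch))
```

On CPython 3.3+ I would expect this to report 1, 1, 2 and 4 bytes per
character respectively, matching the three storage widths described above.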



-- 
Steven



