[Tutor] how are unicode chars represented?

Kent Johnson kent37 at tds.net
Tue Mar 31 12:57:39 CEST 2009


On Tue, Mar 31, 2009 at 1:52 AM, Mark Tolonen <metolone+gmane at gmail.com> wrote:

> Unicode is simply code points.  How the code points are represented
> internally is another matter.  The below code is from a 16-bit Unicode build
> of Python but should look exactly the same on a 32-bit Unicode build;
> however, the internal representation is different.
>
> Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)]
> on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> x=u'\U00012345'
> >>> x.encode('utf8')
> '\xf0\x92\x8d\x85'
>
> However, I wonder if this should be considered a bug.  I would think the
> length of a Unicode string should be the number of code points in the
> string, which for my string above should be 1.  Anyone have a 32-bit Unicode
> build of Python handy?  This exposes the implementation as UTF-16.
> >>> len(x)
> 2
> >>> x[0]
> u'\ud808'
> >>> x[1]
> u'\udf45'

In standard Python the representation of unicode is 16 bits per code
unit, without correct handling of surrogate pairs (which is what your
string contains), so len() and indexing count each half of the pair
separately. I think this is called UCS-2, not UTF-16.
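
To see where the two values in your session come from, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The function name to_surrogate_pair is made up for illustration; it is not something Python provides, and this is not a claim about how the interpreter does it internally.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x12345)
print("%#x %#x" % (high, low))         # prints: 0xd808 0xdf45
```

Those are the same two values as x[0] and x[1] above, which is why len(x) is 2 on a 16-bit build.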

There is a compile switch to enable 32-bit representation of
unicode. See PEP 261 and the "Internal Representation" section of the
second link below for more details.
http://www.python.org/dev/peps/pep-0261/
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
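
If you want to check which kind of build you are running, sys.maxunicode tells you the largest code point a single unicode character can hold. A small sketch follows; the function name unicode_build and the "narrow"/"wide" labels are my own choice for the example.

```python
import sys


def unicode_build():
    """Return "narrow" for a 16-bit unicode build, "wide" otherwise.

    Hypothetical helper; it relies only on sys.maxunicode.
    """
    # A 16-bit build reports 0xFFFF here; a 32-bit build reports 0x10FFFF.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"


print("%s build, sys.maxunicode = %#x" % (unicode_build(), sys.maxunicode))
```

On a wide build, len(u'\U00012345') should be 1 rather than 2.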

Kent

