[Python-Dev] len(chr(i)) = 2?

M.-A. Lemburg mal at egenix.com
Mon Nov 22 19:53:00 CET 2010


Raymond Hettinger wrote:
> Any explanation we give users needs to let them know two things:
> * that we cover the entire range of unicode not just BMP
> * that sometimes len(chr(i)) is one and sometimes two
> 
> The term UCS-2 is a complete communications failure
> in that regard.  If someone looks up the term, they will
> immediately see something like the wikipedia entry which says,
> "UCS-2 cannot represent code points outside the BMP".
> How is that helpful?

It's very helpful, since it explains why a UCS-2 build of Python
requires a surrogate pair to represent a non-BMP code point
and explains why chr(i) gives you a length 2 string rather than
a length 1 string.

A UCS-4 build does not need to use surrogates for this, hence
you get a length 1 string from chr(i).
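
To make the difference concrete, here is a minimal sketch. It assumes
a current Python 3 (3.3 or later), where chr(i) always has length 1,
so the UCS-2 build behaviour is only emulated by counting UTF-16 code
units. The helper names utf16_units and surrogate_pair are made up for
this illustration and are not part of Python.

```python
def utf16_units(cp):
    # Number of 16-bit code units UTF-16 needs for code point cp.
    # This is what len(chr(cp)) reported on a UCS-2 build.
    return len(chr(cp).encode("utf-16-le")) // 2


def surrogate_pair(cp):
    # Split a non-BMP code point into its (high, low) surrogates.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(utf16_units(0x20AC))    # BMP code point (EURO SIGN)
print(utf16_units(0x1D11E))   # non-BMP code point (MUSICAL SYMBOL G CLEF)
print([hex(u) for u in surrogate_pair(0x1D11E)])
```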

There are two levels we have to explain to users:

1. the transfer level

2. the storage level

The UTF encodings address the transfer level and are what
you deal with in I/O. These provide variable length encodings of
the complete Unicode code point range, regardless of whether
you have a UCS-2 or a UCS-4 build.
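
As a rough illustration of the transfer level, the following sketch
encodes one string containing both BMP and non-BMP code points with
the three UTF codecs and decodes it again. The sample string is my
own choice; the codec names are the standard ones shipped with Python.

```python
# 'a', LATIN SMALL LETTER E WITH ACUTE, EURO SIGN, MUSICAL SYMBOL G CLEF
s = "a\u00e9\u20ac\U0001D11E"

sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = s.encode(codec)
    # Every UTF encoding round-trips the full code point range.
    assert data.decode(codec) == s
    sizes[codec] = len(data)

# Byte counts differ per encoding; the decoded text does not.
print(sizes)
```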

The storage level becomes important if you want to work on
strings using indexing and slicing. Here you do have to know
whether you're dealing with a UCS-2 or a UCS-4 build, since the
indexes will vary if you're using non-BMP code points.

Finally, to tie both together, we have to explain that UTF-16
(the transfer encoding) maps to UCS-2 in a straightforward way,
so it is possible to work with a UCS-2 build of Python and still
use the complete Unicode code point range. You only have to
take into consideration that Python's string indexing will not
necessarily point you to the n-th code point in a string, but may
well give you one half of a surrogate pair.
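
The indexing effect can be sketched as follows. Since current Python 3
releases no longer store strings as UCS-2, the list of 16-bit code
units below only emulates what a UCS-2 build would index into; the
helpers code_units and is_surrogate are hypothetical names used for
this example.

```python
import struct


def code_units(s):
    # Emulate UCS-2 storage: the string as a list of 16-bit code units.
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_surrogate(unit):
    # True if the 16-bit unit is one half of a surrogate pair.
    return 0xD800 <= unit <= 0xDFFF


s = "a\U0001D11Eb"
units = code_units(s)

# Index 1 in the emulated storage is only the high surrogate,
# and 'b' has moved from index 2 to index 3.
print([hex(u) for u in units])
print(is_surrogate(units[1]), chr(units[3]))
```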

Note that while that last aspect may appear like a good argument
for UCS-4 builds, in reality it is not. UCS-4 has the same
issue on a different level: the letters that get printed on
the screen or printer (graphemes) may well be made up of
multiple combining code points, e.g. an "e" and an "´".
Those again map to two indexes in the Python string, even
though they appear to be one character on output.
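
A small sketch of the combining case, assuming the accent meant above
is U+0301 COMBINING ACUTE ACCENT (that choice is my assumption). It
uses only the standard unicodedata module.

```python
import unicodedata

# One grapheme on screen, two code points in the string.
decomposed = "e\u0301"
print(len(decomposed))

# A nonzero combining class marks U+0301 as a combining character.
print(unicodedata.combining(decomposed[1]) != 0)

# NFC normalization happens to merge this pair into U+00E9, but not
# every base + combining sequence has a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))
```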

Now try to explain all of the above using the terms "narrow"
and "wide" (while remembering "explicit is better than implicit"
and "avoid the temptation to guess") :-)

It is not really helpful to replace a correct and accurate
term with a fuzzy term: either way we're stuck with the
semantics.

However, the correct and accurate terms at least give
you a chance to figure out and understand the reasoning
behind the design. UCS-2 vs. UCS-4 is a trade-off, "narrow"
and "wide" is marketing talk with an implicit emphasis on
one side :-)

-- 
Marc-Andre Lemburg
eGenix.com


