[Python-Dev] thoughts on the bytes/string discussion

Wed Jul 7 08:06:36 CEST 2010

Ronald Oussoren, 06.07.2010 16:51:
> On 27 Jun, 2010, at 11:48, Greg Ewing wrote:
>
>> Stefan Behnel wrote:
>>> Greg Ewing, 26.06.2010 09:58:
>>>> Would there be any sanity in having an option to compile Python
>>>> with UTF-8 as the internal string representation?
>>> It would break Py_UNICODE, because the internal size of a unicode
>>> character would no longer be fixed.
>>
>> It's not fixed anyway with the 2-char build -- some characters are
>> represented using a pair of surrogates.
>
> It is for practical purposes not even fixed in 4-char builds. In 4-char
> builds every Unicode code point corresponds to one item in a Python
> unicode string, but a base character with combining characters is still
> a sequence of characters and should IMHO almost always be treated as a
> single object. As an example, given s="be\N{COMBINING DIAERESIS}" s[:2]
> or s[2:] is almost certainly semantically invalid.

Sure. However, this is not a problem for the purpose of the C-API, 
especially for Cython (which is the angle from which I brought this up). 
All Cython cares about is that it mimics CPython's semantics exactly when
transforming code, and a CPython runtime will ignore surrogate pairs and 
combining characters during iteration and indexing, and when determining 
the string length. So a single-character unicode string can currently be
safely aliased by Py_UNICODE with correct Python semantics. That would no 
longer be the case if the internal representation switched to UTF-8 and/or 
if CPython started to take surrogates and combining characters into account 
when considering the string length.
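
To make the indexing semantics concrete, here is a minimal sketch in Python
rather than C. It assumes a current Python 3 interpreter (3.3 or later),
where the narrow/wide build distinction no longer exists; the variable names
are only illustrative. It reuses the string from the quoted example and shows
that length and slicing count code points, not user-perceived characters.

```python
import unicodedata

# The quoted example: "be" followed by U+0308 COMBINING DIAERESIS.
s = "be\N{COMBINING DIAERESIS}"

# len() and slicing operate on code points, so the combining
# character counts as its own item.
length = len(s)
head = s[:2]   # the base characters only
tail = s[2:]   # the combining character on its own

# NFC normalization composes "e" + U+0308 into the single code point U+00EB.
composed = unicodedata.normalize("NFC", s)

print(length, repr(head), ascii(tail), len(composed))
```

The slices are valid strings as far as the runtime is concerned; whether
they are meaningful text is left to the code that uses them.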

Note that it's impossible to tell whether a unicode string contains
surrogate pairs because it's running on a narrow unicode build or because
the user entered them into the string. But the user would likely expect the
second case to treat them as separate code points, whereas the first is an 
implementation detail that should normally be invisible. Combining 
characters are a lot clearer here, as they can only be entered by users, so 
keeping them separate as provided is IMHO the expected behaviour.
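
The ambiguity can be sketched as follows, again assuming Python 3.3 or
later, which stores the two cases differently. The character U+1D11E is an
arbitrary choice for illustration. Its UTF-16 encoding yields the surrogate
pair that a narrow build would hold internally, and the second literal spells
out the same two surrogates explicitly.

```python
import struct

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"

# Its UTF-16 form is the pair of 16-bit units a narrow build would store.
high, low = struct.unpack("<2H", clef.encode("utf-16-le"))

# The same two surrogates, entered explicitly as separate code points.
entered = "\ud834\udd1e"

print(hex(high), hex(low), len(clef), len(entered), clef == entered)
```

Here the two strings differ in length and compare unequal. On a narrow
build both literals would end up as the same two Py_UNICODE units, which is
why the two cases cannot be told apart there.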

I think the main theme here is that the interpretation of code points and 
their transformation for user interfaces and backends is left to the user 
code. Py_UNICODE represents a code point in the current system, including 
surrogate pair 'escapes'. And that would change if the underlying encoding 
switched to something other than UTF-16/UCS-4.
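
As a rough illustration of why a fixed-size unit stops working under UTF-8,
the following sketch counts storage units per code point in each encoding.
The sample characters are arbitrary picks, one for each UTF-8 width.

```python
# One sample code point for each UTF-8 width: U+0061, U+00E9, U+20AC, U+1D11E.
samples = ["a", "\u00e9", "\u20ac", "\U0001D11E"]

# Bytes per code point in UTF-8: varies from 1 to 4.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]

# 16-bit units per code point in UTF-16: 2 only outside the BMP.
utf16_units = [len(ch.encode("utf-16-le")) // 2 for ch in samples]

# 32-bit units per code point in UTF-32 (UCS-4): always 1.
utf32_units = [len(ch.encode("utf-32-le")) // 4 for ch in samples]

print(utf8_widths, utf16_units, utf32_units)
```

With UTF-8, even characters inside the BMP take between one and three
units, so no single fixed-width C type could alias a one-character string.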

Stefan