[Python-3000] PEP Draft: Enhancing the buffer protocol

Wed Feb 28 19:55:21 CET 2007

Travis Oliphant <oliphant.travis at ieee.org> wrote:
> I think you are right.  In the discussions for unifying string/unicode I 
> really like the proposals that are leaning toward having a unicode 
> object be an immutable string of either ucs-1, ucs-2, or ucs-4 depending 
> on what is in the string.

Except that it's not going to happen.  The width of the unicode
representation is going to be fixed at compile time, generally utf-16 or
ucs-4.  I say utf-16 because the representation allows for surrogate
pairs, etc., but each value of the pair is considered a "character",
whereas (according to my potentially flawed memory of reading the spec)
ucs-2 doesn't allow for surrogates.
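
To make the surrogate-pair point concrete, here is a minimal sketch in
present-day Python 3 (not code from this thread).  The helper name
utf16_code_units is hypothetical; it only shows that a character outside
the Basic Multilingual Plane occupies two 16-bit values in utf-16, each
of which a fixed-width utf-16 build would index as a separate "character".

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding (hypothetical helper)."""
    raw = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def is_surrogate(unit):
    """True if a 16-bit value lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF

# 'a' fits in one code unit; U+1D11E (musical G clef) needs a surrogate pair.
units = utf16_code_units("a\U0001D11E")
print([hex(u) for u in units])
```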

Note that I previously offered an overlay structure that could support
O(log n) time access to arbitrary full characters regardless of
encoding (utf-8, utf-16 or ucs-4) using O(log n) space, but it was
decided by Guido that Python should return a partial character (half of
a surrogate pair) rather than offer non-constant character access time.*
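
The general idea of such an overlay can be sketched as follows.  This is
not the structure proposed earlier, whose details are not given in this
message; it is a simplified stand-in for utf-16 only, with a hypothetical
class name Utf16Overlay.  It assumes a sorted list of the character
positions that are stored as surrogate pairs and uses a binary search to
translate a character index into a code-unit offset.  Lookup is
logarithmic in the number of pairs; the index itself grows with the
number of pairs, so it does not reproduce the space bound claimed above.

```python
import bisect
import struct

class Utf16Overlay:
    """Full-character indexing over UTF-16 code units (illustrative sketch)."""

    def __init__(self, text):
        raw = text.encode("utf-16-le")
        self.units = struct.unpack("<%dH" % (len(raw) // 2), raw)
        # Sorted character indices of the characters stored as surrogate pairs.
        self.pair_chars = []
        char_index = 0
        i = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:  # high surrogate starts a pair
                self.pair_chars.append(char_index)
                i += 2
            else:
                i += 1
            char_index += 1
        self.length = char_index

    def __len__(self):
        return self.length

    def __getitem__(self, char_index):
        if not 0 <= char_index < self.length:
            raise IndexError("character index out of range")
        # Each pair before this character shifts its offset by one extra unit.
        pairs_before = bisect.bisect_left(self.pair_chars, char_index)
        offset = char_index + pairs_before
        unit = self.units[offset]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[offset + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)

s = "a\U0001D11Eb\U0001F600c"
overlay = Utf16Overlay(s)
print(len(overlay.units), len(overlay), overlay[3] == "\U0001F600")
```

Indexing the raw code units directly would instead return half of a
surrogate pair at offsets 1, 2, 4 and 5, which is the constant-time
behaviour that was chosen.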

 - Josiah

* As a side note, the space and time are really functions of how often
surrogates, or their equivalent in utf-8, etc., occur.  The worst case
is O(log n) for both, but the actual cost depends on the structure of
occurrences of the non-constant character lengths.