[Python-3000] How will unicode get used?

Wed Sep 20 20:20:13 CEST 2006

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> Let me cut this short. The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> before or after Py3k.

Thanks, but one thing remains unclear: is the indexing intended to
represent bytes, code points, or code units?  Note that C code
operating on UTF-16 would use code units for slicing of UTF-16, which
splits surrogate pairs.
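
To make the surrogate-splitting concrete, here is a rough sketch.  It
assumes a present-day Python 3 interpreter (which this thread
predates), and the names text, units, head and tail are mine, purely
for illustration.  It lays a string out as 16-bit code units the way C
code would see it and slices at a code unit boundary:

```python
import struct

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
# needs a surrogate pair (two code units) to represent it.
text = "a\U0001D11Eb"

# Encode as UTF-16 (little-endian, no BOM) and unpack the bytes into
# 16-bit code units.
raw = text.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(raw) // 2), raw)

# Slicing after the second code unit cuts the surrogate pair in half:
# head ends in a lone high surrogate, tail starts with a lone low one.
head = units[:2]
tail = units[2:]

print([hex(u) for u in units])
print([hex(u) for u in head], [hex(u) for u in tail])
```

Neither half is well-formed UTF-16 on its own, which is the hazard of
indexing by code unit.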

As far as I can tell, CPython on Windows uses UTF-16 with code units.
Perhaps not intentionally, but by default (not throwing an error on
surrogates).

For those trying to make sense of this, a Code Point is anything in the 0
to 0x10FFFF range.  A Code Unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32.  One or more code units may be
needed to form a single code point.  Obviously code units expose our
internal implementation choice.
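
A small sketch of the code point / code unit relationship, again
assuming a present-day Python 3 interpreter; code_units() is a
hypothetical helper of my own, not an existing API.  It counts how many
code units each encoding needs for a single code point:

```python
def code_units(ch, encoding, unit_bytes):
    """Number of code units needed to encode the one code point ch."""
    return len(ch.encode(encoding)) // unit_bytes

def unit_counts(cp):
    """Return (UTF-8, UTF-16, UTF-32) code unit counts for code point cp."""
    ch = chr(cp)
    return (
        code_units(ch, "utf-8", 1),      # 8-bit code units
        code_units(ch, "utf-16-le", 2),  # 16-bit code units
        code_units(ch, "utf-32-le", 4),  # 32-bit code units
    )

# Sample code points: ASCII, Latin-1, BMP, and one beyond the BMP.
for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    print("U+%04X" % cp, unit_counts(cp))
```

Only UTF-32 gives one code unit per code point throughout; with UTF-8
or UTF-16 an index into code units is not an index into code points.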

-- 
Adam Olsen, aka Rhamphoryncus