[Python-3000] How will unicode get used?

Sun Sep 24 23:45:36 CEST 2006

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> Josiah Carlson schrieb:
> > For me, having recently remembered what was in a unicode string, and
> > verifying it by checking the source, the question in my mind is whether
> > we want to stick with the same 2-representation implementation (default
> > encoding and UTF-16 or UCS-4 depending on build), or go with more or
> > fewer representations.
> 
> I would personally like to see a Python API that operates on code
> points, with support for 17 planes. I also think that efficient indexing
> is important.

Full Unicode support, covering all 17 planes, would be nice.

> There are trade-offs, of course. I personally think the best trade-off
> would be to have a two-byte representation, along with a flag telling
> whether there are any surrogate pairs in the string. Indexing and
> length would be constant-time if there are no surrogates, and linear
> time if there are.
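
To make that trade-off concrete, here is a rough Python sketch of the
flag approach as I read it. The names FlaggedStr and encode_units are
made up for illustration, it only handles non-negative indexes, and it
is not a description of any existing implementation.

```python
from array import array


def encode_units(text):
    """Encode text as UTF-16 code units (unsigned 16-bit integers)."""
    units = array("H")
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Split a non-BMP code point into a high/low surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


class FlaggedStr:
    """Hypothetical two-byte string with a 'has surrogates' flag."""

    def __init__(self, text):
        self.units = encode_units(text)
        # The flag is computed once, when the string is created.
        self.has_surrogates = any(0xD800 <= u <= 0xDBFF for u in self.units)

    def __len__(self):
        if not self.has_surrogates:
            return len(self.units)  # constant time
        # Linear time: count every unit except low surrogates.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def __getitem__(self, index):
        if index < 0 or index >= len(self):
            raise IndexError("code point index out of range")
        if not self.has_surrogates:
            return chr(self.units[index])  # constant time
        # Linear time: walk code units, skipping two for each pair.
        pos = 0
        for _ in range(index):
            pos += 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
        unit = self.units[pos]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)
```

A string with no surrogate pairs takes the fast path for both len() and
indexing; a single non-BMP character anywhere pushes every lookup onto
the linear scan.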

What about a tree structure layered over the string, as I described in
another post?  If there are no surrogate pairs, the pointer to the tree
is null, so the string costs nothing extra.  If there are surrogate
pairs, we could either use the structure as I described, or modify it
to get better memory utilization and performance (choose tree nodes
based on where the surrogate pairs are, up to some limit).
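
As a rough illustration of indexing by where the surrogate pairs are,
here is a small Python sketch. It is a simplification, not the tree
from my other post: a sorted list searched with bisect stands in for
the tree, and the names build_pair_index and unit_offset are
hypothetical. It assumes the code units are well-formed UTF-16.

```python
from bisect import bisect_left


def build_pair_index(units):
    """Return the sorted code point indexes that hold surrogate pairs.

    Returns None when there are no pairs, mirroring a null tree pointer.
    """
    pairs = []
    cp_index = 0
    pos = 0
    while pos < len(units):
        if 0xD800 <= units[pos] <= 0xDBFF and pos + 1 < len(units):
            pairs.append(cp_index)
            pos += 2  # a pair occupies two code units
        else:
            pos += 1
        cp_index += 1
    return pairs or None


def unit_offset(pair_index, cp_index):
    """Map a code point index to a code unit offset.

    Each pair located before cp_index shifts the offset by one unit, so
    the lookup is a binary search over the pair positions.
    """
    if pair_index is None:
        return cp_index  # no surrogates: offsets and indexes coincide
    return cp_index + bisect_left(pair_index, cp_index)
```

The extra memory grows with the number of surrogate pairs rather than
with the length of the string, and a lookup costs O(log k) for k pairs
instead of a linear scan.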

 - Josiah