[Python-3000] How will unicode get used?

Wed Sep 20 17:50:25 CEST 2006

"Adam Olsen" <rhamph at gmail.com> wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface.  My thoughts
> so far:

I believe the only thing actually up for decision is what the internal
representation of a unicode object will be.  Utf-8 that is never changed? 
Utf-8 that is converted to ucs-2/4 on certain kinds of accesses? 
Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
depending on compiler switch?
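
To make the "depending on code point content" option concrete, here is a
rough sketch in plain Python.  The helper names (narrowest_width,
encode_fixed) are made up for illustration and are not an existing or
proposed API; a real implementation would live in C.

```python
def narrowest_width(text):
    """Bytes per code point needed to store `text` at a fixed width."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1  # every code point fits in latin-1
    if highest < 0x10000:
        return 2  # every code point fits in ucs-2
    return 4      # at least one code point needs ucs-4


def encode_fixed(text):
    """Return (width, buffer) with each code point stored in `width` bytes."""
    width = narrowest_width(text)
    # Little-endian byte order is an arbitrary choice for this sketch.
    buf = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, buf
```

The width is picked once, when the string is created, so every later
operation sees a constant number of bytes per code point.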

> * Most transformation and testing methods (.lower(), .islower(), etc)
> can be copied directly from 2.x.  They require no special
> implementation to perform reasonably.

A decoding variant of these would be required if the underlying
representation of a particular string is not latin-1, ucs-2, or ucs-4.

Further, any rstrip/split/etc. methods need to scan/parse the entire
string in order to discover code point starts/ends when using a utf-*
variant as an internal encoding (except for utf-32, which has a constant
width per character).
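
As an illustration of that scanning cost, here is a minimal sketch of
finding where the Nth code point starts in a utf-8 buffer.  The function
name utf8_offset is hypothetical, and the input is assumed to be valid
utf-8.

```python
def utf8_offset(data, index):
    """Byte offset at which code point number `index` starts in `data`.

    This is a linear scan: every byte before the answer is inspected.
    """
    seen = 0
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return pos
            seen += 1
    if seen == index:
        return len(data)  # one past the last code point, useful as a slice end
    raise IndexError(index)
```

With a fixed-width representation the same answer is a single
multiplication.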

Whether or not we choose to go with a varying internal representation 
(the latin-1/ucs-2/ucs-4 variant I have been suggesting), 

> * Indexing and slicing is the big issue.  Do we need constant-time
> integer slicing?  .find() could be changed to return a token that
> could be used as a constant-time offset.  Incrementing the token would
> have linear costs, but that's no big deal if the offsets are always
> small.

If by "constant-time integer slicing" you mean "find the start and end
memory offsets of a slice in constant time", I would say yes.
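
A sketch of what I mean, assuming a buffer that stores every code point
in the same number of bytes.  The names slice_offsets and fixed_slice
are illustrative only, and negative indices and bounds checks are left
out.

```python
def slice_offsets(start, stop, width):
    """Memory offsets of the slice [start:stop]; two multiplications, O(1)."""
    return start * width, stop * width


def fixed_slice(buf, width, start, stop):
    """Return the bytes for code points start..stop-1 of a fixed-width buffer."""
    lo, hi = slice_offsets(start, stop, width)
    return buf[lo:hi]
```

For example, with a ucs-4 style buffer:

```python
buf = "h\u00e9llo w\u00f6rld".encode("utf-32-le")  # 4 bytes per code point
piece = fixed_slice(buf, 4, 1, 5)
print(piece.decode("utf-32-le"))
```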

Generally, I think tokens (in unicode strings) are a waste of time and
implementation effort.  Giving each string a fixed width per character
allows methods on those unicode strings to be far simpler to implement.

> * Grapheme clusters, words, lines, other groupings, do we need/want
> ways to slice based on them too?

No.

> * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> want to support them?  Now would be the time.

This would imply a tree-based string, which Guido has specifically
stated would not happen.  Never mind that it would be a beast to
implement and maintain, or that it would rule out offering the
single-segment buffer interface without reprocessing.
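
To show where the reprocessing comes from, here is a toy sketch of a
tree-based string.  The Concat class is purely hypothetical and nothing
like it exists in CPython.

```python
class Concat:
    """Toy tree node: joining two pieces is O(1) because nothing is copied."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = len(left) + len(right)

    def __len__(self):
        return self.length

    def flatten(self):
        """Copy every leaf into one contiguous str; this walk is O(n)."""
        parts = []
        stack = [self]
        while stack:
            node = stack.pop()
            if isinstance(node, Concat):
                # Push right first so the left subtree is visited first.
                stack.append(node.right)
                stack.append(node.left)
            else:
                parts.append(node)
        return "".join(parts)
```

The pieces live in separate allocations, so anything that wants one
contiguous block of memory has to call something like flatten() first.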

 - Josiah