[Python-3000] characters data type

Wed May 3 08:47:26 CEST 2006

Guido van Rossum wrote:

> Note that UTF-8 would make the implementation of Python's typical
> string API painful; we currently assume (because it's true ;-) that
> random access to elements and slices (__getitem__ and __getslice__) is
> O(1). With UTF-8 these operations would be slow -- the simplest
> implementation would require counting characters from the start; one
> can speed this up with some kind of cache or tree but IMO the
> array-of-fixed-width-characters approach is much simpler. (I had a bad
> experience in my youth with strings implemented as trees, so I'm
> biased against complicated string implementations.

I still think it might be a good idea to (optionally) delay decoding of
strings until you're actually doing something that needs access to the
individual characters, though.  (UTF-8 to UTF-8 shuffling is an
increasingly common use case.)
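
A minimal sketch of what delayed decoding could look like, written in
present-day Python 3 syntax.  The class name LazyText and its methods are
hypothetical, made up for illustration; this is not a proposed API:

```python
class LazyText:
    """Hypothetical wrapper: keeps the UTF-8 bytes as received and only
    decodes them the first time individual characters are needed."""

    def __init__(self, utf8_bytes):
        self._raw = utf8_bytes
        self._decoded = None  # filled in on first character access

    def _chars(self):
        # Decode once, then reuse the fixed-width result.
        if self._decoded is None:
            self._decoded = self._raw.decode("utf-8")
        return self._decoded

    def __len__(self):
        return len(self._chars())

    def __getitem__(self, index):
        return self._chars()[index]

    def encode_utf8(self):
        # UTF-8 in, UTF-8 out: hand back the original bytes untouched.
        return self._raw

    @property
    def is_decoded(self):
        return self._decoded is not None
```

In this sketch, data that only passes through (read as UTF-8, written as
UTF-8) never pays for decoding; indexing and len() trigger it once.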

(frankly, I wouldn't rule out using an "internally polymorphic" representation
for the new str type, partially motivated by my experiences from cElementTree).
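
One possible reading of "internally polymorphic", sketched in present-day
Python 3: pick the narrowest fixed-width storage that fits the text, so
indexing stays O(1) whatever the representation.  The helper names pack and
char_at are hypothetical, and the three storage tiers are an assumption made
for this illustration:

```python
from array import array


def pack(text):
    """Store code points in the narrowest fixed-width container that fits.

    Hypothetical helper: bytes for code points below 0x100, otherwise an
    array of unsigned shorts or unsigned longs.
    """
    widest = max(map(ord, text), default=0)
    if widest < 0x100:
        return bytes(map(ord, text))      # one byte per character
    typecode = "H" if widest < 0x10000 else "L"
    return array(typecode, map(ord, text))


def char_at(storage, index):
    # Both bytes and array yield an integer code point on indexing,
    # so callers never see which representation is in use.
    return chr(storage[index])
```

Callers only go through char_at, so the storage choice stays an internal
detail.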

> This also explains why I'm no fan of the oft-proposed idea that slices
> should avoid making physical copies even if they make logical copies --
> the complexity of that approach horrifies me.)

that could also be an optional mechanism for advanced users, but I agree
that it needs a simple implementation.
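
As a rough illustration of slices that make logical but not physical copies,
here is a sketch in present-day Python 3.  The SliceView class is
hypothetical, handles only non-negative indices, and is not a proposed
design:

```python
class SliceView:
    """Hypothetical view: a slice that shares its base string and only
    records offsets; characters are copied when str() is called."""

    def __init__(self, base, start=0, stop=None):
        self._base = base
        self._start = start
        self._stop = len(base) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def slice(self, start, stop):
        # Clamp to this view's bounds, then translate to base offsets.
        start = max(0, min(start, len(self)))
        stop = max(start, min(stop, len(self)))
        return SliceView(self._base, self._start + start, self._start + stop)

    def __str__(self):
        # The only place where a physical copy is made.
        return self._base[self._start:self._stop]
```

One cost is visible even in this small sketch: a short view keeps the whole
base string alive for as long as the view exists.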

I think some experimentation is required here (and I hope to find some time
for that in the not too distant future).

</F>