[Python-3000] characters data type

Wed May 3 00:20:51 CEST 2006

Guido van Rossum <guido <at> python.org> writes:

> On 5/2/06, Talin <talin <at> acm.org> wrote:
> > It appears that my question has been misunderstood by everyone; I'll try
> > to phrase it better:
> >
> > The short version is: will there be a mutable character array type? (which
> > I am calling "characters"?)
> 
> There are no plans for this AFAIK.
> 
> > First, I do use array, not a lot but I do use it occasionally. One common
> > use case is equivalent to the Java StringBuffer class - that is, a means
> > for building up strings a character at a time, which otherwise would be
> > expensive to do with immutable strings.
> 
> Better ways to do this might be [c]StringIO (in theory -- I don't know
> if it's fast enough in practice, but this should be easy to test) or
> the standard "".join(<list of strings>) approach (which underlies
> StringIO's implementation as well -- though not cStringIO's IIRC).

As to whether or not flat arrays are a win over a stream-based interface,
I think it really depends on how well the array type supports incremental
growth. I'm mostly familiar with the behavior of STL arrays, which (for
small array sizes) doubles whenever the array runs out of room. The result
is that for an array of size N, built up 1 character at a time, you will
have at most log2(N) memory allocations, and less if you set the initial
capacity of the array to a reasonable guess.

> > Now, from the discussion of "bytes" I get the impression that it, too, is
> > a mutable type (someone said 'like list').
> 
> Correct. (You have no excuse to guess -- the implementation is in the
> p3yk (sic) branch.)
> 
> > So given that 'characters' (i.e.
> > unicode characters) are now distinct from 'bytes', it makes sense to me to
> > declare a mutable character array. And to me, the most natural name for
> > such a type is 'characters', although I suppose you could also call it
> > "stringbuffer" or something.
> 
> I'm not sure that we need this. I'm not 100% sure we don't either, but
> I believe that the existing approach works pretty well -- it's hard to
> beat the "".join(<list>) approach.

The reason that I asked about this was that my current 'toy' implementation
of the string formatting PEP (which I posted on this list earlier) uses
character arrays.

> > BTW, is the internal encoding of unicode strings UTF-8, UTF-16, UCS-2, or
> > UTF-32? Just wondering...
> 
> Again, please look at the implementation. It's an array of shortish
> ints; whether these are 2 or 3 or 4 bytes long depends on config
> constants. (Just kidding about the 3 bytes version.  I think that
> makes it UTF-16 or UTF-32. There is some support for surrogates when 2
> bytes are used so I believe that excludes UCS-2.
>
> Note that UTF-8 would make the implementation of Python's typical
> string API painful; we currently assume (because it's true  that
> random access to elements and slices (__getitem__ and __getslice__) is
> O(1). With UTF-8 these operations would be slow -- the simplest
> implementation would require counting characters from the start; one
> can speed this up with some kind of cache or tree but IMO the
> array-of-fixed-width-characters approach is much simpler. (I had a bad
> experience in my youth with strings implemented as trees, so I'm
> biased against complicated string implementations. This also explains
> why I'm no fan of the oft-proposed idea that slices should avoid
> making physical copies even if they make logical copies -- the
> complexity of that approach horrifies me.)

Although if you are supporting surrogates, then __getitem__ and __getslice__
won't be O(1), will they? That's why I asked about UCS-2, which is what I
use in my work -- which is another way of saying that we've punted on the
issue of surrogates.

-- Talin