[Python-3000] characters data type

Wed May 3 01:42:52 CEST 2006

On 5/2/06, Talin <talin at acm.org> wrote:
> As to whether or not flat arrays are a win over a stream-based interface,
> I think it really depends on how well the array type supports incremental
> growth. I'm mostly familiar with the behavior of STL arrays, which (for
> small array sizes) doubles whenever the array runs out of room. The result
> is that for an array of size N, built up 1 character at a time, you will
> have at most log2(N) memory allocations, and less if you set the initial
> capacity of the array to a reasonable guess.

You should look at list.append in 2.5. It uses a similar O(log N)
approach but doesn't waste up to 50% space for large lists. My plans
would be do something similar for the bytes += implementation.

> The reason that I asked about this was that my current 'toy' implementation
> of the string formatting PEP (which I posted on this list earlier) uses
> character arrays.

Yeah, I was curious why you bothered with making a toy implementation
efficient. :-) You ought to assume that eventually this will be made
efficient; the focus now should be on functionality.

> Although if you are supporting surrogates, then __getitem__ and __getslice__
> won't be O(1), will they? That's why I asked about UCS-2, which is what I
> use in my work -- which is another way of saying that we've punted on the
> issue of surrogates.
There are certain streamwise things (like conversion to/from UTF-8)
that support surrogates, so I would hesitate to call it UCS-2. But
there is no attempt to hide surrogates from __getitem__ or
__getslice__. If you access a code point that's a surrogate, you get a
surrogate. So it's O(1).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)