String performance regression from Python 3.2 to 3.3

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Mar 16 04:47:06 EDT 2013


On Fri, 15 Mar 2013 21:26:28 -0700, rusi wrote:

> The unicode standard is language-agnostic. Unicode implementations exist
> within a language x implementation x C-compiler implementation x …  --
> Notice the gccs in Andriy's comparison. Do they signify?

They should not. Ideally, the behaviour of Python should be identical 
regardless of the compiler used to build the Python interpreter.

In practice, this is not necessarily the case. One compiler might 
generate more efficient code than another. But aside from *performance*, 
the semantics of what Python does should be identical, except where noted 
as "implementation dependent".
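
If you want to record which compiler built the interpreter behind a given 
benchmark result, the standard library's platform module reports it. A 
minimal sketch; build_info is a name made up for this example, not part of 
any benchmark from this thread:

```python
import platform


def build_info():
    """Return (implementation, version, compiler) for the running interpreter.

    build_info is a hypothetical helper for illustration only.
    """
    return (
        platform.python_implementation(),  # e.g. "CPython", "Jython", "IronPython"
        platform.python_version(),         # e.g. "3.3.0"
        platform.python_compiler(),        # a string naming the compiler used
    )


if __name__ == "__main__":
    print("%s %s built with %s" % build_info())
```

Printing this next to timing figures makes it easier to tell whether two 
sets of numbers differ in Python version, in compiler, or in both.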


> The number of actual python implementations is small -- 2.7, 3.1, 3.2,
> 3.3 -- at most enlarged with wides and narrows; The number of possible
> implementations is large (in principle infinite)

IronPython and Jython will, if I understand correctly, inherit their 
string implementations from .NET and Java.


> -- a small example of a point in design-space that is not explored: eg
> 
> There are 17 planes x 2^16 chars in a plane < 32 x 2^16
> = 2^5 x 2^16
> = 2^21
> 
> ie wide unicode (including the astral planes) can fit into 21 bits ie 3
> wide-chars can fit into a 64-bit slot rather than 2. Is this option worth
> considering? I've no idea and I would wager that no one does until some
> trials are done
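
The arithmetic in the quoted proposal can be checked with a short sketch. 
The names pack3 and unpack3 are invented for this example, and it only 
demonstrates that three code points fit in 64 bits; it says nothing about 
how fast such a layout would be:

```python
MAX_CODE_POINT = 0x10FFFF       # 17 planes x 2**16 code points, minus one
BITS = 21                       # 2**21 > 17 * 2**16, so 21 bits suffice
MASK = (1 << BITS) - 1


def pack3(a, b, c):
    """Pack three code points into one integer of at most 63 bits."""
    for cp in (a, b, c):
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError("not a Unicode code point: %r" % (cp,))
    return a | (b << BITS) | (c << (2 * BITS))


def unpack3(word):
    """Recover the three code points stored by pack3."""
    return (word & MASK, (word >> BITS) & MASK, (word >> (2 * BITS)) & MASK)


if __name__ == "__main__":
    word = pack3(ord("a"), 0x20AC, 0x1F600)
    print(word.bit_length(), [hex(cp) for cp in unpack3(word)])
```

Note that every lookup needs a shift and a mask on top of the memory read, 
which is extra work compared with reading one fixed-width slot per 
character.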

As I understand it, modern CPUs and memory chips are optimized for 
dealing with one of two things:

- single bytes;

- power-of-two numbers of bytes, e.g. 16 bits, 32 bits, 64 bits, ...

but not other sizes, e.g. 24 bits, 40 bits, 72 bits, ...

So while you might save memory by using "UTF-24" instead of UTF-32, it 
would probably be slower because you would have to grab three bytes at a 
time instead of four, and the hardware probably does not directly support 
that.
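
To illustrate the difference in access pattern, here is a rough sketch 
comparing UTF-32 with the hypothetical "UTF-24". All four function names 
are invented for the example. The struct module has a format code for 
4-byte integers but none for 3-byte ones, so the 3-byte case has to slice 
out the bytes and assemble the integer by hand. Timing this in pure Python 
would not tell you much about the hardware; it only shows the shape of the 
problem:

```python
import struct


def encode_utf32(text):
    """Encode text as little-endian UTF-32, four bytes per code point."""
    return b"".join(struct.pack("<I", ord(ch)) for ch in text)


def encode_utf24(text):
    """Encode text as a hypothetical 'UTF-24', three bytes per code point."""
    return b"".join(ord(ch).to_bytes(3, "little") for ch in text)


def char_at_utf32(buf, i):
    """Read character i with a single 4-byte unpack."""
    (cp,) = struct.unpack_from("<I", buf, 4 * i)
    return chr(cp)


def char_at_utf24(buf, i):
    """Read character i by slicing three bytes and assembling them."""
    return chr(int.from_bytes(buf[3 * i:3 * i + 3], "little"))


if __name__ == "__main__":
    text = "a\u20ac\U0001F600"
    print(len(encode_utf32(text)), len(encode_utf24(text)))
```

For this three-character string the 3-byte form takes 9 bytes instead of 
12, which is the memory saving being weighed against the more awkward 
access.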



-- 
Steven


