Building CPython

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri May 15 08:10:15 EDT 2015


On Fri, 15 May 2015 08:52 pm, Marko Rauhamaa wrote:

> wxjmfauth at gmail.com:
> 
>> Le vendredi 15 mai 2015 11:20:25 UTC+2, Marko Rauhamaa a écrit :
>>> wxjmfauth at gmail.com:
>>> 
>>> > Implement unicode correctly.
>>> Did they reject your patch?
>>
>> You can not patch something that is wrong by design.
> 
> Are you saying the Python language spec is unfixable or that the CPython
> implementation is unfixable?

JMF is obsessed with a trivial and artificial performance regression in the
handling of Unicode strings since Python 3.3, which introduced a
significant memory optimization for Unicode strings. Each individual string
uses a code unit no larger than necessary. If a string contains nothing but
ASCII or Latin-1 characters, it uses one byte per character; if it fits
into the Basic Multilingual Plane, two bytes per character; and it uses
four bytes per character only if there are "astral" characters in the
string.

(That is, Python strings select among Latin-1, UCS-2 and UTF-32 encoded
forms at creation time, according to the largest code point in the string.)
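
A rough way to see this for yourself is the sketch below. It assumes
CPython 3.3 or later, and bytes_per_char is just a helper name made up for
this illustration. It compares sys.getsizeof() for two strings of different
lengths so that the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the storage cost per character of a string made only of ch."""
    # The difference between two sizes cancels the fixed per-object header,
    # leaving only the cost of the n extra characters.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# Illustrative sample characters, one per storage width.
samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),         # e with acute accent
    ("BMP", "\u20ac"),           # euro sign
    ("astral", "\U0001F600"),    # an emoji outside the BMP
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ this should report 1, 1, 2 and 4 bytes per character
respectively. Other implementations are free to store strings differently,
so treat the numbers as a CPython detail rather than a language guarantee.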

The benefit of this is that most strings use 1/2 or 1/4 of the memory
that they would need at a fixed four bytes per character, which gives an
impressive memory saving. That leads to demonstrable speed-ups in
real-world code. However, it is possible to find artificial benchmarks that
experience a slowdown compared to Python 3.2.

JMF found one such artificial benchmark, which involves creating and
throwing away many strings as fast as possible without doing any work with
them. From this he has built a fantasy in his head that Python is not
compliant with the Unicode spec and is logically, mathematically broken.


-- 
Steven



