[Python-Dev] PEP 393: Flexible String Representation

Thu Jan 27 06:50:30 CET 2011

On Mon, Jan 24, 2011 at 3:20 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Tuesday, 25 January 2011 at 00:07 +0100, "Martin v. Löwis" wrote:
>> >> I'd like to propose PEP 393, which takes a different approach,
>> >> addressing both problems simultaneously: by getting a flexible
>> >> representation (one that can be either 1, 2, or 4 bytes), we can
>> >> support the full range of Unicode on all systems, but still use
>> >> only one byte per character for strings that are pure ASCII (which
>> >> will be the majority of strings for the majority of users).
>> >
>> > For this kind of experiment, I think a concrete attempt at implementing
>> > (together with performance/memory savings numbers) would be much more
>> > useful than an abstract proposal.
>>
>> I partially agree. An implementation is certainly needed, but there is
>> nothing wrong (IMO) with designing the change before implementing it.
>> Also, several people have offered to help with the implementation, so
>> we need to agree on a specification first (which is actually cheaper
>> than starting with the implementation only to find out that people
>> misunderstood each other).
>
> I'm not sure it's really cheaper. When implementing you will probably
> find out that it makes more sense to change the meaning of some fields,
> add or remove some, etc. You will also want to try various tweaks since
> the whole point is to lighten the footprint of unicode strings in common
> workloads.

Yep.  This is only a proposal; an implementation will allow all of
that to be experimented with.

I frequently see code today, even in Python 2.x, that suffers
greatly from unicode vs. string use (due to APIs in some code
returning unicode objects unnecessarily when the data is really all
ASCII text).  Python 3.x only increases this, as the default for so
many things passes through unicode even for programs that may not need
it.

>
> So, the only criticism I have, intuitively, is that the unicode
> structure seems to become a bit too large. For example, I'm not sure you
> need a generic (pointer, size) pair in addition to the
> representation-specific ones.

I believe the intent of this PEP is for the existing in-memory
structure to remain compatible with already compiled binary
extension modules, without having to recompile them or change the APIs
they are using.

Personally I don't care at all about preserving that level of binary
compatibility.  It has been convenient in the past but is rarely the
right thing to do.  Of course I'd personally like to see PyObject
nuked and revisited; it is too large and is probably not cache line
efficient.

>
> Incidentally, to slightly reduce the overhead of the unicode objects,
> there's this proposal: http://bugs.python.org/issue1943

Interesting.  But that aims more at CPU performance than memory
overhead.  What I see is programs that predominantly process ASCII
data yet waste memory on a 2-4x data explosion, because the internal
representation spends 2 or 4 bytes on every character.  This PEP aims
to address that larger target.
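
To put a number on that explosion, here is a rough sketch of the
arithmetic.  It is an illustration only, not code from the PEP: the
helper names bytes_per_char and payload_sizes are made up, the 0xFF and
0xFFFF thresholds are my assumption about where a 1/2/4-byte layout
would switch widths, and it counts character data only, ignoring all
object header overhead.

```python
def bytes_per_char(text):
    """Smallest of 1, 2, or 4 bytes that can hold every code point in text.

    Hypothetical helper: the 0xFF / 0xFFFF cut-offs are an assumption,
    not something the proposal quoted in this thread specifies.
    """
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def payload_sizes(text):
    """Character-data sizes in bytes for three storage strategies."""
    return {
        # One width chosen per string, as the flexible layout proposes.
        "flexible": len(text) * bytes_per_char(text),
        # Fixed 2-byte code units (surrogate pairs for astral characters).
        "fixed_2_byte": len(text.encode("utf-16-le")),
        # Fixed 4-byte code units.
        "fixed_4_byte": len(text.encode("utf-32-le")),
    }


sizes = payload_sizes("GET /index.html HTTP/1.1")
print(sizes)
```

For a 24-character all-ASCII string like that request line, the fixed
2-byte and 4-byte layouts need 48 and 96 bytes of character data where
24 would do.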

-gps