[Python-Dev] PEP 393: Flexible String Representation

Mon Jan 24 23:12:33 CET 2011

On Mon, 24 Jan 2011 21:17:34 +0100
"Martin v. Löwis" <martin at v.loewis.de> wrote:
> I have been thinking about Unicode representation for some time now.
> This was triggered, on the one hand, by discussions with Glyph Lefkowitz
> (who complained that his server app consumes too much memory), and Carl
> Friedrich Bolz (who profiled Python applications to determine that
> Unicode strings are among the top consumers of memory in Python).
> On the other hand, this was triggered by the discussion on supporting
> surrogates in the library better.
> 
> I'd like to propose PEP 393, which takes a different approach,
> addressing both problems simultaneously: by getting a flexible
> representation (one that can be either 1, 2, or 4 bytes), we can
> support the full range of Unicode on all systems, but still use
> only one byte per character for strings that are pure ASCII (which
> will be the majority of strings for the majority of users).

For this kind of experiment, I think a concrete implementation attempt
(together with performance and memory-savings numbers) would be much
more useful than an abstract proposal. It is hard to judge the concrete
effects of the changes you are proposing, even though they might (or
might not) make sense in theory. For example, you are adding a lot of
constant overhead to every unicode object, even very small ones, which
might be detrimental. Also, accessing the unicode object's payload
can become quite a bit more cumbersome. Only an implementation can tell
how workable this is in practice.
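
To illustrate the kind of trade-off I mean, here is a rough
back-of-the-envelope sketch in Python. The function names, the 16-byte
"extra header" figure and the one-byte threshold are assumptions made up
for illustration; none of them are taken from the PEP or from any
measurement.

```python
def char_width(s, one_byte_limit=0x7F):
    """Bytes per character needed to hold every code point in s.

    one_byte_limit is an assumption: 0x7F corresponds to the
    "pure ASCII" case mentioned in the proposal.
    """
    highest = max(map(ord, s), default=0)
    if highest <= one_byte_limit:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def flexible_size(s, extra_header=16):
    """Estimated cost of a flexible string: payload plus added fields.

    extra_header is a hypothetical constant for the additional
    per-object bookkeeping, not a measured value.
    """
    return extra_header + char_width(s) * len(s)


def fixed_size(s, width=4):
    """Payload of a fixed-width representation (4 bytes per character)."""
    return width * len(s)


for sample in ("ab", "a" * 100):
    print(len(sample), flexible_size(sample), fixed_size(sample))
```

Under these made-up numbers the two-character string comes out larger
in the flexible scheme while the 100-character one comes out much
smaller, which is why I would like to see real figures.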

Regards

Antoine.