[Python-Dev] thoughts on the bytes/string discussion

Sat Jun 26 19:24:50 CEST 2010

Greg Ewing writes:

 > Would there be any sanity in having an option to compile
 > Python with UTF-8 as the internal string representation?

Losing Py_UNICODE as mentioned by Stefan Behnel (IIRC) is just the
beginning of the pain.

If Emacs's experience is any guide, the cost in speed and complexity
of a variable-width internal representation is high.  There are a
number of tricks you can use, but the natural implementation of most
operations (such as indexing by character) becomes O(n).  You can get
around that with a position cache, of course, but that adds
complexity and really cuts into the space saving (and worse, adds
another chunk of memory that may or may not be paged in when you need
it).
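
To make the trade-off concrete, here is a minimal sketch in Python of
character indexing over a UTF-8 byte string, first by a linear scan and
then with a position cache.  All names here (_seq_len, utf8_index_scan,
CachedUTF8, stride) are hypothetical and made up for illustration; this
is not how Emacs or CPython does it, and it assumes the input is valid
UTF-8.

```python
def _seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)


def utf8_index_scan(buf: bytes, i: int) -> str:
    """Return character i of `buf` by walking from the start: O(n)."""
    if i < 0:
        raise IndexError(i)
    pos = 0
    count = 0
    while pos < len(buf):
        width = _seq_len(buf[pos])
        if count == i:
            return buf[pos:pos + width].decode("utf-8")
        pos += width
        count += 1
    raise IndexError(i)


class CachedUTF8:
    """UTF-8 buffer plus a cache of the byte offset of every `stride`-th char.

    Hypothetical illustration only.  The cache is extra memory kept next
    to the text, which is the space cost mentioned above.
    """

    def __init__(self, text: str, stride: int = 64):
        self.buf = text.encode("utf-8")
        self.stride = stride
        self.offsets = []          # offsets[k] = byte offset of char k * stride
        pos = 0
        count = 0
        while pos < len(self.buf):
            if count % stride == 0:
                self.offsets.append(pos)
            pos += _seq_len(self.buf[pos])
            count += 1
        self.length = count

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, i: int) -> str:
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Jump to the nearest cached position, then scan at most stride - 1 chars.
        pos = self.offsets[i // self.stride]
        for _ in range(i % self.stride):
            pos += _seq_len(self.buf[pos])
        return self.buf[pos:pos + _seq_len(self.buf[pos])].decode("utf-8")
```

A smaller stride makes lookups cheaper but the cache bigger; a larger
one saves space but moves indexing back toward the linear scan.  Any
edit to the text would also have to update the cached offsets, which
this sketch does not attempt.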

What we're considering is a system where each buffer stores fixed-width
characters of 1, 2, or 4 octets, with automatic translation between
widths depending on content.
But the buffer is the primary random-access structure in Emacsen, so
optimizing it is probably worth our effort.  I doubt it would be worth
it for Python, but my intuitions here are not reliable.
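
For what it's worth, the idea of picking the character width from the
content can be sketched in a few lines of Python.  This is only an
illustration under my own assumptions: the names (narrowest_width,
WidthBuffer) are hypothetical, the little-endian layout is arbitrary,
and the real design under consideration may differ in every detail.

```python
def narrowest_width(text: str) -> int:
    """Smallest of 1, 2, 4 octets that can hold every code point in `text`."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def _encode(text: str, width: int) -> bytes:
    """Store each code point in exactly `width` octets (little-endian)."""
    return b"".join(ord(c).to_bytes(width, "little") for c in text)


class WidthBuffer:
    """Fixed-width buffer that widens itself when the content requires it."""

    def __init__(self, text: str = ""):
        self.width = narrowest_width(text)
        self.data = bytearray(_encode(text, self.width))

    def __len__(self) -> int:
        return len(self.data) // self.width

    def __getitem__(self, i: int) -> str:
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.width          # O(1): no scanning needed
        return chr(int.from_bytes(self.data[start:start + self.width], "little"))

    def __str__(self) -> str:
        return "".join(self[i] for i in range(len(self)))

    def insert(self, i: int, text: str) -> None:
        if not 0 <= i <= len(self):
            raise IndexError(i)
        need = narrowest_width(text)
        if need > self.width:
            # "Automatic translation": re-encode the whole buffer, O(n).
            self.data = bytearray(_encode(str(self), need))
            self.width = need
        start = i * self.width
        self.data[start:start] = _encode(text, self.width)
```

Indexing stays O(1) at every width; the price is that a single wide
character forces the whole buffer to be re-encoded, and that the buffer
then uses up to four octets for every character it holds.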