[Python-3000] string C API

Wed Sep 13 19:09:27 CEST 2006

On 9/13/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > Should encoding be an attribute of the string?

> No. A Python string is a sequence of Unicode characters.
> Even if it was created by converting from some other encoding,
> that original encoding gets lost when doing the conversion
> (just like integers don't remember which base they were originally
> represented in).

Theoretically, it is a sequence of code points.
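
A small sketch of that point, written in present-day Python 3 syntax
rather than the 2.x under discussion (the sample bytes are just an
illustration): two byte strings in different encodings decode to the
same code points, and the result keeps no record of its source.

```python
# Same text, arriving in two different source encodings.
from_latin1 = b"caf\xe9".decode("latin-1")
from_utf8 = b"caf\xc3\xa9".decode("utf-8")

# After decoding, both are the same sequence of code points.
code_points = [ord(ch) for ch in from_latin1]
same_text = from_latin1 == from_utf8

# The integer analogy from the quoted message: the base is not remembered.
same_int = int("ff", 16) == int("255", 10)
```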

Today, in Python 2.x, these are always represented by a specific
(wide, fixed-width) concrete encoding, UCS-2 or UCS-4, chosen at
compile time.  This is required so long as outside code can access
the data buffer directly.

It would no longer be required if all access were through unicode
methods.  (And it would probably make sense to have a
"get-me-the-buffer-in-this-encoding" method.)

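Roughly what such a method might look like, as a sketch only: the
class OpaqueText and the method name buffer_as are hypothetical, not
an existing or proposed API, and the tuple of ints is just a stand-in
for whatever internal layout an implementation would pick.

```python
class OpaqueText:
    """Hypothetical string type whose storage is private."""

    def __init__(self, code_points):
        # Internal layout is an implementation detail; here, a tuple of ints.
        self._code_points = tuple(code_points)

    def __len__(self):
        return len(self._code_points)

    def __getitem__(self, index):
        # Integer indexing only, to keep the sketch short.
        return chr(self._code_points[index])

    def buffer_as(self, encoding):
        # Hypothetical "get-me-the-buffer-in-this-encoding" method:
        # hands back a fresh bytes copy, never the internal store.
        return "".join(map(chr, self._code_points)).encode(encoding)


text = OpaqueText([0x68, 0x69, 0x20AC])  # "hi" plus the euro sign
utf8_buffer = text.buffer_as("utf-8")
utf16_buffer = text.buffer_as("utf-16-le")
```

Since callers only ever see copies, the internal representation could
change without breaking them.
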
Several people seem to want more efficient representations when possible.

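Several people seem to want UTF-8, which makes sense if the rest of
the system is UTF-8, but complicates the implementation (in a
variable-width encoding, finding the Nth character is no longer a
simple offset calculation).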

Simply not encoding/decoding until required would save quite a bit of
time and space -- but then the object would need some way of
indicating which encoding it is in.
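
A minimal sketch of that idea, assuming a hypothetical LazyText class
(none of these names come from an actual proposal): the object keeps
the original bytes plus an encoding tag, and decodes only when some
operation actually needs code points.

```python
import codecs


class LazyText:
    """Hypothetical string that keeps its original bytes until needed."""

    def __init__(self, raw, encoding):
        self._raw = raw
        # The encoding tag; normalized so that aliases compare equal.
        self._encoding = codecs.lookup(encoding).name
        self._decoded = None  # filled in on first use

    def _text(self):
        if self._decoded is None:
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

    def __len__(self):
        # Counting code points forces a decode.
        return len(self._text())

    def buffer_as(self, encoding):
        if codecs.lookup(encoding).name == self._encoding:
            return self._raw  # no decode/encode round trip
        return self._text().encode(encoding)

    @property
    def is_decoded(self):
        return self._decoded is not None


lazy = LazyText(b"caf\xc3\xa9", "UTF8")
passthrough = lazy.buffer_as("utf-8")
decoded_after_passthrough = lazy.is_decoded
length = len(lazy)
decoded_after_len = lazy.is_decoded
```

Note the trade-off: the pass-through path never validates the bytes,
so malformed input would only be noticed when something finally
forces a decode.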

-jJ