[Python-3000] How will unicode get used?

Jim Jewett jimjjewett at gmail.com
Wed Sep 20 22:59:22 CEST 2006


On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:

> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences).

...

> I thought we were discussing the Python API.

I don't think anyone has proposed much change to strings *as seen from
Python*.

At most, there has been an implicit suggestion that the
bytes.decode().encode() dance be shortened.

> C code will likely have the same access to unicode objects as it has in 2.x.

Can C code still assume that the data buffer will always be available

     (1)  for any sort of direct manipulation (including mutation),

     (2)  in a specific canonical encoding, and

     (3)  directly from the memory layout, without calling a "prepare"
or "recode" or "encode" method first?

Today, that canonical encoding is a compile-time choice, and any
specific choice causes integration hassles.

Unless the choice matches the system default for text, it also
requires many decode/encode round trips that might otherwise be
avoided.

The proposed changes mostly boil down to removing the third
assumption, and agreeing that some implementations might delay the
decode-to-canonical-format until it was needed.
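
To make the removal of assumption (3) concrete, here is a minimal sketch
in plain C. It is not CPython code: LazyStr, lazystr_init,
lazystr_as_ucs4 and lazystr_clear are hypothetical names made up for
illustration, and the choice of Latin-1 input with a UCS-4 canonical
form is an assumption picked only because that decode is trivial. The
object keeps the bytes it was created from and builds the canonical
buffer the first time a caller asks for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical string object (not a CPython type): it stores the
 * original bytes and materializes the canonical buffer on demand. */
typedef struct {
    const unsigned char *raw;  /* original bytes, assumed to be Latin-1 */
    size_t length;             /* character count (== byte count here) */
    uint32_t *canonical;       /* UCS-4 buffer, NULL until requested */
    int decode_count;          /* number of decodes actually performed */
} LazyStr;

static void lazystr_init(LazyStr *s, const unsigned char *raw, size_t length)
{
    assert(s != NULL && raw != NULL);
    s->raw = raw;
    s->length = length;
    s->canonical = NULL;
    s->decode_count = 0;
}

/* The "prepare" step: callers go through this function instead of
 * reading a struct field.  Returns NULL if allocation fails. */
static const uint32_t *lazystr_as_ucs4(LazyStr *s)
{
    assert(s != NULL);
    if (s->canonical == NULL) {
        uint32_t *buf = malloc((s->length + 1) * sizeof *buf);
        if (buf == NULL)
            return NULL;
        for (size_t i = 0; i < s->length; i++)
            buf[i] = s->raw[i];  /* Latin-1 bytes map 1:1 to code points */
        buf[s->length] = 0;      /* terminating zero code point */
        s->canonical = buf;
        s->decode_count++;
    }
    return s->canonical;
}

/* Release the cached canonical buffer, if any. */
static void lazystr_clear(LazyStr *s)
{
    assert(s != NULL);
    free(s->canonical);
    s->canonical = NULL;
}
```

A caller that never asks for the canonical form never pays for the
decode, and a caller that asks twice gets the same cached buffer back.
Code that reaches into the struct for the canonical field directly
would see NULL, which is exactly why assumption (3) has to go.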


Rough Summary of new C API restrictions:

Replace
    ((PyStringObject *)string)->ob_sval   /* supported today */
with
    PyString_AsString(string)                 /* already recommended */

or replace
    ((PyUnicodeObject *)string)->str       /* supported today */
and
    ((PyUnicodeObject *)string)->defenc    /* supported today */

with
    PyUnicode_AsEncodedString(PyObject *unicode,   /* already recommended */
                              const char *encoding,
                              const char *errors)
and
    PyUnicode_AsAnyString(PyObject *unicode,      /* new */
                          char **encoding,   /* return the actual encoding */
                          const char *errors)
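
One possible reading of the proposed PyUnicode_AsAnyString, sketched in
plain C rather than against the real object layout: hand back whatever
bytes the object already holds and report which encoding they are in,
so the caller recodes only when it has to. TextObj,
text_as_any_string and text_needs_recode are hypothetical names, the
extra size out-parameter is my own addition, and error handling (the
"errors" argument) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a text object that remembers the encoding
 * its bytes are stored in. */
typedef struct {
    const char *data;      /* bytes exactly as stored */
    size_t size;           /* number of bytes in data */
    const char *encoding;  /* name of the encoding data is in */
} TextObj;

/* Return the stored bytes without recoding, and report their actual
 * encoding and size through the out-parameters. */
static const char *text_as_any_string(const TextObj *t,
                                      const char **encoding,
                                      size_t *size)
{
    assert(t != NULL && encoding != NULL && size != NULL);
    *encoding = t->encoding;
    *size = t->size;
    return t->data;
}

/* Caller side: a recode is needed only when the stored encoding
 * differs from the one the consumer wants (exact name match). */
static int text_needs_recode(const TextObj *t, const char *wanted)
{
    const char *actual;
    size_t size;
    (void)text_as_any_string(t, &actual, &size);
    return strcmp(actual, wanted) != 0;
}
```

When the stored encoding already matches the system default for text,
the round trip disappears. The comparison here is a plain strcmp, so
aliases such as "utf8" and "utf-8" would count as different; a real
version would need to normalize encoding names first.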

Also note that some macros would need to become functions.  The most
prominent is

    PyUnicode_AS_DATA(string)         /* supports mutation */

-jJ

