[Python-3000] Unicode and OS strings

Wed Sep 19 00:23:18 CEST 2007

On 9/18/07, Guido van Rossum <guido at python.org> wrote:
> On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> > > There's no UTF-8 in Python's internal string encoding.

> > (At least as of a few days ago)

> > In Python 3 there is; strings are unicode.  A PyUnicodeObject object
> > has two encodings that you can grab from a pointer (which means
> > they have to be there; you don't have time to generate them like
> > you would with a function pointer).

> Incorrect. The pointer can be NULL.

I had missed that comment, but I do see it now; thank you.

> The API for getting the UTF-8 encoding is a function

Thank you.  But given that defenc is now always UTF-8, won't exposing
it in the public typedef then just be an attractive nuisance?
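
To illustrate the concern, here is a minimal sketch of a lazily filled
cache field. All names (LazyStr, lazy_get_encoded, lazy_clear) are
hypothetical, and a plain byte copy stands in for real UTF-8 encoding;
this is not the CPython implementation. The field may be NULL until the
accessor has been called, so code that reads it directly can break:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object with a cached encoding. */
typedef struct {
    const char *text;   /* stand-in for the raw character buffer */
    char *defenc;       /* cached encoded copy, or NULL if not built yet */
} LazyStr;

/* Accessor: builds the cached copy on first use, then reuses it.
 * Returns NULL only if allocation fails. */
const char *lazy_get_encoded(LazyStr *s)
{
    if (s->defenc == NULL) {
        size_t n = strlen(s->text) + 1;
        s->defenc = malloc(n);
        if (s->defenc == NULL)
            return NULL;
        /* Placeholder for a real encoder: copy the bytes unchanged. */
        memcpy(s->defenc, s->text, n);
    }
    return s->defenc;
}

/* Drops the cached copy; the field goes back to NULL. */
void lazy_clear(LazyStr *s)
{
    free(s->defenc);
    s->defenc = NULL;
}
```

A caller that reads s.defenc before calling lazy_get_encoded() sees
NULL; a caller that goes through the function never does.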

> (moreover a function whose name starts with _Py).

That I still don't see.

http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup

PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
    PyObject *unicode           /* Unicode object */
    );

PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8(
    const Py_UNICODE *data,     /* Unicode char buffer */
    Py_ssize_t length,          /* number of Py_UNICODE chars to encode */
    const char *errors          /* error handling */
    );

Later, the same file shows me:

/* --- Unicode Type ------------------------------------------------------- */

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;            /* Raw Unicode buffer */
    long hash;                  /* Hash value; -1 if not set */
    int state;                  /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are *not* counted in ob_refcnt. */
    PyObject *defenc;           /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
} PyUnicodeObject;

I would be happier with:

typedef struct {
    PyObject_VAR_HEAD		/* Length in code points, not chars */
} PyUnicodeObject;

And, in unicodeobject.c (*not* in a public header):

typedef struct {
    PyUnicodeObject ob_unicodehead;
    Py_UNICODE *str;            /* Raw Unicode buffer */
    long hash;                  /* Hash value; -1 if not set */
    int state;                  /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are *not* counted in ob_refcnt. */
    PyObject *defenc;           /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
} _PyDefaultUnicodeObject;

This would allow third parties to create implementations specialized
for (and saving space on) smaller alphabets, without breaking C
extensions that stick to the public header files.  (Moving hash or
even state to the public header might be OK too, but they seemed to
get ignored for subclasses anyhow.)
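
A minimal, self-contained sketch of that public-head / private-tail
split, in plain C without Python.h. Every name here (DemoString,
DemoType, demo_new_wide, demo_new_narrow, and so on) is hypothetical,
and the small function table is only an assumed stand-in for the
dispatch that ob_type would provide in CPython:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct DemoString;

/* Stand-in for the type object: per-implementation operations. */
typedef struct DemoType {
    unsigned long (*char_at)(const struct DemoString *s, size_t i);
    void (*destroy)(struct DemoString *s);
} DemoType;

/* "Public header": all that extensions are allowed to rely on. */
typedef struct DemoString {
    const DemoType *type;
    size_t length;              /* length in code points */
} DemoString;

size_t demo_length(const DemoString *s) { return s->length; }

unsigned long demo_char_at(const DemoString *s, size_t i)
{
    assert(i < s->length);
    return s->type->char_at(s, i);
}

void demo_free(DemoString *s) { s->type->destroy(s); }

/* "Private" layout 1: one wide unit per code point. */
typedef struct {
    DemoString head;            /* must be first so the casts are valid */
    unsigned long *str;
} DemoWide;

static unsigned long wide_char_at(const DemoString *s, size_t i)
{
    return ((const DemoWide *)s)->str[i];
}

static void wide_destroy(DemoString *s)
{
    free(((DemoWide *)s)->str);
    free(s);
}

static const DemoType wide_type = { wide_char_at, wide_destroy };

DemoString *demo_new_wide(const unsigned long *data, size_t n)
{
    DemoWide *w = malloc(sizeof *w);
    if (w == NULL)
        return NULL;
    w->str = malloc((n ? n : 1) * sizeof *w->str);
    if (w->str == NULL) {
        free(w);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        w->str[i] = data[i];
    w->head.type = &wide_type;
    w->head.length = n;
    return &w->head;
}

/* "Private" layout 2: one byte per code point, for a small alphabet. */
typedef struct {
    DemoString head;
    unsigned char *str;
} DemoNarrow;

static unsigned long narrow_char_at(const DemoString *s, size_t i)
{
    return ((const DemoNarrow *)s)->str[i];
}

static void narrow_destroy(DemoString *s)
{
    free(((DemoNarrow *)s)->str);
    free(s);
}

static const DemoType narrow_type = { narrow_char_at, narrow_destroy };

DemoString *demo_new_narrow(const char *bytes, size_t n)
{
    DemoNarrow *w = malloc(sizeof *w);
    if (w == NULL)
        return NULL;
    w->str = malloc(n ? n : 1);
    if (w->str == NULL) {
        free(w);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        w->str[i] = (unsigned char)bytes[i];
    w->head.type = &narrow_type;
    w->head.length = n;
    return &w->head;
}
```

Code that only calls demo_length() and demo_char_at() works unchanged
with either layout, which is the property the proposed split is meant
to preserve for extensions.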

-jJ