[Python-Dev] PyUnicodeObject / PyASCIIObject questions

Jim Jewett jimjjewett at gmail.com
Tue Dec 13 08:09:02 CET 2011


(see http://www.python.org/dev/peps/pep-0393/ and
http://hg.python.org/cpython/file/6f097ff9ac04/Include/unicodeobject.h
)


	typedef struct {
	  PyObject_HEAD
	  Py_ssize_t length;
	  Py_hash_t hash;
	  struct {
		  unsigned int interned:2;
		  unsigned int kind:2;   /* now 3 in implementation */
		  unsigned int compact:1;
		  unsigned int ascii:1;
		  unsigned int ready:1;
	  } state;
	  wchar_t *wstr;
	} PyASCIIObject;

	typedef struct {
	  PyASCIIObject _base;
	  Py_ssize_t utf8_length;
	  char *utf8;
	  Py_ssize_t wstr_length;
	} PyCompactUnicodeObject;

	typedef struct {
	  PyCompactUnicodeObject _base;
	  union {
		  void *any;
		  Py_UCS1 *latin1;
		  Py_UCS2 *ucs2;
		  Py_UCS4 *ucs4;
	  } data;
	} PyUnicodeObject;

(1)  Why is PyObject_HEAD used instead of PyObject_VAR_HEAD?  Is it
because of the names (.length vs .size), or a holdover from when
unicode (as opposed to str) did not expect to be compact, or is there
a deeper reason?
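To make question (1) concrete, here is a minimal sketch of the two header shapes. The Mock* types and HeadPlusLength are simplified stand-ins invented for illustration, not the real definitions from Include/object.h (which may carry extra debug fields). The only thing assumed from CPython is that PyObject_VAR_HEAD adds a single size field (ob_size) after the plain object header.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, NOT the real CPython definitions. */
typedef struct { ptrdiff_t ob_refcnt; void *ob_type; } MockObject;        /* ~ PyObject    */
typedef struct { MockObject ob_base; ptrdiff_t ob_size; } MockVarObject;  /* ~ PyVarObject */

/* What the quoted struct does: a plain head followed by its own length. */
typedef struct { MockObject ob_base; ptrdiff_t length; } HeadPlusLength;

/* The two spellings put the count at the same offset in this sketch. */
static_assert(offsetof(HeadPlusLength, length) == offsetof(MockVarObject, ob_size),
              "length and ob_size land in the same slot");

size_t length_offset(void)         { return offsetof(HeadPlusLength, length); }
size_t ob_size_offset(void)        { return offsetof(MockVarObject, ob_size); }
size_t head_plus_length_size(void) { return sizeof(HeadPlusLength); }
size_t var_head_size(void)         { return sizeof(MockVarObject); }
```

Under these assumptions the choice costs no memory either way, so the question is whether the difference is anything more than the field name and the Py_SIZE-style conventions that go with it.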

(2)  Why does PyASCIIObject have a wstr member, and why does
PyCompactUnicodeObject have wstr_length?  As best I can tell from the
PEP or header file, wstr is only meaningful when either:

    (2a)  wstr is shared with (and redundant to) the canonical
          representation -- which will therefore not be ASCII.  So wstr
          (and wstr_length) shouldn't need to be represented explicitly,
          and certainly not in the PyASCIIObject base.

or

    (2b)  The string is a "Legacy String" (and PyUnicode_READY has not
          been called).  Because it is a Legacy String, the object header
          must already be a full PyUnicodeObject, and the wstr fields
          could at least be stored there.

          I'm also not sure why wstr can't be stored in the existing
          .data member -- once PyUnicode_READY is called, it will either
          be there (shared) or be discarded.

          Are there other times when the wstr will be explicitly
          re-filled and cached?
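The space argument behind question (2) can be sketched as follows. The structs mirror the three quoted above, but MockHead, MockState and the rest are stand-ins invented for illustration (ptrdiff_t in place of Py_ssize_t and Py_hash_t, a single void * in place of the data union), and MockASCIINoWstr is a purely hypothetical variant with wstr moved out of the base. Exact sizes depend on the platform's padding rules.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Stand-in for PyObject_HEAD; not the real CPython definition. */
typedef struct { ptrdiff_t refcnt; void *type; } MockHead;

/* Same state bits as the quoted struct (kind widened to 3 bits). */
typedef struct {
    unsigned int interned:2;
    unsigned int kind:3;
    unsigned int compact:1;
    unsigned int ascii:1;
    unsigned int ready:1;
} MockState;

/* Layout as quoted: wstr lives in the ASCII base. */
typedef struct {
    MockHead head;
    ptrdiff_t length;
    ptrdiff_t hash;
    MockState state;
    wchar_t *wstr;
} MockASCII;

typedef struct {
    MockASCII base;
    ptrdiff_t utf8_length;
    char *utf8;
    ptrdiff_t wstr_length;
} MockCompact;

typedef struct {
    MockCompact base;
    void *data;              /* stands in for the data union */
} MockUnicode;

/* Hypothetical alternative: no wstr in the ASCII base. */
typedef struct {
    MockHead head;
    ptrdiff_t length;
    ptrdiff_t hash;
    MockState state;
} MockASCIINoWstr;

static_assert(sizeof(MockASCII) > sizeof(MockASCIINoWstr),
              "the wstr pointer takes space in every ASCII header");

size_t ascii_header_size(void)   { return sizeof(MockASCII); }
size_t compact_header_size(void) { return sizeof(MockCompact); }
size_t full_header_size(void)    { return sizeof(MockUnicode); }

/* Bytes each ASCII-only string pays for wstr, including any padding. */
size_t wstr_cost_per_ascii_string(void)
{
    return sizeof(MockASCII) - sizeof(MockASCIINoWstr);
}
```

On a typical 64-bit build the difference would be one pointer per ASCII string, which is the overhead (2a) and (2b) suggest could live in the larger headers instead.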

(3)  I would feel much less nervous if the remaining 4 values of
PyUnicode_Kind were explicitly reserved, and the macros raised an
error when they showed up.  (Better still would be to allow other
values, and to have the macros delegate to some attribute on the (sub)
type object.)
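A minimal sketch of the first half of (3), rejecting unexpected kinds instead of silently misreading the buffer. The mock_kind values are an assumption meant to mirror the constants in the linked header (0 for the not-yet-ready wchar_t form, then 1, 2 and 4), and mock_read is a hypothetical helper, not an existing CPython macro or function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the PyUnicode_*_KIND constants; 3 and 5..7 are unused. */
enum mock_kind {
    MOCK_WCHAR_KIND = 0,
    MOCK_1BYTE_KIND = 1,
    MOCK_2BYTE_KIND = 2,
    MOCK_4BYTE_KIND = 4
};

/* Read one code point from a canonical buffer.
 * Returns 0 on success, -1 if the kind is not one of the three
 * canonical widths (not ready, or a reserved value). */
int mock_read(unsigned int kind, const void *data, size_t index, uint32_t *out)
{
    assert(data != NULL && out != NULL);
    switch (kind) {
    case MOCK_1BYTE_KIND:
        *out = ((const uint8_t *)data)[index];
        return 0;
    case MOCK_2BYTE_KIND:
        *out = ((const uint16_t *)data)[index];
        return 0;
    case MOCK_4BYTE_KIND:
        *out = ((const uint32_t *)data)[index];
        return 0;
    default:
        /* A real implementation would set an exception here, or
         * delegate to a hook on the (sub)type object. */
        return -1;
    }
}
```

The default branch is where the "delegate to the type object" variant would hook in; this sketch only shows the "reserved means error" half.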

Discussion on py-ideas strongly suggested that people should not be
rolling their own string representations, and that it won't
really save as much as people think it will, etc ... but I'm not sure
that saying "do it without inheritance" is the best solution -- and
that is what treating kind as an exhaustive list does.

-jJ

