[Python-3000] string C API

Jim Jewett jimjjewett at gmail.com
Tue Oct 3 19:41:27 CEST 2006


On 10/3/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > In python 3, a string object might look like

> > #define PyObject_str_HEAD   \
> >    PyObject_VAR_HEAD   \
> >    long ob_shash;   \
> >    PyObject *cache;

> > with a typical concrete implementation looking like

> > typedef struct {
> >    PyObject_str_HEAD
> >    PyObject *encoding;   /* concrete method implementation,
> >                             not just codecs */
> >    /* ... representation-specific data ... */
> > } PyAbstractUnicodeObject;

> I think Josiah is proposing a different implementation:

> typedef struct {
>   PyObject_VAR_HEAD
>   long ob_shash;
>   enum{L1,L2,L4} ob_elemsize;
>   ucs4 ob_sval[1]; /* could be interpreted as char* or ucs2* as well */
> } PyUnicodeObject;

Yes.

By knowing that there are only three possible encodings, he reduces
the encoding pointer to an enum.
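
Concretely, that turns element access into a switch on a three-valued
tag instead of an indirect call through an encoding object.  A minimal
sketch (the names here are made up for illustration, not a proposed
API):

#include <stddef.h>

typedef unsigned short ucs2;
typedef unsigned int ucs4;

typedef struct {
    /* PyObject_HEAD and ob_shash omitted for brevity */
    size_t ob_size;
    enum { L1, L2, L4 } ob_elemsize;
    ucs4 ob_sval[1];    /* read as char*, ucs2*, or ucs4* */
} SketchStr;

static ucs4
sketch_getchar(const SketchStr *s, size_t i)
{
    switch (s->ob_elemsize) {
    case L1: return ((const unsigned char *)s->ob_sval)[i];
    case L2: return ((const ucs2 *)s->ob_sval)[i];
    default: return s->ob_sval[i];    /* L4 */
    }
}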

By knowing that there is only one possible representation for a given
string, he can skip the equivalency cache -- but that also means he
loses it.  When Python 2.x chooses the Unicode width, it tries to
match Tcl; under a "minimal size possible" scheme, strings that fit in
ASCII will have to be recoded twice on every round trip.  The same
problem pops up with other extension modules, and with system
encodings.
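
The cache costs a pointer per string, but it means each recode is paid
at most once.  A sketch of the idea, caching a single alternate
representation (recode() is a hypothetical helper, and a real version
would also have to track which width the cache holds):

typedef struct AbstractStr AbstractStr;

struct AbstractStr {
    long ob_shash;
    AbstractStr *cache;    /* one alternate representation, or NULL */
    /* encoding and data live in the concrete subtype */
};

/* hypothetical: build a new representation at the desired width */
extern AbstractStr *recode(const AbstractStr *s, int desired_width);

static AbstractStr *
as_width(AbstractStr *s, int desired_width)
{
    if (s->cache == NULL)                      /* recode at most once */
        s->cache = recode(s, desired_width);
    return s->cache;
}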

Exposing the full object instead of an abstract interface lets
compilers do pointer addition instead of calling a get_data function.
But they still don't know (until run time) how wide the data at that
pointer will be, and we're locked into binary compatibility.
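
The difference in miniature (both names below are hypothetical):

#include <stddef.h>

typedef unsigned int ucs4;

typedef struct {
    ucs4 ob_sval[1];
} ExposedStr;

extern ucs4 *get_data(void *s);    /* hypothetical accessor */

static ucs4
nth_direct(ExposedStr *s, size_t i)
{
    return s->ob_sval[i];     /* one add; but the layout is frozen */
}

static ucs4
nth_abstract(void *s, size_t i)
{
    return get_data(s)[i];    /* a call; but the layout can change */
}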

> > Python is normally pretty good about duck typing, but str is a
> > notorious exception.

> You seem to be talking about polymorphism through inheritance.

Partly.  But for various reasons, strings seem to cause more problems
than other types.  These would go away if people used the API instead
of isinstance, PyString_Check, and PyString_CheckExact (or simply
assuming their results), but they don't -- because the layout is
public.
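
For instance, with the Python 2.x API, the first helper below works
only for exact str objects because it peeks at the public layout; the
second goes through the API, which at least keeps str subclasses (and
even unicode) working:

#include <Python.h>

static Py_ssize_t
len_by_layout(PyObject *obj)
{
    if (!PyString_CheckExact(obj))
        return -1;    /* anything string-like but not str is rejected */
    return ((PyStringObject *)obj)->ob_size;    /* layout assumed */
}

static Py_ssize_t
len_by_api(PyObject *obj)
{
    char *p;
    Py_ssize_t n;
    if (PyString_AsStringAndSize(obj, &p, &n) < 0)
        return -1;
    return n;
}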

> I doubt any kind of "pluggable" representation could work in a
> reasonable way. With that generality, you lose any information
> as to what the internal representation is, and then code becomes
> tedious to write and slow to run.

Instead of working with ((string)obj).data directly, you work with
string.recode(object, desired).

I see nothing wrong with adding (err ... keeping) convenience methods
in the API for common or recommended encodings, so that obj.data
becomes UCS4(obj).
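
At the C level that convenience could be as thin as a macro over the
general entry point (string_recode and these macros are hypothetical
names, not an existing API):

#include <Python.h>

/* hypothetical general entry point: return obj in the desired
   representation, recoding only if necessary */
extern PyObject *string_recode(PyObject *obj, const char *desired);

/* one-call wrappers for the common/recommended encodings */
#define UCS4(obj)  string_recode((obj), "ucs4")
#define UTF8(obj)  string_recode((obj), "utf-8")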

If you're saying this will be slow because it is a C function call,
then I can't really argue; I just think it will be a good trade for
all the times we don't recode at all (or recode only once per
encoding).
I'll admit that I'm not sure what sort of data would make a real-world
(as opposed to contrived) benchmark.

-jJ

