[Python-Dev] str() vs. unicode()

Sat, 22 Sep 2001 18:14:41 +0200

Guido van Rossum wrote:
>
> > I'd like to query for the common opinion on an issue which I've
> > run into when trying to resynchronize unicode() and str() in terms
> > on what happens when you pass arbitrary objects to these constructors
> > which happen to implement tp_str (or __str__ for instances).
> >
> > Currently, str() will accept any object which supports the tp_str
> > interface and revert to tp_repr in case that slot should not
> > be available.
> >
> > unicode() supported strings, character buffers and instances
> > having a __str__ method before yesterday's checkins.
> >
> > Now the goal of the checkins was to make str() and unicode()
> > behave in a more compatible fashion. Both should accept
> > the same kinds of objects and raise exceptions for all others.
>
> Well, historically, str() has rarely raised exceptions, because
> there's a default implementation (same as for repr(), returning <FOO
> object at ADDRESS>).  This is used when neither tp_repr nor tp_str is
> set.  Note that PyObject_Str() never looks at __str__ -- this is done
> by the tp_str handler of instances (and now also by the tp_str handler
> of new-style classes).  I see no reason to change this.

Me neither; what str() does not do (and unicode() does) is try
the char buffer interface before trying tp_str.
> The question then becomes, do we want unicode() to behave similarly?

Given that porting an application from strings to Unicode should
be easy, I'd say: yes.
> > The path I chose was to fix PyUnicode_FromEncodedObject()
> > to also accept tp_str compatible objects. This API is used
> > by the unicode_new() constructor (which is exposed as unicode()
> > in Python) to create a Unicode object from the input object.
> >
> > str() OTOH uses PyObject_Str() via string_new().
> >
> > Now there also is a PyObject_Unicode() API which tries to
> > mimic PyObject_Str(). However, it does not support the additional
> > encoding and errors arguments which the unicode() constructor
> > has.
> >
> > The problem which Guido raised about my checkins was that
> > the changes to PyUnicode_FromEncodedObject() are seen not
> > only in unicode(), but also all other instances where this
> > API is used.
> >
> > OTOH, PyUnicode_FromEncodedObject() is the most generic constructor
> > for Unicode objects there currently is in Python.
> >
> > So the questions are
> > - should I revert the change in PyUnicode_FromEncodedObject()
> >   and instead extend PyObject_Unicode() to support encodings ?
> > - should we make PyObject_Unicode() use
> >   PyUnicode_FromEncodedObject() instead of providing its
> >   own implementation ?
> >
> > The overall picture of all this auto-conversion stuff going
> > on in str() and unicode() is very confusing. Perhaps what
> > we really need is first to agree on a common understanding
> > of which auto-conversion should take place and then make
> > str() and unicode() support exactly the same interface ?!
> >
> > PS: Also see patch #446754 by Walter Dörwald:
> > http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470
>
> OK, let's take a step back.
>=20
> The str() function (now constructor) converts *anything* to a string;
> tp_str and tp_repr exist to allow objects to customize this.  These
> slots, and the str() function, take no additional arguments.  To
> invoke the equivalent of str() from C, you call PyObject_Str().  I see
> no reason to change this; we may want to make the Unicode situation as
> similar as possible.

Right.
> The unicode() function (now constructor) traditionally converted only
> 8-bit strings to Unicode strings,=20

Slightly incorrect: it converted 8-bit strings, objects compatible
with the char buffer interface and instances having a __str__ method to
Unicode.

To synchronize unicode() with str() we'd have to replace the __str__
lookup with a tp_str lookup (this will also allow things like unicode(2)
and unicode(instance_having__str__)) and maybe also add the charbuf
lookup to str() (this would make str() compatible with memory mapped
files and probably a few other char buffer aware objects as well).

Note that in a discussion we had some time ago we decided that __str__
should be allowed to return Unicode objects as well (instead of
defining a separate __unicode__ method/slot for this purpose). str()
will convert a Unicode return value to an 8-bit string using the
default encoding while unicode() takes the return value as-is.

This was done to simplify moving from strings to Unicode.
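
A rough model of that rule, written for today's Python 3 rather than for the interpreter under discussion: 8-bit strings are stood in for by bytes and Unicode objects by str. DEFAULT_ENCODING, legacy_str() and legacy_unicode() are names made up for this illustration, not real APIs.

```python
# Hypothetical model only; this is not the interpreter's actual code.
DEFAULT_ENCODING = "ascii"  # assumed stand-in for the default encoding


class Label:
    """An object whose __str__ returns Unicode text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def legacy_unicode(obj):
    # Models unicode(): the Unicode value returned by __str__ is taken as-is.
    return str(obj)


def legacy_str(obj):
    # Models str(): a Unicode return value is converted to an 8-bit
    # string using the default encoding.
    return str(obj).encode(DEFAULT_ENCODING)


plain = Label("spam")
accented = Label("D\u00f6rwald")

print(legacy_unicode(plain), legacy_str(plain))
print(legacy_unicode(accented))
try:
    legacy_str(accented)
except UnicodeEncodeError as exc:
    # Non-ASCII text cannot pass through the assumed ASCII default.
    print("str() would fail here:", exc.reason)
```

The point of the sketch is that only the str() side depends on the default encoding; the unicode() side never has to encode anything.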

> with additional arguments to specify
> the encoding (and error handling preference).  There is no tp_unicode
> slot, but for some reason there are at least three C APIs that could
> correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
> take a single object argument, and PyUnicode_FromEncodedObject() takes
> object, encoding, and error arguments.
>
> The first question is, do we want the unicode() constructor to be
> applicable in all cases where the str() constructor is?

Yes.

> I guess that
> we do, since we want to be able to print to streams that support
> Unicode.  Unicode strings render themselves as Unicode characters to
> such a stream, and it's reasonable to allow other objects to also
> customize their rendition in Unicode.
>
> Now, what should be the signature of this conversion?  If we print
> object X to a Unicode stream, should we invoke unicode(X), or
> unicode(X, encoding, error)?  I believe it should be just unicode(X),
> since the encoding used by the stream shouldn't enter into the picture
> here: that's just used for converting Unicode characters written to
> the stream to some external format.
>
> How should an object be allowed to customize its Unicode rendition?
> We could add a tp_unicode slot to the type object, but there's no
> need: we can just look for a __unicode__ method and call it if it
> exists.  The signature of __unicode__ should take no further
> arguments: unicode(X) should call X.__unicode__().  As a fallback, if
> the object doesn't have a __unicode__ attribute, PyObject_Str() should
> be called and the resulting string converted to Unicode using the
> default encoding.

I'd rather leave things as they are: __str__/tp_str are allowed
to return Unicode objects and if an object wishes to be rendered
as Unicode it can simply return a Unicode object through the
__str__/tp_str interface.
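
To make the two alternatives concrete, here is a minimal sketch in today's Python 3, where str is already Unicode and the final "convert using the default encoding" step is therefore a no-op. convert_with_hook() is a hypothetical helper, and __unicode__ has no special meaning to Python 3; it is looked up by hand here.

```python
def convert_with_hook(obj):
    """Quoted proposal: try __unicode__ first, then fall back to str()."""
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    # Fallback: the equivalent of PyObject_Str(). In Python 3 the result
    # is already Unicode, so no decoding step is needed.
    return str(obj)


class WithHook:
    # Customizes its Unicode rendition through the separate hook.
    def __str__(self):
        return "plain rendition"

    def __unicode__(self):
        return "Unicode rendition"


class StrOnly:
    # The alternative: no extra hook, __str__ itself returns Unicode.
    def __str__(self):
        return "rendition via __str__"


print(convert_with_hook(WithHook()))
print(convert_with_hook(StrOnly()))
print(convert_with_hook(2))
```

Under either scheme an object without any customization, such as the integer above, still gets a rendition through the str() fallback.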
> Regarding the "long form" of unicode(), unicode(X, encoding, error), I
> see no reason to treat this with the same generality.  This form
> should restrict X to something that supports the buffer API (IOW,
> 8-bit string objects and things that are treated the same as these in
> most situations).

Hmm, but this would keep users from implementing string-like
objects (i.e. objects that define a __str__ method to make them
compatible with str()).

> (Note that it already balks when X is a Unicode
> string.)

True -- since you normally cannot decode Unicode into Unicode using
some 8-bit character encoding. As a result encodings which convert
Unicode to Unicode (e.g. normalizations) cannot use this interface,
but since these are probably only rarely used, I think it's better
to prevent accidental usage of an 8-bit character codec on Unicode.
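
As an aside for readers on today's Python 3, whose str constructor is the descendant of unicode(): it draws the same line between the one-argument form and the form taking an encoding. This is an observation about Python 3, not about the code under discussion, and try_decode() is only a helper for the illustration.

```python
def try_decode(obj, encoding="utf-8", errors="strict"):
    """Return str(obj, encoding, errors), or the TypeError it raises."""
    try:
        return str(obj, encoding, errors)
    except TypeError as exc:
        return exc


# Objects supporting the buffer API are decoded.
print(try_decode(b"abc"))
print(try_decode(bytearray(b"abc")))

# Text that is already Unicode is rejected rather than decoded again.
print(type(try_decode("abc")).__name__)

# Objects without the buffer API are rejected too, although the
# one-argument form accepts them.
print(type(try_decode(2)).__name__, str(2))
```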
> So about those C APIs: I propose that PyObject_Unicode() correspond to
> the one-arg form of unicode(), taking any kind of object, and that
> PyUnicode_FromEncodedObject() correspond to the three-arg form.

Ok. I'll fix this once we've reached consensus on what to do
about str() and unicode().

> PyUnicode_FromObject() shouldn't really need to exist.

Note: PyUnicode_FromObject() was extended by
PyUnicode_FromEncodedObject() and only exists for backward
compatibility reasons.

> I don't see a
> reason for PyUnicode_From[Encoded]Object() to use the __unicode__
> customization -- it should just take the bytes provided by the object
> and decode them according to the given encoding.  PyObject_Unicode(),
> on the other hand, should look for __unicode__ first and then
> PyObject_Str().
>
> I hope this helps.

Thanks for the summary.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/