[Python-Dev] str() vs. unicode()
Guido van Rossum
guido@python.org
Fri, 21 Sep 2001 10:59:27 -0400
> I'd like to query for the common opinion on an issue which I've
> run into when trying to resynchronize unicode() and str() in terms
> of what happens when you pass arbitrary objects to these constructors
> which happen to implement tp_str (or __str__ for instances).
>
> Currently, str() will accept any object which supports the tp_str
> interface and revert to tp_repr in case that slot should not
> be available.
>
> unicode() supported strings, character buffers and instances
> having a __str__ method before yesterday's checkins.
>
> Now the goal of the checkins was to make str() and unicode()
> behave in a more compatible fashion. Both should accept
> the same kinds of objects and raise exceptions for all others.

Well, historically, str() has rarely raised exceptions, because
there's a default implementation (same as for repr(), returning <FOO
object at ADDRESS>). This is used when neither tp_repr nor tp_str is
set. Note that PyObject_Str() never looks at __str__ -- this is done
by the tp_str handler of instances (and now also by the tp_str handler
of new-style classes). I see no reason to change this.
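
To make the fallback chain concrete, here is a small sketch. It runs on
a current Python 3 interpreter rather than the 2001 code base under
discussion, and the class names Plain, ReprOnly and Both are made up for
illustration. It only shows the behavior visible from Python: str() uses
__str__ if present, otherwise __repr__, otherwise the default
<FOO object at ADDRESS> form.

```python
class Plain:
    """Hypothetical class that customizes neither __str__ nor __repr__."""


class ReprOnly:
    """Hypothetical class that customizes only __repr__."""

    def __repr__(self):
        return "ReprOnly()"


class Both:
    """Hypothetical class that customizes both conversions."""

    def __repr__(self):
        return "Both()"

    def __str__(self):
        return "both, pretty"


# No customization at all: the default "<... object at ADDRESS>" form.
plain_text = str(Plain())

# Only __repr__ is defined: str() falls back to it.
repr_only_text = str(ReprOnly())

# __str__ is defined: str() prefers it over __repr__.
both_text = str(Both())
```

None of the three calls raises, which is the point being made above.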
The question then becomes, do we want unicode() to behave similarly?

> The path I chose was to fix PyUnicode_FromEncodedObject()
> to also accept tp_str compatible objects. This API is used
> by the unicode_new() constructor (which is exposed as unicode()
> in Python) to create a Unicode object from the input object.
>
> str() OTOH uses PyObject_Str() via string_new().
>
> Now there also is a PyObject_Unicode() API which tries to
> mimic PyObject_Str(). However, it does not support the additional
> encoding and errors arguments which the unicode() constructor
> has.
>
> The problem which Guido raised about my checkins was that
> the changes to PyUnicode_FromEncodedObject() are seen not
> only in unicode(), but also all other instances where this
> API is used.
>
> OTOH, PyUnicode_FromEncodedObject() is the most generic constructor
> for Unicode objects there currently is in Python.
>
> So the questions are
> - should I revert the change in PyUnicode_FromEncodedObject()
> and instead extend PyObject_Unicode() to support encodings ?
> - should we make PyObject_Unicode() use
> PyUnicode_FromEncodedObject() instead of providing its
> own implementation ?
>
> The overall picture of all this auto-conversion stuff going
> on in str() and unicode() is very confusing. Perhaps what
> we really need is first to agree on a common understanding
> of which auto-conversion should take place and then make
> str() and unicode() support exactly the same interface ?!
>
> PS: Also see patch #446754 by Walter Dörwald:
> http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470

OK, let's take a step back.

The str() function (now constructor) converts *anything* to a string;
tp_str and tp_repr exist to allow objects to customize this. These
slots, and the str() function, take no additional arguments. To
invoke the equivalent of str() from C, you call PyObject_Str(). I see
no reason to change this; we may want to make the Unicode situation as
similar as possible.

The unicode() function (now constructor) traditionally converted only
8-bit strings to Unicode strings, with additional arguments to specify
the encoding (and error handling preference). There is no tp_unicode
slot, but for some reason there are at least three C APIs that could
correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
take a single object argument, and PyUnicode_FromEncodedObject() takes
object, encoding, and error arguments.

The first question is, do we want the unicode() constructor to be
applicable in all cases where the str() constructor is? I guess that
we do, since we want to be able to print to streams that support
Unicode. Unicode strings render themselves as Unicode characters to
such a stream, and it's reasonable to allow other objects to also
customize their rendition in Unicode.

Now, what should be the signature of this conversion? If we print
object X to a Unicode stream, should we invoke unicode(X), or
unicode(X, encoding, error)? I believe it should be just unicode(X),
since the encoding used by the stream shouldn't enter into the picture
here: that's just used for converting Unicode characters written to
the stream to some external format.
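
A rough sketch of that separation, written for a current Python 3
interpreter (where str() already produces Unicode text and so stands in
for the one-argument unicode(X)). The Temperature class and the
render_to_bytes() helper are invented for this example. The conversion
step never sees the encoding; only the stream does.

```python
import io


class Temperature:
    """Hypothetical object whose text rendition contains a non-ASCII character."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        return "%d\u00b0C" % self.degrees  # e.g. 21 followed by a degree sign and C


def render_to_bytes(obj, encoding):
    """Write obj to a text stream with the given encoding; return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    text = str(obj)      # conversion: takes no encoding argument
    stream.write(text)   # the stream alone turns characters into bytes
    stream.flush()
    data = raw.getvalue()
    stream.detach()      # keep the wrapper from closing raw behind our back
    return data


utf8_bytes = render_to_bytes(Temperature(21), "utf-8")
latin1_bytes = render_to_bytes(Temperature(21), "latin-1")
```

The same str(obj) result feeds both streams; only the external byte
format differs.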
How should an object be allowed to customize its Unicode rendition?
We could add a tp_unicode slot to the type object, but there's no
need: we can just look for a __unicode__ method and call it if it
exists. The signature of __unicode__ should take no further
arguments: unicode(X) should call X.__unicode__(). As a fallback, if
the object doesn't have a __unicode__ attribute, PyObject_Str() should
be called and the resulting string converted to Unicode using the
default encoding.
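
As a sketch of that lookup order (not an implementation of
PyObject_Unicode()), the function below models the proposed one-argument
form. It is written for a current Python 3 interpreter, where str()
already returns Unicode text, so the "convert using the default encoding"
step has nothing left to do and appears only as a comment. The names
object_unicode, Money and Point are hypothetical.

```python
def object_unicode(obj):
    """Hypothetical model of the proposed one-argument unicode(X)."""
    hook = getattr(obj, "__unicode__", None)
    if hook is not None:
        # The object customizes its Unicode rendition; call it with no arguments.
        return hook()
    # Fallback: the equivalent of PyObject_Str(). In the 2001 interpreter the
    # result would be an 8-bit string, decoded with the default encoding.
    return str(obj)


class Money:
    """Hypothetical class with separate 8-bit-safe and Unicode renditions."""

    def __init__(self, amount):
        self.amount = amount

    def __str__(self):
        return "EUR %d" % self.amount

    def __unicode__(self):
        return "\u20ac%d" % self.amount  # euro sign followed by the amount


class Point:
    """Hypothetical class that only defines __str__."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        return "(%d, %d)" % (self.x, self.y)


money_text = object_unicode(Money(5))     # uses __unicode__
point_text = object_unicode(Point(1, 2))  # falls back to __str__
number_text = object_unicode(42)          # falls back to the built-in conversion
```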
Regarding the "long form" of unicode(), unicode(X, encoding, error), I
see no reason to treat this with the same generality. This form
should restrict X to something that supports the buffer API (IOW,
8-bit string objects and things that are treated the same as these in
most situations). (Note that it already balks when X is a Unicode
string.)
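
The restriction can be sketched like this, again for a current Python 3
interpreter, with memoryview() standing in for "supports the buffer API".
The function name decode_buffer is hypothetical, and the error message is
my own wording.

```python
def decode_buffer(obj, encoding, errors="strict"):
    """Hypothetical model of the three-argument unicode(X, encoding, errors)."""
    try:
        view = memoryview(obj)  # succeeds only for objects exposing a buffer
    except TypeError:
        raise TypeError(
            "decoding requires a buffer-like object, not %s" % type(obj).__name__
        ) from None
    # No __unicode__ or __str__ lookup here: just take the bytes and decode them.
    return view.tobytes().decode(encoding, errors)


from_bytes = decode_buffer(b"caf\xc3\xa9", "utf-8")
from_bytearray = decode_buffer(bytearray(b"abc"), "ascii")
lenient = decode_buffer(b"\xff", "ascii", "replace")
```

Passing a Unicode string or an arbitrary object such as an integer raises
TypeError in this sketch, which mirrors the "balks" behavior mentioned
above.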
So about those C APIs: I propose that PyObject_Unicode() correspond to
the one-arg form of unicode(), taking any kind of object, and that
PyUnicode_FromEncodedObject() correspond to the three-arg form.
PyUnicode_FromObject() shouldn't really need to exist. I don't see a
reason for PyUnicode_From[Encoded]Object() to use the __unicode__
customization -- it should just take the bytes provided by the object
and decode them according to the given encoding. PyObject_Unicode(),
on the other hand, should look for __unicode__ first and then
PyObject_Str().

I hope this helps.
--Guido van Rossum (home page: http://www.python.org/~guido/)