unicode question

Tue Nov 23 14:35:21 EST 2004

Bengt Richter wrote:
>>Because print invokes str() on its argument, unless the argument is
>>already a byte string (in which case it prints it directly), or a
> 
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-- effectively an assumption that
> bytestring.decode('some_unknown_encoding').encode(sys.stdout.encoding)
> has already been done, it seems (I'm not arguing against).

Not really. sys.stdout really is a byte stream, which may or may
not *have* an encoding. Python tries to determine it, but refuses to
guess in the face of ambiguity: e.g. if sys.stdout is a file, resulting
from

python mkimage.py > image.gif

then sys.stdout really does not *have* an encoding - but it still
is a byte stream. So copying the bytes to stdout is a
straightforward thing to do.

Of course, "print" should only be used if the stream is meant to
transmit characters, and then the bytes written to the stream should
use the stream's encoding. This is indeed the assumption - but one
that the application author needs to make.
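
A minimal sketch of that distinction, not the actual implementation of
print: write_to_stream is a hypothetical helper, and io.BytesIO merely
stands in for a redirected stdout. Bytes are copied as they are; text is
encoded only when the caller supplies an encoding.

```python
import io


def write_to_stream(stream, data, encoding=None):
    """Write data to a byte stream (hypothetical helper, for illustration)."""
    # Byte strings are copied through unchanged; no encoding is involved.
    if isinstance(data, bytes):
        stream.write(data)
        return
    # Text has to be encoded first; without a known encoding, refuse to guess.
    if encoding is None:
        raise ValueError("stream has no encoding; cannot write text")
    stream.write(data.encode(encoding))


# Stand-in for "python mkimage.py > image.gif": a stream with no encoding.
image = io.BytesIO()
write_to_stream(image, b"GIF89a")

# Stand-in for a terminal whose encoding the application assumes is UTF-8.
terminal = io.BytesIO()
write_to_stream(terminal, u"gr\xfc\xdf", encoding="utf-8")
```

Writing text to the first stream without passing an encoding raises
ValueError, which mirrors the "refuse to guess" behaviour.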

> So how about changing print so that it doesn't blindly use str(y)

On the C level, this is already possible, through tp_print. Whether or
not this should be exposed to the Python level (or whether doing so
would just add to the confusion), I don't know.

> but instead
> first tries to get y.__str__() in case the latter returns unicode?
> Then print y can succeed the way print y.__str__() does now.

As yet another alternative, print could invoke unicode(), if
there is a stream encoding. This would try __unicode__ first,
then fall back to calling __str__. Patches in this direction would
be welcome - but the code implementing print is already quite
involved, so a redesign (with a PEP and everything) might also
be in order.
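
A rough sketch of that lookup order, under stated assumptions: to_text
and print_to are hypothetical names, not existing functions, and the
code assumes an interpreter in which str() already returns text (as
planned for P3k). With today's str(), the fallback result would be a
byte string and would need decoding before it could be re-encoded.

```python
import io


def to_text(obj):
    """Try __unicode__ first, then fall back to __str__ (hypothetical)."""
    # Look the method up on the type, the way special methods are found.
    method = getattr(type(obj), "__unicode__", None)
    if method is not None:
        return method(obj)
    return str(obj)


def print_to(stream, obj, encoding):
    """Encode the text form of obj with the stream's encoding and write it."""
    text = to_text(obj) + u"\n"
    stream.write(text.encode(encoding))


class Name(object):
    """Example class providing both conversions."""

    def __init__(self, value):
        self.value = value

    def __unicode__(self):
        return self.value

    def __str__(self):
        # ASCII-only approximation, used when __unicode__ is not consulted.
        return self.value.encode("ascii", "replace").decode("ascii")


class Plain(object):
    """Example class providing only __str__."""

    def __str__(self):
        return "plain"


out = io.BytesIO()
print_to(out, Name(u"L\xf6wis"), "utf-8")  # uses __unicode__
print_to(out, Plain(), "utf-8")            # falls back to __str__
```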

In P3k, this part of the issue will go away, as str() then will
return Unicode strings.

> I.e., str doesn't know that printing and '%s' can use unicode to good effect
> if it available, so for print and str.__mod__ blindly to use str() intermediately
> throws away an opportunity to do better ISTM.

That is true. Of course, there is already so much backwards
compatibility at stake here that any change to behaviour (such as
trying unicode() before trying str()) might break things.

Regards,
Martin