unicode question

Mon Nov 22 09:51:13 EST 2004

On Mon, 22 Nov 2004 08:04:08 GMT, "wolfgang haefelinger" <wh2005 at web.de> wrote:

>Hi Martin,
>
>if print is implemented like this then I begin to understand the problem.
>
>Nevertheless, I regard
>
> print y.__str__()            ## works
> print y                           ## fails??
>
>as a very inconsistent behaviour.
>
>Somehow I have the feeling that Python should give up the distinction
>between unicode  and  str  and just have a str type which is internally
>unicode.
>
>
>Anyway, thanks for answering
>Wolfgang.
>
>"Martin v. Löwis" <martin at v.loewis.de> wrote in message 
>news:41a0ab62$0$151$9b622d9e at news.freenet.de...
>> wolfgang haefelinger wrote:
>>> I was actually thinking that
>>>
>>>  print x
>>>
>>> is just a kind of shortcut for writing (simplifying a bit):
>>>
>>>  import sys
>>>  if not (isinstance(x,str) or isinstance(x,unicode)) and x.__str__ :
>>>     x = x.__str__()
>>>  sys.stdout.write(x)
>>
>> This is too simplifying. For the context of this discussion,
>> it is rather
>>
>> import sys
>> if isinstance(x, unicode) and sys.stdout.encoding:
>>     x = x.encode(sys.stdout.encoding)
>> x = str(x)
>> sys.stdout.write(x)
>>
>> (this, of course, is still quite simplified. It ignores tp_print,
>> and it ignores softspaces).
>>
>>> Or in words: if x is not a string type but has method __str__ then
>>>
>>>  print x
>>>
>>> behaves like
>>>
>>>  print x.__str__()
>>
>> No. There are many types for which this is not true; in this specific
>> case, it isn't true for Unicode objects.
>>
>>> Is this a bug??
>>
>> No. You are just misunderstanding it.
>>
>> Regards,
>> Martin 
>
It's an old issue, and ISTM there is either a problem or it needs to be better explained.
My bet is on a problem ;-) ISTM the key is that a plain str is a byte sequence that can also
be interpreted as a byte-stream-encoded character sequence, and that leads to some seemingly
inconsistent situations. E.g., start with a sequence of numbers, obviously just produced
by a polynomial formula having nothing to do with characters:

 >>> numbers = [(lambda x: (-499*x**4 +4634*x**3 -13973*x**2 +13918*x +1824)/24)(x) for x in xrange(5)]
 >>> numbers
 [76, 246, 119, 105, 115]

Now if we convert those to str type characters with chr() and join them:

 >>> s = ''.join(map(chr, numbers))

Then we have a sequence of bytes which could have had any numerical value in range(256). No character
encoding is assumed. Yet. If we now assume, say, a latin-1 encoding, we can decode the bytes into
unicode:

 >>> u = s.decode('latin-1')
 >>> type(u)
 <type 'unicode'>
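
For readers on Python 3, the same steps can be sketched as follows. This is only an approximate translation, not the original session: it assumes that Python 3 `bytes` stands in for the Python 2 `str` and Python 3 `str` for `unicode`, and the helper name `poly` is made up for the example.

```python
# Python 3 rendering of the session above (an approximation of the
# Python 2 original): bytes ~ Python 2 str, str ~ Python 2 unicode.
def poly(x):
    # Hypothetical helper name. For x in 0..4 the numerator is an exact
    # multiple of 24, so floor division loses nothing.
    return (-499*x**4 + 4634*x**3 - 13973*x**2 + 13918*x + 1824) // 24

numbers = [poly(x) for x in range(5)]
s = bytes(numbers)          # raw bytes, no character encoding assumed yet
u = s.decode('latin-1')     # explicit assumption: the bytes are latin-1

print(numbers)              # [76, 246, 119, 105, 115]
print(repr(s))              # b'L\xf6wis'
print(ascii(u))             # 'L\xf6wis'
```

`ascii()` is used for the last line so that the output does not depend on the terminal's own encoding.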

Now if we print that, sys.stdout.encoding should come into play:

 >>> print u
 Löwis

 :-)

And we are ok, because we were explicit the whole way.
But if we don't decode s explicitly, it seems the system makes an assumption:

 >>> print s
 L÷wis

That is (if it survived) the 'cp437' character for byte '\xf6'. IOW, print seems
to assume that a plain str is encoded ready for output in sys.stdout.encoding in
a kind of reinterpret_cast of the str, or else a decode('cp437').encode('cp437')
optimized away.

 >>> sys.stdout.encoding
 'cp437'
 >>> sys.getdefaultencoding()
 'ascii'

If it were assuming s was encoded as ascii, it should really do s.decode('ascii').encode('cp437')
to get it printed, but for plain str literals it does not seem to do that. I.e.,

 >>> s.decode('ascii')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

doesn't work, so it can't be doing that. It seems to print s as s.decode('cp437').encode('cp437')

 >>> s.decode('cp437')
 u'L\xf7wis'

but that is the wrong decoding (though the system can't be expected to know).

 >>> print s.decode('cp437').encode('cp437')
 L÷wis
 >>> print s.decode('latin-1').encode('cp437')
 Löwis
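
The three interpretations of the same bytes can be put side by side. The sketch below is in Python 3 (an assumption for illustration, with `bytes` in place of the Python 2 `str`), and it prints `repr`/`ascii` forms so the result does not depend on the console encoding.

```python
s = b'L\xf6wis'                      # the raw bytes from the session above

as_cp437 = s.decode('cp437')         # what a cp437 console effectively assumes
as_latin1 = s.decode('latin-1')      # what was actually intended

print(ascii(as_cp437))               # 'L\xf7wis'  (U+00F7, the division sign)
print(ascii(as_latin1))              # 'L\xf6wis'  (U+00F6, o with diaeresis)

# Re-encoding for a cp437 console: only the latin-1 reading yields the
# byte that cp437 displays as o with diaeresis (0x94).
print(as_cp437.encode('cp437'))      # b'L\xf6wis'
print(as_latin1.encode('cp437'))     # b'L\x94wis'

# The ascii reading is not possible at all for this byte string.
try:
    s.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc
    print(exc)
```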

What other decoding should be attempted, lacking an indication? sys.getdefaultencoding()
might be reasonable, but it seems to be locked into 'ascii' (I don't know how to set it),
and rebinding the function certainly doesn't change what the codecs do:

 >>> sys.getdefaultencoding = lambda: 'latin-1'
 >>> sys.getdefaultencoding()
 'latin-1'
 >>> unicode('L\xf6wis')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)
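
The same effect can be reproduced on Python 3, with the caveat that its default is 'utf-8' rather than 'ascii', so this is an analogy and not the session above. Rebinding `sys.getdefaultencoding` only changes what that Python-level name returns; an implicit decode still fails, and only an explicit codec name helps.

```python
import sys

original = sys.getdefaultencoding
sys.getdefaultencoding = lambda: 'latin-1'   # rebinding the name, as in the post
try:
    reported = sys.getdefaultencoding()      # now reports 'latin-1'
    try:
        b'L\xf6wis'.decode()                 # implicit codec: still not latin-1
        implicit_ok = True
    except UnicodeDecodeError:
        implicit_ok = False
finally:
    sys.getdefaultencoding = original        # undo the monkeypatch

explicit = b'L\xf6wis'.decode('latin-1')     # naming the codec always works

print(reported, implicit_ok, ascii(explicit))
```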

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when the unicode object
already carries the character information and print is happy to encode a bare unicode
object directly to sys.stdout.encoding?

 >>> s
 'L\xf6wis'
 >>> u
 u'L\xf6wis'
 >>> print s
 L÷wis
 >>> print u
 Löwis
 >>> class Y:
 ...     def __str__(self): return self.c
 ...
 >>> y = Y()
 >>> y.c = s
 >>> print y
 L÷wis
 >>> y.c = u
 >>> print y
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
 >>> print u
 Löwis

Maybe print should accept any basestring instance as the result of __str__, so that
 y.c = u
 print y
above has the same result as
 print u

It seems to be trying to do u.encode('ascii').decode('ascii').encode('cp437')
instead of directly u.encode('cp437') when __str__ is involved.
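
The observed behaviour can be modelled in a few lines. This is a toy model written in Python 3, not the CPython implementation: it follows Martin's simplified description of print, `bytes` stands in for the Python 2 `str`, `str` for `unicode`, and the names `to_str`, `py2_str` and `print_bytes` are made up for the example. The two encodings are the values reported in the session above.

```python
DEFAULT_ENCODING = 'ascii'   # sys.getdefaultencoding() in the post
STDOUT_ENCODING = 'cp437'    # sys.stdout.encoding in the post

class Y:
    def __init__(self, c):
        self.c = c
    def to_str(self):        # hypothetical stand-in for Python 2 __str__
        return self.c

def py2_str(obj):
    """Model of Python 2 str(obj): the result must be a byte string."""
    result = obj if isinstance(obj, (bytes, str)) else obj.to_str()
    if isinstance(result, str):                  # "unicode" result
        result = result.encode(DEFAULT_ENCODING) # coerced via ascii
    return result

def print_bytes(obj):
    """Model of the bytes that print writes to stdout."""
    if isinstance(obj, str):                     # bare "unicode" object
        return obj.encode(STDOUT_ENCODING)       # encoded directly
    return py2_str(obj)                          # everything else via str()

print(print_bytes('L\xf6wis'))        # b'L\x94wis', shown as "Löwis" in cp437
print(print_bytes(Y(b'L\xf6wis')))    # b'L\xf6wis', shown as "L÷wis" in cp437
try:
    print_bytes(Y('L\xf6wis'))
except UnicodeEncodeError as exc:
    print(exc)                        # the ascii failure seen with "print y"
```

In this model the failure comes entirely from the ascii step in `py2_str`, which the bare unicode object never passes through.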

 >>> print u'%s' % y
 Löwis

works, and

 >>> print '%s' % u
 Löwis

works, and 

 >>> print y.__str__()
 Löwis

and

 >>> print y.c
 Löwis

works, where

 >>> y.c
 u'L\xf6wis'

but

 >>> print '%s'%y
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

and never mind print,

 >>> '%s' % u
 u'L\xf6wis'
 >>> '%s' % y.__str__()
 u'L\xf6wis'
 >>> '%s' % y
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

I guess it's that str.__mod__(self, other) can deal with a unicode other and promote the
result to unicode, but for any other object it must do str(other) instead of other.__str__(),
or it would be able to promote the result in the latter case too...

This seems like a possible change that could smooth things a bit, especially if print a,b,c
was then effectively the same as print ('%s'%a),('%s'%b),('%s'%c) with encoding promotion.
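
To make the suggestion concrete, here is a toy comparison of the two behaviours for '%s'. It is a Python 3 model under stated assumptions, not actual interpreter code: `bytes` stands in for the Python 2 `str`, `str` for `unicode`, and `to_str`, `format_s_current` and `format_s_proposed` are hypothetical names.

```python
DEFAULT_ENCODING = 'ascii'   # sys.getdefaultencoding() in the post

class Y:
    def __init__(self, c):
        self.c = c
    def to_str(self):        # hypothetical stand-in for Python 2 __str__
        return self.c

def format_s_current(obj):
    """Model of '%s' % obj as observed: str(obj) for non-string objects."""
    if isinstance(obj, (bytes, str)):
        return obj                               # unicode operand is promoted
    result = obj.to_str()
    if isinstance(result, str):
        result = result.encode(DEFAULT_ENCODING) # ascii coercion, may fail
    return result

def format_s_proposed(obj):
    """Model of the suggestion: keep a unicode __str__ result and promote."""
    if isinstance(obj, (bytes, str)):
        return obj
    return obj.to_str()                          # no ascii round trip

y = Y('L\xf6wis')
try:
    format_s_current(y)
except UnicodeEncodeError as exc:
    print('current: ', exc)
print('proposed:', ascii(format_s_proposed(y)))
```

Under the proposed rule, formatting `y` gives the same result as formatting the unicode object it wraps.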

Regards,
Bengt Richter