[Tutor] unicode: % & __str__ & str()

Dave Angel davea at ieee.org
Sat Oct 31 04:36:33 CET 2009


spir wrote:
> [back to the list after a rather long break]
>
> Hello,
>
> I stepped on a unicode issue ;-) (one more)
> Below an illustration:
>
> ============================
> class U(unicode):
> 	def __str__(self):
> 		return self
>
> # if you can't properly see the string below,
> # 128<ordinals<255
> c0 = "¶ÿµ"
> c1 = U("¶ÿµ","utf8")
> c2 = unicode("¶ÿµ","utf8")
>
> for c in (c0,c1,c2):
> 	try:
> 		print "%s" %c,
> 	except UnicodeEncodeError:
> 		print "***",
> 	try:
> 		print c.__str__(),
> 	except UnicodeEncodeError:
> 		print "***",
> 	try:
> 		print str(c)
> 	except UnicodeEncodeError:
> 		print "***"
>
> =
>
> ¶ÿµ ¶ÿµ ¶ÿµ
> ¶ÿµ ¶ÿµ ***
> ¶ÿµ *** ***
> ==============================
>
> The last line shows that a regular unicode can be passed neither to str() (more or less ok) nor to __str__() (not ok at all).
> Maybe I am overlooking some obvious point (again). If not, then this in fact means 2 issues:
>
> -1- The old ambiguity of str() meaning both "create an instance of type str from the given data" and "build a textual representation of the given object, through __str__", which has always been a semantic flaw for me, becomes concretely problematic when we have text that is not str.
> Well, I'm very surprised by this. How come this point doesn't seem to be well known? How is it even possible to use unicode without stepping on this problem? I guess this breaks years or even decades of habit for coders used to writing str() when they mean __str__().
>
> -2- How is it possible that __str__ does not work on a unicode object? It seems that the method is simply not implemented on the unicode type, and neither is __repr__, so that it falls back to str().
> Strangely enough, % interpolation works, which means that for both types of text a short circuit is used, namely return the text itself as is. I would have bet my last cents that % would simply delegate to __str__, or maybe that they were the same func in fact, synonyms, but obviously I was wrong!
>
> Looking for workarounds, I first tried to overload (or rather create) __str__ as in the U type above. But this solution is far from ideal because we still cannot use str() (I mean my fingers can type it while my head is who-knows-where). Also, it is in fact unusable for the following reason:
> =================================
> print c1.__class__
> print c1[1].__class__
> c3 = c1 ; print (c1+c3).__class__
> =
> <class '__main__.U'>
> <type 'unicode'>
> <type 'unicode'>
> ==================================
> Any operation returns a plain unicode instead of the original type, so the type would have to overload every possible text operation, which is a lot, just to convert the results back. Not to mention performance issues.
>
> So the only solution, it seems to me, is to use % everywhere and hunt down all str, __str__, __repr__ and such in all code.
>
> I hope I'm wrong on this. Please, give me a better solution ;-)
>
>
>
> ------
> la vita e estrany
>
>
>
>   
I'm not the one to help with this, because my unicode experience is 
rather limited.  But I think I know enough to ask a few useful questions.

1) What version of Python are you doing this on, what OS, and what code 
page is your stdout using?

2) What coding declaration do you have in your source file?  Without it, 
I can't even define those literals.  I added the line
#-*- coding: utf-8 -*-
as line 2 of my source file to get past that one.  But I really don't 
know much about this literal string that I pasted from your email.

3) Could you give us the hex equivalent of the 3-character string you're 
trying to give us in the email?  The only clue you gave us was that the 
bytes were between 129 and 254, which they aren't, on my machine, at 
least with a utf-8 coding declaration.
repr(u"¶ÿµ") -->  u'\xb6\xff\xb5'   length= 3
repr(c0) -->  '\xc2\xb6\xc3\xbf\xc2\xb5'  length = 6
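
A hex dump of the kind asked for in question 3 could be produced roughly
like this.  It is only a sketch, assuming the three characters are the
code points shown in the repr() output above; the names text,
code_points and utf8_hex are made up for the illustration.  The u""
prefix keeps it valid on Python 2 as well as on 3.3 and later.

```python
# Assumption: the string is the three code points from the repr() above.
text = u"\xb6\xff\xb5"

# Code points of the text itself (what a unicode object holds).
code_points = ["U+%04X" % ord(ch) for ch in text]

# Bytes of the same text once encoded as UTF-8 (what a byte string holds).
utf8_bytes = text.encode("utf-8")
# bytearray yields integers on both Python 2 and Python 3.
utf8_hex = " ".join("%02x" % b for b in bytearray(utf8_bytes))

print(code_points)
print(utf8_hex)
print("%d characters, %d bytes" % (len(text), len(utf8_bytes)))
```

Posting both forms makes it clear whether a 3-character text or its
6-byte UTF-8 encoding is being discussed.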

You say that __str__() isn't defined on unicode objects, but that's not 
the case, at least in 2.6.2.  It works fine on ASCII characters, but 
something causes an exception for your strings.  Since you're eating the 
exception, all you know is that something went wrong, not what went 
wrong.  My environment is probably totally different, but there I get 
the exception text: 'ascii' codec can't encode characters in position 
0-2: ordinal not in range(128)
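
To see what went wrong rather than just that something did, the handler
can keep the exception and print its message instead of "***".  A
minimal sketch, assuming (as the error text suggests) that the failure
is an encode with the 'ascii' codec; the encode is spelled out
explicitly here so the snippet behaves the same on Python 2 and 3.  The
helper name try_encode is hypothetical.

```python
# Assumption: the three non-ASCII characters from the original post.
text = u"\xb6\xff\xb5"

def try_encode(value, codec):
    """Return (encoded bytes, None) on success, or (None, error message) on failure."""
    try:
        return value.encode(codec), None
    except UnicodeEncodeError as exc:
        # Keep the message instead of discarding it.
        return None, str(exc)

ascii_result, ascii_error = try_encode(text, "ascii")
utf8_result, utf8_error = try_encode(text, "utf-8")

print(ascii_error)         # the codec's own explanation of the failure
print(utf8_error is None)  # an explicit UTF-8 encode does not fail
```

Choosing the codec explicitly, instead of relying on whatever default
is in effect, is one possible way around the error; whether UTF-8 is
the right choice depends on what the terminal expects.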

Incidentally, you'll probably save yourself a lot of grief in the long 
run if you change your editor to always expand tabs to spaces (4 per 
tab).  It's dangerous, and not recommended, to mix tabs and spaces in 
the same file, and it's surprising how often spaces get mixed in by 
accident.  In Python 3.x, mixing them inconsistently in indentation is 
an error.
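
The tab problem can be checked without running the code in question.
The sketch below assumes Python 3 and uses a made-up three-line source
string; check_indentation is a hypothetical helper that only compiles
the text and reports the class of any syntax error.

```python
# Hypothetical source: line 2 is indented with a tab, line 3 with eight spaces.
mixed_source = "if True:\n\tx = 1\n        y = 2\n"

def check_indentation(source):
    """Compile source without executing it; return the error class name, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:  # IndentationError and TabError are subclasses
        return type(exc).__name__
    return None

# The same text with each tab expanded to the next multiple of 8 columns.
spaces_only = mixed_source.expandtabs(8)

print(check_indentation(mixed_source))
print(check_indentation(spaces_only))
```

On Python 3 the first call reports a TabError, and the second returns
None because both body lines then carry the same eight-space indent.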

DaveA



