sys.stdout, urllib and unicode... I don't understand.

Tue Nov 11 15:18:26 EST 2008

Hello fellow pythonists,

I'm a relatively new Python developer, and I am trying to adjust my
understanding of "how things work" to Python, but I have hit a
block that I cannot get past.
I needed to output Unicode data returned by a web service, and could
not get the Unicode/multibyte text back until I applied a hack that I
don't understand (thank you, Google).

I have written a simple wxPython application that takes the user's
input, sends it to a web service, and gets back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first passed through unicodedata.normalize() and
then encoded to "latin1", as urllib.quote() cannot work on unicode:
>>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')
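
To make the effect of that line concrete, here is a minimal sketch comparing it with encoding to UTF-8 before quoting. The helper names are hypothetical, and whether the service accepts percent-encoded UTF-8 in the request is an assumption based on the service being described as fully UTF-8.

```python
import unicodedata

try:
    from urllib import quote        # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def quote_latin1_lossy(text):
    # Same transformation as in the post: decompose the text, then
    # silently drop every character that latin1 cannot represent.
    raw = unicodedata.normalize('NFKD', text).encode('latin1', 'ignore')
    return quote(raw)


def quote_utf8(text):
    # Alternative (an assumption, not the post's method): keep every
    # character by encoding to UTF-8 bytes before percent-quoting.
    raw = text.encode('utf-8')
    return quote(raw)
```

With the first helper, accents are stripped and text outside latin1 is dropped entirely, so a Japanese or Russian source string would reach the service empty. The second helper keeps every character.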

After that, a urllib2 request is sent with this encoded string to the
web service:
>>con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')

>>req=urllib2.urlopen(con)

First problem: how do I determine the encoding of the response?
If I inspect a request from Firefox, I see that the server's response
header specifies UTF-8.
But if I use this code:
>>ret=U''
>>for line in req:
>>  ret=ret+string.replace(line.strip(),'\n',chr(10))
I end up with a UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disappear.
Until now I have avoided the problem, as the service always seems to
return a single line, but I am still wondering.
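
A likely cause of the error in the loop: the lines read from the response are byte strings, and in Python 2 adding a byte string to a unicode string makes Python decode the bytes with the ASCII codec, which fails on any non-ASCII byte. The sketch below reads the charset from a Content-Type header value and decodes each line explicitly before joining. The function names are hypothetical, and the fallback to UTF-8 is an assumption taken from the description of the service.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    # Parse a Content-Type value such as "text/html; charset=UTF-8".
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns the charset in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset() or default


def decode_lines(byte_lines, charset):
    # Decode every byte line explicitly, strip it, then join the text.
    return u''.join(line.decode(charset).strip() for line in byte_lines)
```

With urllib2, the header value should be available as req.info().getheader('Content-Type'), and the result could be built with decode_lines(req, charset) instead of concatenating undecoded lines.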

Second problem: if I put a
>>print line
inside the loop, I get the same error. I thought that unicode() would
force Python to treat the given text as unicode, not try to
convert it to unicode.
Here again, trying several normalize/decode combinations did not help
at all.
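
On the unicode() point: the constructor always converts bytes through a codec, and when no codec is named Python 2 falls back to ASCII. A small sketch of passing the codec explicitly follows. The sample bytes are made up for illustration, and text_type is a hypothetical alias so the same lines run on Python 2 and 3.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3

raw = b'caf\xc3\xa9'   # UTF-8 bytes: five bytes, four characters

# Naming the codec tells the constructor how to interpret the bytes.
decoded = text_type(raw, 'utf-8')

# A third argument selects the error policy; 'replace' substitutes
# U+FFFD for bytes that cannot be decoded instead of raising.
lenient = text_type(b'caf\xc3', 'utf-8', 'replace')
```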

Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/python-list/2007-October/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
>>self.out=OutStreamEncoder(sys.stdout, 'utf-8')
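
OutStreamEncoder comes from the linked post and is not reproduced here. As a rough comparison only, the standard library offers a similar wrapper through codecs.getwriter. In the sketch below an io.BytesIO object stands in for a byte-oriented stdout, and make_utf8_writer is a hypothetical helper name.

```python
import codecs
import io


def make_utf8_writer(byte_stream):
    # codecs.getwriter returns a StreamWriter class; calling it wraps
    # the stream so that text written to it is encoded to UTF-8 first.
    return codecs.getwriter('utf-8')(byte_stream)


sink = io.BytesIO()             # stand-in for a byte-oriented stdout
out = make_utf8_writer(sink)
untouched = sink.getvalue()     # nothing is written by wrapping alone
out.write(u'caf\xe9\n')         # encoded to UTF-8 on the way through
```

With this standard-library wrapper, building the writer leaves the wrapped stream as it was. Only text written through the wrapper is encoded.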

But what is strange is that since I did that, even without using this
self.out writer, the unicode translations work as I was
expecting them to. The exception is the for loop, where the
concatenation still triggers the UnicodeDecodeError exception.
I know the "explicit is better than implicit" python motto, and I
really like it.
But here, I don't understand what is going on.

Does defining that writer object somehow initialize the standard
sys.stdout object?
Is it related to some internal use of it, maybe in urllib?
I tried to find more on the subject, but fell short.
Can someone explain to me what is happening?
The full script source can be found at http://www.webalis.com/translator/translator.pyw