sys.stdout, urllib and unicode... I don't understand.

Tue Nov 11 17:12:38 EST 2008

Thierry wrote:
> Hello fellow pythonists,
> 
> I'm a relatively new Python developer, and I am trying to adjust my
> understanding of "how things work" to Python, but I have hit a
> block that I cannot get past.
> I needed to output unicode data returned by a web service, and could
> not get unicode/multibyte text back until I applied a hack that I
> don't understand (thank you, Google).
> 
> I have written a simple wxPython application that takes the user's
> input, sends it to a web service, and gets back translations in
> several languages.
> The service itself is fully UTF-8.
> 
> The "source" string is first encoded to "latin1" after passing through
> unicodedata.normalize(), as urllib.quote() cannot work on unicode
>>> srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')

urllib.quote() operates on byte strings. If your web service is UTF-8,
it would make sense to use UTF-8 as the input encoding rather than
latin1, wouldn't it? For example: unicodeinput.encode("utf-8")
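
A rough sketch of what I mean. The sample string is made up, and the
import fallback is only there so the snippet also runs on newer Pythons
where quote() lives in urllib.parse:

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

# Hypothetical user input, already a unicode object.
src_text = u"caf\u00e9 cr\u00e8me"

# Encode to the service's encoding first, then percent-escape the bytes.
utf8_bytes = src_text.encode("utf-8")
quoted = quote(utf8_bytes)

# For comparison: the same text via latin1 yields different escapes,
# which a UTF-8 service would not decode correctly.
quoted_latin1 = quote(src_text.encode("latin-1"))
```

Here quoted ends up containing %C3%A9 for the e-acute, while the latin1
version contains %E9.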

> After that, an urllib request is sent with this encoded string to the
> web service
>>> con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
> 
>>> req=urllib2.urlopen(con)
> 
> First problem: how do I determine the encoding of the response?

It is sent as part of the response headers, e.g.
Content-Type: text/html; charset=utf-8
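
One way to pull the charset out of such a header value, as a sketch
only. The helper name and the UTF-8 fallback are my own assumptions; the
parsing is done by the standard email.message.Message class:

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value.

    Hypothetical helper: falls back to `default` (an assumption, pick
    what suits your service) when the header names no charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


encoding = charset_from_content_type("text/html; charset=UTF-8")
```

You would feed it the Content-Type value taken from the response
headers and then use the result as the argument to decode().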

> If I inspect a request from Firefox, I see that the server's response
> header specifies UTF-8.
> But if I use this code:
>>> ret=U''
>>> for line in req:
>>>  ret=ret+string.replace(line.strip(),'\n',chr(10))
> I end up with a UnicodeDecodeError. I tried various line.decode(),
> line.normalize and such, but could not make this error disappear.
> Until now I have avoided the problem, as the service always seems to
> return 1 line, but I am wondering.

The web server's answer is an encoded byte stream too (usually UTF-8,
but you can check the headers), so

line.decode("utf-8") should give you unicode to operate on (always do
string operations on the decoded, canonical form).
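
Roughly like this. The io.BytesIO object is just a stand-in I made up
for the urlopen() response so the snippet runs without a network, and I
assume the charset is UTF-8:

```python
import io

# Stand-in for the response object: a byte stream holding two UTF-8 lines.
fake_response = io.BytesIO(u"Gr\u00fc\u00dfe\nna\u00efve\n".encode("utf-8"))

parts = []
for line in fake_response:
    text = line.decode("utf-8")   # bytes -> unicode, done first
    parts.append(text.strip())    # string operations on unicode only

ret = u" ".join(parts)
```

Because every piece is decoded before it is joined, byte strings and
unicode never get mixed and no implicit conversion is attempted.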

> Second problem: if I put a
>>> print line
> inside the loop, I get the same error too. I thought that unicode()
> would force Python to consider the given text as unicode, not try to
> convert it to unicode.

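But converting is exactly what it does. Basically, unicode() is a
constructor for unicode objects: given a byte string and no encoding, it
decodes with the default codec (ASCII), which fails on non-ASCII bytes.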
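
To illustrate, a small sketch with made-up sample bytes. Under Python 2
with default settings, unicode(raw) without an encoding amounts to an
ASCII decode, which is spelled out explicitly here so the snippet also
runs on Python 3:

```python
# Sample UTF-8 bytes, as they would arrive from the server.
raw = u"\u00e9t\u00e9".encode("utf-8")

try:
    raw.decode("ascii")          # the implicit conversion, made explicit
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False             # non-ASCII bytes cannot be decoded as ASCII

# Naming the real encoding works.
text = raw.decode("utf-8")
```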

> Here again, trying several normalize/decode combinations did not help
> at all.

It's not too complicated: you just need to keep unicode and byte strings
separate and draw a clean line between the two. (The line is decode()
and encode().)

> Then, looking for help through google, I have found this post:
> http://mail.python.org/pipermail/python-list/2007-October/462977.html
> and I gave it a try. What I did, though, was not to override
> sys.stdout, but to declare a new writer stream as a property of my
> main class:
>>> self.out=OutStreamEncoder(sys.stdout, 'utf-8')
> 

This is fancy, but not needed if you take care as described above.
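
Without a wrapper class, the output side can be as simple as encoding
right before you write. A sketch only: write_text is a name I made up,
and io.BytesIO stands in for the real byte stream (sys.stdout on
Python 2, sys.stdout.buffer on Python 3):

```python
import io


def write_text(byte_stream, text, encoding="utf-8"):
    """Encode unicode text explicitly and write the bytes (hypothetical helper)."""
    byte_stream.write(text.encode(encoding))


out = io.BytesIO()                    # stand-in for the terminal's byte stream
write_text(out, u"r\u00e9sultat\n")   # unicode in, UTF-8 bytes out
written = out.getvalue()
```

The encoding passed here should match what your terminal expects; UTF-8
is only my assumption.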

HTH
Tino