sys.stdout, urllib and unicode... I don't understand.

Wed Nov 12 03:52:55 EST 2008

On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:

> I have realized an wxPython simple application, that takes the input of
> a user, send it to a web service, and get back translations in several
> languages.
> The service itself is fully UTF-8.
> 
> The "source" string is first encoded to "latin1" after a passage into
> unicode.normalize(), as urllib.quote() cannot work on unicode
>>>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')

If the service uses UTF-8, why don't you just encode the data you send as 
UTF-8 instead of Latin-1, which potentially throws away data because of the 
'ignore' argument!?  Make that ``src_text = src_text.encode('utf-8')``.
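
A minimal sketch of the difference, assuming a made-up input string (the 
value of `src_text` is only an example).  The import fallback is there so 
the snippet also runs on Pythons where `quote()` lives in `urllib.parse`:

```python
try:
    from urllib import quote        # Python 2, as used in the original code
except ImportError:
    from urllib.parse import quote  # newer Pythons moved it here

# Hypothetical user input: the euro sign cannot be represented in Latin-1.
src_text = u'caf\xe9 \u20ac'

# Latin-1 with 'ignore' silently drops the euro sign.
latin1_bytes = src_text.encode('latin1', 'ignore')

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = src_text.encode('utf-8')

# quote() is applied to the encoded bytes, not to the unicode object.
quoted = quote(utf8_bytes)
print(quoted)
```

The quoted value is what would go into the request URL; the Latin-1 
variant has already lost a character before quoting even starts.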

>>>req=urllib2.urlopen(con)
> 
> First problem, how to determine the encoding of the return ? If I
> inspect a request from firefox, I see that the server return header
> specify UTF-8
> But if I use this code:
>>>ret=U''
>>>for line in req:
>>>  ret=ret+string.replace(line.strip(),'\n',chr(10))
> I end up with an UnicodeDecodeError.

Because `line` contains bytes and `ret` is a `unicode` object.  If you 
add a `unicode` object and a `str` object, Python tries to convert the 
`str` to `unicode` using the default encoding, which is ASCII.  This fails 
if there are byte values > 127.  *You* have to decode `line` from a bunch 
of bytes to a bunch of (unicode) characters before you concatenate the 
strings.
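
Roughly like this, assuming the response really is UTF-8 as the header 
claims.  `raw_lines` is a hypothetical stand-in for the lines you get from 
iterating over `req`:

```python
# Stand-in for the lines read from the response: raw UTF-8 encoded bytes.
raw_lines = [b'caf\xc3\xa9\n', b'na\xc3\xafve\n']

# Decoding as ASCII fails on byte values > 127.  This is the conversion
# Python 2 attempts implicitly when a byte string meets a unicode object.
try:
    raw_lines[0].decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decode each line explicitly, then join the resulting unicode pieces.
ret = u''.join(line.decode('utf-8') for line in raw_lines)
```

Joining the decoded pieces once also avoids building a new string on every 
pass through the loop.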

BTW: ``line.strip()`` removes all whitespace at both ends *including 
newlines*, so there are no '\n' to replace anymore.  And functions in the 
`string` module that are also implemented as methods on `str` or `unicode` 
are deprecated.
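
A small illustration with a made-up line of text.  Note also that 
``chr(10)`` *is* the newline character, so the original call would replace 
'\n' with itself:

```python
# Hypothetical line as it might arrive, with trailing whitespace and newline.
line = u'  translated text \r\n'

# strip() removes whitespace at both ends, newlines included.
stripped = line.strip()

# chr(10) and '\n' are the same character.
newline_is_chr10 = (chr(10) == '\n')

# The method form replaces string.replace(); here it finds nothing to replace.
replaced = stripped.replace(u'\n', u' ')
```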

Ciao,
	Marc 'BlackJack' Rintsch