sys.stdout, urllib and unicode... I don't understand.
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Wed Nov 12 03:52:55 EST 2008
On Tue, 11 Nov 2008 12:18:26 -0800, Thierry wrote:
> I have written a simple wxPython application that takes the input of
> a user, sends it to a web service, and gets back translations in several
> languages.
> The service itself is fully UTF-8.
>
> The "source" string is first encoded to "latin1" after passing through
> unicodedata.normalize(), as urllib.quote() cannot work on unicode
>>>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')
If the service uses UTF-8, why don't you just encode the data you send as
UTF-8 instead of Latin-1, which potentially throws away data because of the
'ignore' argument!? Make that ``src_text = src_text.encode('utf-8')``.
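
To make the data loss concrete, here is a minimal sketch. It is written in
Python 3 spelling (``urllib.parse.quote`` rather than the ``urllib.quote`` of
the original code), and the sample string is an assumption chosen for
illustration, not data from the actual service.

```python
import unicodedata
from urllib.parse import quote, unquote

# Hypothetical sample input: "café €" (e with acute accent, euro sign).
src_text = "caf\u00e9 \u20ac"

# The original approach: NFKD splits the accented letter into "e" plus a
# combining accent, and 'ignore' then silently drops everything that
# Latin-1 cannot represent (the combining accent and the euro sign).
lossy = unicodedata.normalize("NFKD", src_text).encode("latin1", "ignore")

# Encoding as UTF-8 keeps every character; quote() accepts the bytes.
quoted = quote(src_text.encode("utf-8"))

# The percent-encoded form decodes back to the original text.
round_trip = unquote(quoted, encoding="utf-8")
```

With UTF-8 nothing is thrown away, so the service receives exactly what the
user typed.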
>>>req=urllib2.urlopen(con)
>
> First problem: how do I determine the encoding of the response? If I
> inspect a request from Firefox, I see that the server's response header
> specifies UTF-8.
> But if I use this code:
>>>ret=U''
>>>for line in req:
>>> ret=ret+string.replace(line.strip(),'\n',chr(10))
> I end up with a UnicodeDecodeError.
Because `line` contains bytes and `ret` is a `unicode` object. If you
add a `unicode` object and a `str` object, Python tries to convert the
`str` to `unicode` using the default encoding, which is ASCII. This fails
if there are byte values > 127. *You* have to decode `line` from a bunch
of bytes to a bunch of (unicode) characters before you concatenate the
strings.
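
A rough sketch of "decode first, then concatenate", again in Python 3
spelling (where mixing bytes and text raises a ``TypeError`` instead of
attempting an implicit ASCII decode). The helper names
``charset_from_content_type`` and ``decode_lines``, the header value, and the
sample byte lines are all made up for illustration; a real response object
would supply the header and the lines.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is given.
    return msg.get_content_charset() or default


def decode_lines(raw_lines, encoding):
    """Decode each line of bytes to text before joining the pieces."""
    return "".join(line.decode(encoding) for line in raw_lines)


# Hypothetical stand-ins for the response header and body lines.
encoding = charset_from_content_type("text/html; charset=UTF-8")
raw_lines = [b"caf\xc3\xa9\n", b"na\xc3\xafve\n"]
text = decode_lines(raw_lines, encoding)
```

The fallback to UTF-8 when the header names no charset is an assumption
based on the service being described as "fully UTF-8".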
BTW: ``line.strip()`` removes all whitespace at both ends *including
newlines*, so there are no '\n' to replace anymore. And functions in the
`string` module that are also implemented as methods on `str` or `unicode`
are deprecated.
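
A small illustration of both points, with made-up sample strings. It also
shows that ``chr(10)`` is the newline character itself, so the replacement in
the quoted code would not have changed anything even if a newline had
survived.

```python
line = "  translated text \r\n"

# strip() removes leading and trailing whitespace, newlines included.
stripped = line.strip()
has_newline = "\n" in stripped

# chr(10) and "\n" are the same one-character string.
same_char = chr(10) == "\n"

# Use the method on the string itself instead of string.replace(...).
joined = "first\nsecond".replace("\n", " ")
```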
Ciao,
Marc 'BlackJack' Rintsch