utf8 encoding problem

Wichert Akkerman wichert at wiggy.net
Thu Jan 22 05:35:49 EST 2004


I'm struggling with what should be a trivial problem but I can't seem to 
come up with a proper solution: I am working on a CGI that takes utf-8
input from a browser. The input is nicely encoded so you get something
like this:

  firstname=t%C3%A9st

where %C3%A9 is a single character in utf-8 encoding. Passing this
through urllib.unquote does not help:

  >>> urllib.unquote(u't%C3%A9st')
  u't%C3%A9st'

The problem turned out to be that urllib.unquote() processes
its input character by character which breaks when it tries to call
chr() for a character: it gets a character which is not valid ascii
(outside the legal range) or valid unicode (it's only half a utf-8
character) and as a result it fails:

   >>> chr(195) + u""
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

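To illustrate the point about half a utf-8 character: the two escapes
%C3 and %A9 only make sense when they are turned into bytes first and
the resulting byte string is decoded as a whole. Below is a minimal
sketch of that idea, assuming a current Python 3 interpreter (not the
interpreter used in the session above), where
urllib.parse.unquote_to_bytes does the percent-decoding. The helper
name decode_field is made up for this example.

```python
from urllib.parse import unquote_to_bytes


def decode_field(quoted, encoding="utf-8"):
    """Percent-decode a form value to bytes, then decode the bytes as text.

    decode_field is a hypothetical helper, not part of urllib.
    """
    # Step 1: turn every %XX escape into a raw byte; no text decoding yet.
    raw = unquote_to_bytes(quoted)
    # Step 2: decode the complete byte string, so that multi-byte
    # sequences such as C3 A9 are seen together.
    return raw.decode(encoding)


name = decode_field("t%C3%A9st")

# Decoding only the first byte of the pair fails, which mirrors the
# chr(195) error shown above.
try:
    bytes([0xC3]).decode("utf-8")
    half_ok = True
except UnicodeDecodeError:
    half_ok = False
```

Here name ends up as a four-character string with U+00E9 in the second
position, and half_ok is False. On an old Python 2 interpreter the same
two-step idea would presumably be to pass a byte string (not a unicode
object) to urllib.unquote and call .decode('utf-8') on the result, but
that is an assumption and has not been checked against that version.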

I can't seem to find a working method to do this conversion correctly.
Can someone point me in the right direction? (and please cc me on
replies since I'm not currently subscribed to this list/newsgroup).

Wichert.

-- 
Wichert Akkerman <wichert at wiggy.net>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.




