[XML-SIG] cgi unicode encoding problem

Tue Mar 22 01:12:59 CET 2005

This is probably more of a comp.lang.python question, since the XML
content seems low, but anyway ...

On Mon, 2005-03-21 at 23:22 +0000, James King wrote:
> Hi
> 
> I am trying to convert CGI data, which arrives encoded with escape 
> characters, into unicode data.
> 
> t represents the type of character data that I start with (the result 
> of fetching cgi-field data from a cgi.FieldStorage object).
> 
>  >>> t = '\x93quotation marks\x94, and a series of other characters: 
> \x91\xe5 \xdf \xa9 \xe6 \xee \x9c\x92'
>  >>> import codecs
>  >>> t.encode('utf-8')
> 
> Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)

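Not surprising. t is a byte string, so before Python can encode it to
UTF-8 it must first decode it to unicode. Given no other information, it
assumes the bytes are ASCII, which is obviously not true for this input
(as indicated by the error message).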
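
A rough sketch of that hidden step, written for Python 3 (an assumption
on my part; there byte strings are 'bytes' and have no encode() method,
so the ASCII decode has to be spelled out). The sample bytes are a
shortened version of your t:

```python
# Shortened sample of the CGI data, as raw bytes.
t = b'\x93quotation marks\x94'

error = None
try:
    # Python 2's t.encode('utf-8') did roughly this behind the scenes:
    # decode with the default ASCII codec first, then encode the result.
    t.decode('ascii').encode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print('failed at byte offset', exc.start, ':', exc.reason)
```

The failure comes from the decode half, not from the UTF-8 encode.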

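The problem is that you need to know the encoding of the input string.
The initial 0x93 byte is the giveaway: it is not ASCII, it is not a
printable character in any ISO-8859-* encoding (there it is a control
code), it cannot start a valid UTF-8 sequence, and the data does not
look like UTF-16. It looks like the Windows-specific codepage 1252
encoding, where 0x93 and 0x94 are curly double quotation marks, which
fits your hint that the text starts with quotation marks.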
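
One way to check a guess like this is to try the suspicious byte against
each candidate codec. A small sketch, again assuming Python 3 and its
standard codec names ('ascii', 'utf-8', 'iso-8859-1', 'cp1252'):

```python
import unicodedata

first = b'\x93'

# ASCII and UTF-8 both refuse to decode the byte on its own.
rejected = {}
for codec in ('ascii', 'utf-8'):
    try:
        first.decode(codec)
        rejected[codec] = False
    except UnicodeDecodeError:
        rejected[codec] = True
print(rejected)

# ISO-8859-1 accepts it, but only as an unprintable control character.
latin1_char = first.decode('iso-8859-1')
print(unicodedata.category(latin1_char))

# cp1252 maps it to an opening curly quotation mark.
cp1252_char = first.decode('cp1252')
print(unicodedata.name(cp1252_char))
```

This only shows which codecs are plausible; it does not prove what the
browser actually sent.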

        >>> s = unicode(t, encoding)

where 'encoding' is a string like 'iso-8859-1' or 'utf-8' or whatever
the input encoding is (for codepage 1252, Python's codec name is
'cp1252'). After that, statements like s.encode('utf-8') will make
sense.
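
Putting it together, a minimal sketch that assumes the form data really
is codepage 1252 (Python's codec name for it is 'cp1252'). It is written
for Python 3, where t.decode(encoding) plays the role of
unicode(t, encoding):

```python
# The bytes from your example, as they would arrive from the form.
t = (b'\x93quotation marks\x94, and a series of other characters: '
     b'\x91\xe5 \xdf \xa9 \xe6 \xee \x9c\x92')

# Step 1: decode the raw bytes using the (assumed) input encoding.
s = t.decode('cp1252')

# Step 2: now that s is real unicode text, encoding to UTF-8 works.
utf8 = s.encode('utf-8')

print(s)
print(utf8)
```

If the guess about the encoding is wrong, step 1 will either fail or
quietly produce the wrong characters, so it is worth checking the result
by eye.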

Cheers,
Malcolm