Help needed with python unicode cgi-bin script

Tue Dec 11 14:54:59 EST 2007

On Dec 12, 4:46 am, "weheh" <we... at verizon.net> wrote:
> Hi John:
> Thanks for responding.
>
> >Look at your file using
>
>  >   print repr(open('c:/test/spanish.txt','rb').read())
>
> >If you see 'a\xf1o' then use charset="windows-1252"
>
> I did this ... no change ... still see 'a\xf1o'

So it's not utf-8, it's windows-1252, so stop lying to browsers: like
I said, use charset="windows-1252"
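
For reference, here is a small sketch (Python 3 syntax, not the OP's script) of what the same three-letter word looks like as bytes in each encoding. The variable names are only for illustration.

```python
# "a\u00f1o" is the word from the test file: a, n-with-tilde, o.
text = "a\u00f1o"

# One byte per character in windows-1252.
cp1252_bytes = text.encode("windows-1252")

# In utf-8 the n-with-tilde takes two bytes.
utf8_bytes = text.encode("utf-8")

print(repr(cp1252_bytes))  # b'a\xf1o'
print(repr(utf8_bytes))    # b'a\xc3\xb1o'
```

If repr() of your file shows the first form, the file is windows-1252; the second form means utf-8.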

>
> >else if you see 'a\xc3\xb1o' then use charset="utf-8" else ????
> >Based on your responses to Martin, it appears that your file is
> >actually windows-1252 but you are telling browsers that it is utf-8.
> >Another check: if the file is utf-8, then doing
>
>  >   open('c:/test/spanish.txt','rb').read().decode('utf8')
>
> >should be OK; if it's not valid utf8, it will complain.
>
> No. this causes decode error:
>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-4: invalid
> data

No what? YES, the "decode error" is complaining that the data supplied
is NOT valid utf-8 data. So it's not utf-8, it's windows-1252, so stop
lying to browsers: like I said, use charset="windows-1252"
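
The check can be wrapped up so the exception becomes a plain yes/no answer. This is a rough sketch in Python 3 syntax; is_valid_utf8 is a name made up here, not a library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as utf-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Same error as in the traceback: the bytes are not valid utf-8.
        return False
    return True


print(is_valid_utf8(b"a\xf1o"))      # False
print(is_valid_utf8(b"a\xc3\xb1o"))  # True
```

Note that pure ASCII data also passes this test, so True means "could be utf-8", while False is conclusive.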

>       args = ('utf8', 'a\xf1o', 1, 5, 'invalid data')
>       encoding = 'utf8'
>       end = 5
>       object = 'a\xf1o'
>       reason = 'invalid data'
>       start = 1
>
> >Yet another check: open the file with Notepad. Do File/SaveAs, and
> >look at the Encoding box -- ANSI or UTF-8?
>
> Notepad says it's ANSI

That's correct (in Microsoft jargon) -- it's NOT utf-8. It's
windows-1252, so stop lying to browsers: like I said, use
charset="windows-1252"

>
> Thanks. What now?

Listen to the Bellman: "What I tell you three times is true".
Your file is encoded using windows-1252, NOT utf-8.
You need to use charset="windows-1252".
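
In a CGI script that amounts to something like the following sketch. I haven't seen your script, so send_page and its parameters are invented for illustration, and the syntax is Python 3. The point is only that the charset in the header matches the bytes that follow it.

```python
import io


def send_page(body: bytes, charset: str, out) -> None:
    """Write a CGI header declaring `charset`, then the body bytes unchanged.

    `out` is any binary stream (hypothetical helper, not part of any library).
    """
    header = "Content-Type: text/html; charset=%s\r\n\r\n" % charset
    out.write(header.encode("ascii"))
    out.write(body)


# Demonstration with an in-memory stream instead of a real web server.
demo = io.BytesIO()
send_page(b"a\xf1o", "windows-1252", demo)
print(repr(demo.getvalue()))
```

In the real script you would pass the bytes read from the file in 'rb' mode and sys.stdout.buffer as the stream.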

> Also, this is a general problem for me, whether I read
> from a file or read from an html text field, or read from an html text area.
> So I'm looking for a general solution. If it helps to debug by reading from
> textarea or text field, let me know.

If you are creating a file, you should know what its encoding is. As I
said earlier, *every* file is encoded -- so-called "Unicode" files on
Windows are encoded using utf16le. If you don't explicitly specify the
encoding, it will typically be the default encoding for your locale
(e.g. cp1252 in Western Europe etc).
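
A short sketch (Python 3 syntax) of both halves of that: asking what the locale default is, and avoiding the question altogether by naming the encoding when you write the file. save_text and raw_bytes are names invented for this example.

```python
import locale
import os
import tempfile

# What you get when no encoding is specified; varies by machine.
print(locale.getpreferredencoding(False))


def save_text(path, text, encoding):
    """Write text with an explicitly chosen encoding (hypothetical helper)."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def raw_bytes(path):
    """Read the file back as raw bytes so the encoding is visible."""
    with open(path, "rb") as f:
        return f.read()


demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
save_text(demo_path, "a\u00f1o", "utf-16-le")
print(repr(raw_bytes(demo_path)))  # b'a\x00\xf1\x00o\x00'
```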

If you are reading a file created by others and its encoding is not
known, you will have to inspect the file and/or guess (using whatever
knowledge you have about the language/locale of the creator).
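
One crude way to do the guessing is to try a short list of likely encodings in order and take the first that decodes without error. This is only a sketch (Python 3 syntax); guess_encoding and its default candidate list are assumptions, and the list should reflect what you know about where the file came from.

```python
def guess_encoding(data, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes `data` cleanly, else None.

    Hypothetical helper. Put strict encodings such as utf-8 first:
    single-byte encodings accept almost any input, so a match on
    one of them proves very little.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"a\xf1o"))      # cp1252
print(guess_encoding(b"a\xc3\xb1o"))  # utf-8
```

A clean decode only says the bytes are legal in that encoding, not that the result is the text the creator intended, so look at the decoded output as well.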

"whether I ... read from an html text field, or read from an html text
area": isn't that what "charset" is for?
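
Browsers normally send form data back in the encoding of the page that contained the form, so the charset you declare is also the one to decode with. A sketch of that, assuming a windows-1252 page and a made-up field called "word" (Python 3 syntax; decode_form is a hypothetical helper):

```python
from urllib.parse import parse_qs


def decode_form(query_string, page_charset):
    """Decode a URL-encoded form submission using the page's charset."""
    return parse_qs(query_string, encoding=page_charset, errors="strict")


# %F1 is the windows-1252 byte for n-with-tilde.
fields = decode_form("word=a%F1o", "windows-1252")
print(ascii(fields))  # {'word': ['a\xf1o']}
```

If the page were served as utf-8 instead, the same field would arrive as a%C3%B1o and should be decoded with "utf-8".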

HTH,
John