Help needed with python unicode cgi-bin script

Mon Dec 10 18:29:42 EST 2007

On Dec 11, 9:55 am, "weheh" <we... at verizon.net> wrote:
> Hi Martin, thanks for your response. My updates are interleaved with your
> response below:
>
> > What is the encoding of that file? Without a correct answer to that
> > question, you will not be able to achieve what you want.
>
> I don't know for sure the encoding of the file. I'm assuming it has no
> intrinsic encoding since I copied the word "año" into vim and then saved it
> as the example text file called, "spanish.txt".

Every text file is encoded, and very few of them are tagged with the name
of the encoding in any reliable fashion.

>
> > Possible answers are "iso-8859-1", "utf-8", "windows-1252", and "cp850"
> > (these all support the word "año")
>
> >> Instead of seeing "año" I see "a?o".
>
> > I don't see anything here. Where do you see the question mark? Did you
> > perhaps run the CGI script in a web server, and pointed your web browser
> > to the web page, and saw the question mark in the web browser?
>
> The cgi-bin scripts prints to stdout, i.e. to my browser, and when I use
> print I see a square box where the ñ should be. When I use print repr(...) I
> see 'a\xf1o'. I never see the desired 'ñ' character.
>
> > Sending "Content-type: text/html" is not enough. The web browser needs
> > to know what the encoding is. So you should send
>
> > Content-type: text/html; charset="your-encoding-here"
>
> Sorry, somehow my cut and paste job into outlook missed the exact line you
> had above that specifies encoding to be set as "utf8", but it's there in my
> program. Not to worry.
>
> > Use "extras/page information" in Firefox to find out what the web
> > browser thinks the encoding of the page is.
>
> Firefox says the page is UTF8.
>
> > P.S. Please, stop shouting.
>
> OK, it's just that it hurts when I've been pulling my hair out for days on
> end over a single line of code. I don't want to go bald just yet.

Forget for the moment what you see in the browser. You need to find
out how your file is encoded.

Look at your file using
    print repr(open('c:/test/spanish.txt','rb').read())

If you see 'a\xf1o' then use charset="windows-1252" else if you see
'a\xc3\xb1o' then use charset="utf-8" else ????
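
For reference, here is a small sketch of what the bytes of "año" look like
under each of the encodings Martin listed. It is written for Python 3 (an
assumption; the one-liner above is Python 2), and the names WORD, CANDIDATES
and encoded_forms are made up for illustration:

```python
# Sketch (Python 3): how the word "año" is stored under each candidate encoding.
# WORD, CANDIDATES and encoded_forms are illustrative names, not from the thread.
WORD = "a\u00f1o"  # "año", written with an escape so this source file stays ASCII
CANDIDATES = ["iso-8859-1", "windows-1252", "utf-8", "cp850"]


def encoded_forms(word=WORD, encodings=CANDIDATES):
    """Return a dict mapping each encoding name to the encoded bytes of word."""
    return {name: word.encode(name) for name in encodings}


for name, raw in encoded_forms().items():
    # repr() of the bytes is what you compare against the repr() of your file
    print("%-12s %r" % (name, raw))
```

Note that iso-8859-1 and windows-1252 store this word identically, so seeing
'a\xf1o' rules out utf-8 and cp850 but cannot tell those two apart.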

Based on your responses to Martin, it appears that your file is
actually windows-1252 but you are telling browsers that it is utf-8.
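
If that is indeed what is happening, one possible way out is to decode the
file with its real encoding and re-encode the output to match the charset
you declare. A minimal sketch, assuming Python 3 and assuming the file really
is windows-1252; build_response and its parameters are hypothetical names,
not anything from your script:

```python
# Sketch (Python 3): make the declared charset and the bytes sent agree.
# build_response is a hypothetical helper; the default encodings are assumptions.
def build_response(raw, file_encoding="windows-1252", out_charset="utf-8"):
    """Decode raw file bytes, then return CGI header plus body as bytes."""
    text = raw.decode(file_encoding)   # bytes from disk -> unicode text
    body = text.encode(out_charset)    # unicode text -> bytes for the browser
    header = "Content-Type: text/html; charset=%s\r\n\r\n" % out_charset
    return header.encode("ascii") + body


# 'a\xf1o' is what the file reportedly contains; show what would be sent.
print(repr(build_response(b"a\xf1o")))
```

The other option is to leave the bytes alone and declare
charset="windows-1252" instead; either way the header and the body have to
agree.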

Another check: if the file is utf-8, then doing
    open('c:/test/spanish.txt','rb').read().decode('utf8')
should be OK; if it's not valid utf8, it will complain by raising
UnicodeDecodeError.
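
The same check wrapped up as a function, as a rough sketch for Python 3
(is_valid_utf8 is an illustrative name, not a library function):

```python
# Sketch (Python 3): test whether a byte string is valid UTF-8.
# is_valid_utf8 is an illustrative name, not part of any library.
def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"a\xc3\xb1o"))  # the UTF-8 form of the word
print(is_valid_utf8(b"a\xf1o"))      # the windows-1252 form of the word
```

A file that passes this is not guaranteed to be utf-8 (pure ASCII passes
too), but a file that fails it definitely is not.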

Yet another check: open the file with Notepad. Do File/SaveAs, and
look at the Encoding box -- ANSI or UTF-8?

HTH,
John