Unicode issue with Python v3.3

Sat Apr 13 00:50:45 EDT 2013

On Saturday, 13 April 2013 4:41:57 AM UTC+3, Cameron Simpson wrote:
> On 11Apr2013 09:55, Nikos <nagia.retsina at gmail.com> wrote:
> 
> | On Thursday, 11 April 2013 1:45:22 PM UTC+3, Cameron Simpson wrote:
> 
> | > On 10Apr2013 21:50, nagia.retsina at gmail.com <nagia.retsina at gmail.com> wrote:
> 
> | > | the doctype is coming from the attempt of the script metrites.py to open and read the 'index.html' file.
> 
> | > | But I don't know how to open it as a byte file instead of a text file.
> 
> 
> 
> Lele Gaifax showed one way:
> 
> 
> 
>     from codecs import open
> 
>     with open('index.html', encoding='utf-8') as f:
> 
>         content = f.read()
> 
> 
> 
> But a plain open() should also do:
> 
> 
> 
>     with open('index.html') as f:
> 
>         content = f.read()
> 
> 
> 
> if you're not taking tight control of the file encoding.
> 
> 
> 
> The point here is to get _text_ (i.e. str) data from the file, not bytes.
> 
> 
> 
> If the text turns out to be incorrectly decoded (i.e. incorrectly
> 
> reading the file bytes and assembling them into text strings) because
> 
> the default encoding is wrong, then you may need to reach for Lele's
> 
> more verbose open() example to select the correct encoding.
> 
> 
> 
> But first ignore that and get text (str) instead of bytes.
> 
> If you're already getting text from the file, something later is
> 
> making bytes and handing it to print().
> 
> 
> 
> Another approach to try is to use
> 
>   sys.stdout.write()
> 
> instead of
> 
>   print()
> 
> 
> 
> The print() function will take _anything_ and write text of some form.
> 
> The write() function will throw an exception if it gets the wrong type of data.
> 
> 
> 
> If sys.stdout is opened in binary mode then write() will require
> 
> bytes as data; strings will need to be explicitly turned into bytes
> 
> via .encode() in order to not raise an exception.
> 
> 
> 
> If sys.stdout is open in text mode, write() will require str data.
> 
> The sys.stdout file itself will transcribe to bytes for you.
> 
> 
> 
> If you take that route, at least you will not have confusion about
> 
> str versus bytes.
> 
> 
> 
> For an HTML output page I would advocate arranging that sys.stdout
> 
> is in text mode; that way you can do the natural thing and .write()
> 
> str data and lovely UTF-8 bytes will come out the other end.
> 
> 
> 
> If the above test (using .write() instead of print()) shows it to
> 
> be in binary mode we can fix that. But you need to find out.
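
A minimal sketch of the str-versus-bytes behaviour described above. It uses an in-memory io.BytesIO as a stand-in for a binary stdout (an assumption for illustration; in a real CGI script the underlying binary stream would be sys.stdout.buffer), and wraps it in a text layer that encodes to UTF-8:

```python
import io

# Hypothetical stand-in for a binary-mode stdout.
raw = io.BytesIO()

# A binary stream only accepts bytes; handing it a str raises TypeError.
try:
    raw.write("<p>υλικό</p>")
    binary_accepts_str = True
except TypeError:
    binary_accepts_str = False

# A text layer on top of the binary stream: it accepts str and
# encodes to UTF-8 on the way out.
text_out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

# A text stream only accepts str; handing it bytes raises TypeError.
try:
    text_out.write(b"<p>bytes</p>")
    text_accepts_bytes = True
except TypeError:
    text_accepts_bytes = False

# The natural case: write str, get UTF-8 bytes at the other end.
text_out.write("<p>υλικό</p>\n")
text_out.flush()
encoded = raw.getvalue()
```

The same wrapping idea, io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8"), is one possible way to force a text-mode UTF-8 stdout in a script, whatever the default encoding of the environment is.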
> 
> 
> 
> You will want access to the error messages from the CGI environment;
> 
> do you have access to the web server's error_log? You can tail that
> 
> in a terminal while you reload the page to see what's going on.
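
One way to make sure a failure leaves a trace in that log is to write the traceback to stderr, which a web server normally copies into its error log for CGI scripts. The following is only a sketch under that assumption; run_page and render are hypothetical names, not part of metrites.py:

```python
import sys
import traceback


def run_page(render, out=sys.stdout, err=sys.stderr):
    """Call render() and write a CGI response to out.

    Any exception raised by render() is written to err as a full
    traceback, and a short placeholder page is sent instead.
    """
    try:
        body = render()
    except Exception:
        traceback.print_exc(file=err)
        body = "<p>Internal error; see the server log.</p>"
    # CGI header block, then a blank line, then the page body.
    out.write("Content-Type: text/html; charset=utf-8\n\n")
    out.write(body)
```

Calling run_page(some_function) at the bottom of the script keeps the browser from seeing a bare failure while the details go to the log.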
> 
> 
> 
> | This works in the shell, but doesn't work on my website:
> 
> | 
> 
> | $ cat utf8.txt
> 
> | υλικό!Πρόκειται γ
> 
> 
> 
> Ok, so your terminal is using UTF-8 as its output coding. (And so
> 
> is your mail posting program, since we see it unmangled on my screen
> 
> here.)
> 
> 
> 
> | $ python3
> 
> | Python 3.2.3 (default, Oct 19 2012, 20:10:41)
> 
> | [GCC 4.6.3] on linux2
> 
> | Type "help", "copyright", "credits" or "license" for more information.
> 
> | >>> data = open('utf8.txt').read()
> 
> | >>> print(data)
> 
> | υλικό!Πρόκειται γ
> 
> 
> 
> Likewise.
> 
> 
> 
> However, in an exciting twist, I seem to recall that Python invoked
> 
> interactively with a terminal as output will have the default terminal
> 
> encoding in place on sys.stdout. Producing what you expect. _However_,
> 
> python invoked in a batch environment where stdout is not a terminal
> 
> (such as in the CGI environment producing your web page), that is
> 
> _not_ necessarily the case.
> 
> 
> 
> | >>> print(data.encode('utf-8'))
> 
> | b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'
> 
> | 
> 
> | See, the last line is what i'am getting on my website.
> 
> 
> 
> The above line takes your Unicode text in "data" and transcribes
> 
> it to bytes using UTF-8 as the encoding. And print() is then receiving
> 
> that bytes object and printing its str() representation as "b'....'".
> 
> That str is itself unicode, and when print passes it to sys.stdout,
> 
> _that_ transcribes the unicode "b'...'" string as bytes to your
> 
> terminal. Using UTF-8 based on the previous examples above, but
> 
> since all those characters are in the ASCII range (0-127) the
> 
> byte sequence will be the same if it uses ASCII or ISO8859-1 or
> 
> almost anything else:-)
> 
> 
> 
> As you can see, there's a lot of encoding/decoding going on behind
> 
> the scenes even in this superficially simple example.
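
The chain described above can be reproduced in a few lines. This is a small sketch using the first word of the sample text; the variable names are only illustrative:

```python
data = "υλικό!"

# Encoding turns the text into a bytes object.
encoded = data.encode("utf-8")

# print(encoded) writes str(encoded), i.e. the "b'...'" representation,
# which is itself ordinary text made only of ASCII characters.
shown = str(encoded)

# Decoding with the same codec recovers the original text.
roundtrip = encoded.decode("utf-8")
```

So a page that shows b'\xcf\x85...' has received the printable representation of a bytes object, not the Greek text itself.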
> 
> 
> 
> | If i remove
> 
> | the encode('utf-8') part in metrites.py, the webpage will not show
> 
> | anything at all...
> 
> 
> 
> Ah, but data will be being output. The print() function _will_ be
> 
> writing "data" out in some form.  I suggest you remove the .encode()
> 
> and then examine the _source_ text of the web page, not its visible
> 
> form.
> 
> 
> 
> So: remove .encode(), reload the web page, "view page source"
> 
> (depends on your browser; it is Ctrl-U in Firefox, or Cmd-U in Firefox on a Mac).
> 
> 
> 
> I think a lot of the issue you have in this thread is that your
> 
> page is too complex. Make another page to do the same thing, and
> 
> start with nothing. Add stuff to it a single item at a time until
> 
> the page behaves incorrectly. Then you will know the exact item of
> 
> code that introduced the issue. And then that single item can be
> 
> examined in detail for the decode/encode issues.
> 
> 
> 
> The other issue in the thread is that people losing patience get
> 
> snarky. Respond only to the technical content. If a message is only
> 
> snarky, _ignore_ it. People like the last word; let them have it
> 
> and you won't get sidetracked into arguments.
> 
> 
> 
> Cheers,
> 
> -- 
> 
> Cameron Simpson <cs at zip.com.au>
> 
> 
> 
> PCs are like a submarine, it will work fine till you open Windows. - zollie101

First of all, thank you very much, Cameron, for your detailed help and for the effort you put into writing to me.

It seems there was another issue I was not aware of: I was uploading files to /root/public_html/cgi-bin instead of /home/nikos/public_html/cgi-bin.

I realized that when I deliberately introduced an error into the metrites.py script and still got the same page.

OK, after correcting that, I tried the plain open() solution and got this response back from the shell:

Traceback (most recent call last):
  File "metrites.py", line 213, in <module>
    htmldata = f.read()
  File "/root/.local/lib/python2.7/lib/python3.3/encodings/iso8859_7.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xae in position 47: character maps to <undefined>

Then I switched to:

		with open('/home/nikos/www/' + page, encoding='utf-8') as f:
			htmldata = f.read()

and I got no error at all; the script ran cleanly *from the shell*!
But I get an Internal Server Error when I try to load the web page from the browser (Chrome).
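
The difference between the two open() calls can be reproduced in isolation. This is a minimal sketch, assuming a UTF-8 file containing Greek text; the sample string and the temporary file are made up for illustration and are not the real index.html:

```python
import os
import tempfile

# The letter 'ή' is stored as the bytes CE AE in UTF-8, and byte 0xAE
# has no mapping in the ISO-8859-7 codec.
sample = "Καλή μέρα"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with the wrong codec fails on the unmapped byte.
    try:
        with open(path, encoding="iso8859_7") as f:
            f.read()
        wrong_codec_failed = False
    except UnicodeDecodeError:
        wrong_codec_failed = True

    # Reading with the codec the file was written in succeeds.
    with open(path, encoding="utf-8") as f:
        htmldata = f.read()
```

A plain open() without an encoding argument picks a default that depends on the environment, so naming the encoding explicitly avoids the script behaving differently in the shell and under the web server.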

So, can you please tell me where I can find the Apache error log, so that I can post it here?

Is the Apache error_log always better than running 'python3 metrites.py', because even if the Python script has no errors, Apache will also report additional web-related problems?

Thank you Cameron.