file.write() of non-ASCII characters behaves differently in the interactive interpreter than in a script run

Tue Aug 25 17:19:53 EDT 2015

Dear All,

I have run into behavior I cannot explain, and I have already spent many hours on it: `file.write('string')` raises an error when the code runs as a script (here, inside a WSGI application under Apache2) but not when I type the same statements at the interactive console. The error occurs only when the string contains non-ASCII characters. If the string is all ASCII, there is no error.

The following example shows what I see. I must be overlooking something, because I cannot believe Python treats interactive and script execution differently, and yet ... Could someone please take a look?

Thank you in advance.
René

Code extract from WSGI application (reply.py)
=============================================

    request_body = environ['wsgi.input'].read(request_body_size)    # bytes
    rb = request_body.decode()                                      # string
    d = parse_qs(rb)                                                # dict

    f = open('logbytes', 'ab')
    g = open('logstr', 'a')
    h = open('logdict', 'a')

    f.write(request_body)
    g.write(str(type(request_body)) + '\t' + str(type(rb)) + '\t' + str(type(d)) + '\n')
    h.write(str(d) + '\n')      <--- line 28 of the application

    h.close()
    g.close()
    f.close()

Tail of Apache2 error.log
=========================

[Tue Aug 25 20:24:04.657933 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575]   File "reply.py", line 28, in application
[Tue Aug 25 20:24:04.658001 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575]     h.write(str(d) + '\n')
[Tue Aug 25 20:24:04.658201 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position 15: ordinal not in range(128)
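
The same exception can be reproduced outside Apache. This is a minimal sketch, assuming the file is opened with an ASCII text encoding (forced here with `encoding='ascii'`). Whether the WSGI process really ends up with that default is an assumption; the log only shows that the 'ascii' codec was used. The temporary path is for illustration only.

```python
import os
import tempfile

d = {'userName': ['Ça va !']}
text = str(d) + '\n'

# Illustrative scratch location, not the real logdict file.
path = os.path.join(tempfile.mkdtemp(), 'logdict')

# Assumption: forcing ASCII mimics the encoding the server process used.
try:
    with open(path, 'a', encoding='ascii') as h:
        h.write(text)
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error)
```

With this input the exception reports position 15, which is the index of `Ç` in the string form of the dictionary, and the file stays empty.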

Checking what has been logged
=============================

rse@Alibaba:~/test$ cat logbytes
userName=Ça va !               <--- this was indeed the input (notice the
                                    French C with cedilla)
                                    Unicode U+00C7    ALT-0199    UTF-8 C3 87
                                    Reading the logbytes file, one can verify
                                    that Ç is indeed represented by the 2 bytes
                                    \xC3 and \x87
rse@Alibaba:~/test$ cat logstr
<class 'bytes'>    <class 'str'>    <class 'dict'>
rse@Alibaba:~/test$ cat logdict
rse@Alibaba:~/test$             <--- Obviously empty because of the error
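
The decoding steps from the extract can be replayed in memory to confirm that they are not where the failure happens. This is a sketch, assuming the logged bytes are exactly the request body and that `decode()` uses its UTF-8 default:

```python
from urllib.parse import parse_qs

request_body = b'userName=\xc3\x87a va !'   # bytes as seen in logbytes
rb = request_body.decode()                  # str, decoded as UTF-8 by default
d = parse_qs(rb)                            # dict mapping names to lists

print(type(request_body), type(rb), type(d))
print(d)
```

The three types match the content of `logstr`, so the input reaches the `h.write()` call as a proper `str`.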

Trying similar code within the Python interpreter
=================================================

rse@Alibaba:~/test$ python
Python 3.4.0 (default, Jun 19 2015, 14:18:46)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> di = {'userName': ['Ça va !']}    <--- A dictionary
>>> str(di)
"{'userName': ['Ça va !']}"           <--- and its string representation
>>> type(str(di))
<class 'str'>                   <--- Is a string indeed
>>> fi = open('essai', 'a')
>>> fi.write(str(di) + '\n')
26                              <--- It works well
>>> fi.close()
>>>
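
To compare the console with the server process, one could record which encoding `open()` selects when none is passed. This is a sketch, assuming Python 3, where a text file opened without an `encoding` argument falls back to an environment-dependent default. The helper name `default_text_encoding` and the probe file are hypothetical.

```python
import locale
import os
import tempfile

def default_text_encoding():
    """Return the locale's preferred encoding without changing the locale."""
    return locale.getpreferredencoding(False)

# Illustrative scratch file; only its .encoding attribute matters here.
path = os.path.join(tempfile.mkdtemp(), 'probe')
with open(path, 'a') as fi:
    file_encoding = fi.encoding   # encoding actually chosen by open()

print(default_text_encoding(), file_encoding)
```

Logging these two values from inside the WSGI application and from the console would show whether both environments really use the same default.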

Checking what has been written
==============================

rse@Alibaba:~/test$ cat essai
{'userName': ['Ça va !']}       <--- The result is correct
rse@Alibaba:~/test$

No error if all ASCII
=====================

If the input is `userName=Rene`, for instance, there is no error, and
`logdict` does contain the text of the dictionary
`{'userName': ['Rene']}`.