Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jun 3 02:46:46 EDT 2013


On Sun, 02 Jun 2013 22:05:28 -0700, Νικόλαος Κούρας wrote:

> Why subprocess fails when it has to deal with a greek filename? and that
> an indirect call too....

It doesn't. The command you are calling fails, not subprocess.
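
To make the distinction concrete, here is a minimal sketch. The child 
command is a throwaway one I made up for illustration, not your script. 
check_output raises CalledProcessError whenever the child exits with a 
non-zero status, and the exception carries the child's own output:

```python
import subprocess
import sys

# Hypothetical child process: prints a message, then exits with status 1.
child = [sys.executable, "-c",
         "import sys; print('child failed'); sys.exit(1)"]


def run_child(cmd):
    """Return (returncode, output) of cmd, whether or not it succeeds."""
    try:
        return 0, subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        # subprocess did its job; it is the *child* that reported failure.
        return err.returncode, err.output


returncode, output = run_child(child)
print(returncode, output)
```

So the traceback tells you that the script being called went wrong, and 
the captured output is where to look for the reason.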


The code you show is this:


 /home/nikos/public_html/cgi-bin/metrites.py in ()
    217                 template = htmldata + counter
    218         elif page.endswith('.py'):
=>  219                 htmldata = subprocess.check_output( '/home/nikos/public_html/cgi-bin/' + page )
    220                 template = htmldata.decode('utf-8').replace( 'Content-type: text/html; charset=utf-8', '' ) + counter



The first step is to inspect the value of the file name. Normally I would 
just call print, but since this is live code, and a web server, you 
probably don't want to use print directly. But you can print to a file, 
and then inspect the file. Using logging is probably better, but here's a 
real quick and dirty way to get the same result:

elif page.endswith('.py'):
    name = '/home/nikos/public_html/cgi-bin/' + page
    # ascii() escapes non-ASCII characters, so writing cannot fail on odd names
    with open('/home/nikos/out.txt', 'w') as f:
        print(ascii(name), file=f)
    htmldata = subprocess.check_output(name)



Now inspect /home/nikos/out.txt using the text editor of your choice. What 
does it contain? Is the file name of the executable what you expect? Does 
it exist, and is it executable?


The next step, after checking that, is to check the executable .py file. 
It may contain a bug which is causing this problem. However, I think I 
can guess what the nature of the problem is.


The output you show includes:

    cmd = '/home/nikos/public_html/cgi-bin/files.py'
    output = b'Content-type: text/html; charset=utf-8\n\n<bod...n position 74: surrogates not allowed\n\n-->\n\n'


My *guess* of your problem is this: your file names have invalid bytes in 
them, when interpreted as UTF-8.

Remember, on a Linux system, file names are stored as bytes. So the file 
name as a string needs to be *encoded* into bytes. My *guess* is that 
somehow, when renaming your files, you gave them a name which may be 
correctly encoded in some other encoding, but not in UTF-8. Then, when 
Python reads the file names as UTF-8, it hits bytes it cannot decode. 
Python 3 turns each such byte into a lone surrogate code point, and 
everything blows up as soon as that string has to be encoded as UTF-8 
again, for example when it is printed.

Something like this:

py> s = "Νικόλαος Κούρας"
py> b = s.encode('ISO-8859-7')  # Oh oh, wrong encoding!
py> print(b)
b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2 \xca\xef\xfd\xf1\xe1\xf2'
py> b.decode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 0: invalid continuation byte


Obviously the error is a little different, because the original string is 
different and because I am decoding the bytes by hand here.
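
To get closer to the "surrogates not allowed" message in your output, 
here is a sketch of what I believe happens. I am assuming the names were 
written as ISO-8859-7, which is only a guess. Python 3 decodes file names 
with the "surrogateescape" error handler, so bad bytes do not fail on the 
way in; they fail later, when the string is encoded strictly:

```python
name = "Νικόλαος"
raw = name.encode("ISO-8859-7")  # assumed: bytes on disk, wrong encoding

# Roughly what Python 3 does with file names on a UTF-8 system:
# undecodable bytes become lone surrogates instead of raising an error.
decoded = raw.decode("utf-8", "surrogateescape")
print(ascii(decoded))

try:
    decoded.encode("utf-8")  # strict encoding, as when printing the name
    message = None
except UnicodeEncodeError as err:
    message = str(err)
    print(message)

# The escaped string still round-trips to the original bytes, which is
# why Python can open the file even though it cannot print its name.
roundtrip = decoded.encode("utf-8", "surrogateescape")
```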

If I am right, the solution is to fix the file names to ensure that they 
are all valid UTF-8 names. If you view the directory containing these 
files in a file browser that supports UTF-8, do you see any file names 
containing Mojibake?

http://en.wikipedia.org/wiki/Mojibake


Fix those file names, and hopefully the problem will go away.
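
If there are too many files to fix by hand, a small script can find and 
rename them. Treat this as a sketch under assumptions: the two function 
names are mine, and ISO-8859-7 as the original encoding is a guess you 
must verify first. Try it on a copy of the directory, not the live one.

```python
import os


def find_bad_names(directory):
    """Yield file names (as bytes) in directory that are not valid UTF-8."""
    # Passing bytes to os.listdir makes it return the raw names as bytes.
    for raw in os.listdir(os.fsencode(directory)):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield raw


def fix_name(directory, raw, source_encoding="ISO-8859-7"):
    """Rename one file from source_encoding to UTF-8; return the new name.

    source_encoding is an assumption: check that the decoded name looks
    right before renaming anything for real.
    """
    new = raw.decode(source_encoding).encode("utf-8")
    dir_bytes = os.fsencode(directory)
    os.rename(os.path.join(dir_bytes, raw), os.path.join(dir_bytes, new))
    return new


# Usage, on a *copy* of the directory:
#     for raw in list(find_bad_names('/path/to/copy')):
#         print(raw, '->', fix_name('/path/to/copy', raw).decode('utf-8'))
```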



-- 
Steven


