Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jun 6 08:44:52 EDT 2013


On Tue, 04 Jun 2013 02:00:43 -0700, Νικόλαος Κούρας wrote:

> On Tuesday, 4 June 2013 11:47:01 AM UTC+3, user Steven D'Aprano
> wrote:
> 
>> Please run these commands, and show what result they give:
[...]
> nikos at superhost.gr [~/www/data/apps]# alias ls 
> alias ls='/bin/ls $LS_OPTIONS'

And what does 

echo $LS_OPTIONS


give?

[...]
> Seems that the way the system used to actually rename the file matters.

Yes. This is where you get interactions between different systems that 
use different encodings, and they don't work well together.

Some day, everything will use UTF-8, and these problems will go away.


>> If all else fails, you could just rename the troublesome file and
>> hopefully the problem will go away:
>> mv *Ο.mp3 1.mp3
>> mv 1.mp3 Eυχή του Ιησού.mp3
> 
> Yes, but why are you doing it in 2 steps and not as:
> 
> mv *Ο.mp3 'Eυχή του Ιησού.mp3'

I don't remember. I had a reason that made sense at the time, but I can't 
remember what it was.


I think I can reproduce your problem. If I open a terminal, set to use 
UTF-8, I can do this:

[steve at ando ~]$ cd /tmp
[steve at ando tmp]$ touch '999-Eυχή-του-Ιησού'
[steve at ando tmp]$ ls 999*
999-Eυχή-του-Ιησού


Now if I change the terminal to use Greek ISO-8859-7, and hit UP-ARROW to 
grab the previous command line from history, the *displayed* file name 
changes, but the actual file being touched remains the same:

[steve at ando tmp]$ touch '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ'
[steve at ando tmp]$ ls 999*
999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


In Python 3.3, I can demonstrate the same sort of thing:

py> s = '999-Eυχή-του-Ιησού'
py> bytes_as_utf8 = s.encode('utf-8')
py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
py> print(t)
999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


So that demonstrates part of your problem: even though your Linux system 
is using UTF-8, your terminal is probably set to ISO-8859-7. The 
interaction between these will lead to strange and disturbing Unicode 
errors.
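
If you are wondering why the mojibake has fewer visible characters than 
there are bytes, here is a rough sketch that looks at the first three 
Greek letters one byte at a time. The helper name classify is my own 
invention for this illustration, and it assumes Python's iso-8859-7 
codec behaves the way the session above suggests:

```python
import unicodedata

def classify(byte):
    """Describe what one byte becomes when read as ISO-8859-7."""
    try:
        ch = bytes([byte]).decode('iso-8859-7')
    except UnicodeDecodeError:
        # No character is assigned to this byte in ISO-8859-7.
        return 'unassigned'
    if unicodedata.category(ch) == 'Cc':
        # Control characters are normally invisible in a terminal.
        return 'control'
    return ch

# Each Greek letter is two bytes in UTF-8, so three letters give six bytes.
result = [classify(b) for b in 'υχή'.encode('utf-8')]
print(result)
```

Some bytes turn into visible Greek capitals, some into invisible control 
characters, and any byte with no assignment is what shows up as the 
replacement character when you decode with errors='replace'.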


To continue: back in the terminal set to ISO-8859-7, if instead of using 
the history line I re-copy and paste the file name:

[steve at ando tmp]$ touch '999-Eυχή-του-Ιησού'
[steve at ando tmp]$ ls 999*
999-E???-???-?????  999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


So now I end up with two files, one with a file name that is utter 
garbage bytes, and one that is only a little better, being mojibake.

Resetting the terminal to use UTF-8 at least now restores the *display* 
of the earlier file's name:

[steve at ando tmp]$ ls 999*
999-E???-???-?????  999-Eυχή-του-Ιησού
[steve at ando tmp]$ ls -b 999*
999-E\365\367\336-\364\357\365-\311\347\363\357\375  999-Eυχή-του-Ιησού

but the other file name is still made of garbage bytes.
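
Those octal escapes are not random. As a sketch (octal_escape is a 
made-up helper that only roughly imitates ls -b, not what ls really 
does internally), you can check in Python that they are simply the 
ISO-8859-7 encoding of the name, and that those bytes are not valid 
UTF-8:

```python
name = '999-Eυχή-του-Ιησού'
greek_bytes = name.encode('iso-8859-7')

def octal_escape(raw):
    # Printable ASCII passes through unchanged; every other byte is
    # shown as a backslash followed by three octal digits.
    return ''.join(chr(b) if 32 < b < 127 else '\\%03o' % b for b in raw)

escaped = octal_escape(greek_bytes)
print(escaped)

try:
    greek_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError as err:
    valid_utf8 = False
    print('not valid UTF-8:', err.reason)
```

So the "garbage" file name is a perfectly good ISO-8859-7 name. It only 
looks like garbage to anything that expects UTF-8.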


So I believe I understand how your file name has become garbage. To fix 
it, make sure that your terminal is set to use UTF-8, and then rename it. 
Do the same with every file in the directory until the problem goes away.

(If one file has garbage bytes in the file name, chances are that more 
than one do.)
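
If there are a lot of files, you might script it. This is only a sketch 
under a big assumption: that every name which fails to decode as UTF-8 
was really written as ISO-8859-7. The function names (repair_name, 
plan_renames, apply_renames) are hypothetical, not anything from the 
standard library. Look at the plan before you apply it.

```python
import os

def repair_name(raw, legacy_encoding='iso-8859-7'):
    """Return the UTF-8 form of a legacy-encoded name, or None to leave it."""
    try:
        raw.decode('utf-8')
        return None  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode(legacy_encoding).encode('utf-8')
    except UnicodeDecodeError:
        return None  # not valid in the assumed encoding either; don't guess

def plan_renames(directory):
    """List (old, new) byte-string names for entries that are not UTF-8."""
    plan = []
    # Passing bytes to os.listdir gives back the raw file names as bytes.
    for raw in os.listdir(os.fsencode(directory)):
        fixed = repair_name(raw)
        if fixed is not None:
            plan.append((raw, fixed))
    return plan

def apply_renames(directory, plan):
    """Carry out the renames, refusing to overwrite an existing file."""
    base = os.fsencode(directory)
    for old, new in plan:
        target = os.path.join(base, new)
        if os.path.exists(target):
            print('skipping, target exists:', new.decode('utf-8'))
            continue
        os.rename(os.path.join(base, old), target)
```

Note that this will not repair names that are valid UTF-8 but wrong, 
only the ones made of bytes that are not UTF-8 at all.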


-- 
Steven


