Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jun 6 22:13:39 EDT 2013


On Thu, 06 Jun 2013 11:46:20 -0700, Νικόλαος Κούρας wrote:

> On Thursday, 6 June 2013 3:44:52 PM UTC+3, user Steven D'Aprano
> wrote:
> 
>> py> s = '999-Eυχή-του-Ιησού' 
>> py> bytes_as_utf8 = s.encode('utf-8') 
>> py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace') 
>> py> print(t)
>> 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
> 
> errors='replace' means don't break in case of an error?

Please try reading the documentation for yourself before asking for help.

http://docs.python.org/3/library/stdtypes.html#bytes.decode


Yes, errors='replace' will mean that any time there is a decoding error, 
the official Unicode "U+FFFD REPLACEMENT CHARACTER" will be used instead 
of raising an error. Read the docs above, and follow the link, for more 
information.
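
A minimal sketch of how the error handlers differ. The byte string is made 
up for illustration and is not taken from the thread:

```python
# 0xff can never appear in valid UTF-8, so this data cannot be decoded cleanly.
data = b"abc\xffdef"

# The default handler, errors='strict', raises on the bad byte.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = data.decode("utf-8", errors="replace")
print("replace:", repr(replaced))

# errors='ignore' silently drops the bad byte instead.
ignored = data.decode("utf-8", errors="ignore")
print("ignore:", repr(ignored))
```

Note that 'replace' does not recover the original text. It only keeps the 
decoder from raising.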


> You took the unicode
> 's' string you utf-8 bytestringed it. 

The word is "encoded".

Encoding: Unicode string => bytes
Decoding: bytes => Unicode string


> Then how is it possible to ask for
> the utf8-bytestring to decode back to a unicode string using a
> different charset than the one used for encoding, and this actually
> printed the filename in greek-iso?

Bytes are bytes, no matter where they come from. Bytes don't remember 
whether they were from a Unicode string, or a float, or an integer, or a 
list of pointers. All they know is that they are a sequence of values, 
each 8 bits wide.

So bytes don't remember what charset (encoding) made them. If I have a 
set of bytes, I can *try* to do anything I like with them:

* decode those bytes as ASCII
* decode those bytes as UTF-8
* decode those bytes as ISO-8859-7
* decode those bytes as a list of floats
* decode those bytes as a binary tree of pointers

If the bytes are not actually ASCII, or UTF-8, etc., then I will get 
garbage, or an error.
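
As a rough sketch of the list above, here are the same eight bytes read 
four different ways. The sample word and the choice of little-endian 32-bit 
floats are assumptions made for the example:

```python
import struct

raw = "ευχή".encode("utf-8")   # 4 Greek letters become 8 bytes
print(raw.hex())

# As ASCII: an error, because every byte here is above 0x7f.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print("ascii failed:", ascii_failed)

# As UTF-8: the original text, because that is how the bytes were made.
as_utf8 = raw.decode("utf-8")
print("utf-8:", as_utf8)

# As ISO-8859-7: no exception with errors='replace', but garbage.
as_greek = raw.decode("iso-8859-7", errors="replace")
print("iso-8859-7:", repr(as_greek))

# As two little-endian 32-bit floats: also legal, also meaningless.
as_floats = struct.unpack("<2f", raw)
print("floats:", as_floats)
```

Nothing in `raw` records which of these readings is the intended one.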


>> So that demonstrates part of your problem: even though your Linux
>> system is using UTF-8, your terminal is probably set to ISO-8859-7. The
>> interaction between these will lead to strange and disturbing Unicode
>> errors.
> 
> Yes, I feel this is the problem too.
> It's a wonder to me why PuTTY used greek-iso by default instead of
> utf-8!

PuTTY is probably getting the default charset from the Windows 8 system 
you are using, and Windows is probably using Greek ISO-8859-7 for 
compatibility with legacy data going back to Windows 95 or even DOS.

Someday everyone will use UTF-8, and this nonsense will be over.


> Please explain this to me, because now that I am beginning to
> understand these encode/decode things I am beginning to like them!

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

http://nedbatchelder.com/text/unipain.html



> a) WHAT does it mean when a linux system is set to use utf-8?

The Linux file system just treats file names as bytes. Every byte except 
0x00 and 0x2f (ASCII '\0' and '/') is legal in file names, so the Linux 
file system will store any other bytes.

But the applications on a Linux system don't work with bytes, they work 
with text strings. You want to see a file name like "My Music.mp3", not 
bytes like 0x4d79204d757369632e6d7033. So the applications need to know 
how to encode their text strings (file names) into bytes, and how to 
decode the file system bytes back into strings.
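
A small sketch of that round trip, assuming UTF-8 as the encoding. 
os.fsencode and os.fsdecode are the standard library functions Python 
itself uses to convert file names with the file system encoding:

```python
import os

name = "My Music.mp3"

# Text -> bytes, as an application would do before handing the name to the OS.
on_disk = name.encode("utf-8")
print(on_disk.hex())

# Bytes -> text, as an application would do when listing a directory.
back = on_disk.decode("utf-8")
print(back)

# The same conversion using whatever encoding Python detected for file names.
fs_bytes = os.fsencode(name)
fs_text = os.fsdecode(fs_bytes)
print(fs_bytes, fs_text)
```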

On Linux, there is a standard setting for doing this, the locale, which 
by default is set to use UTF-8 as the standard encoding. So well-behaved 
Linux applications will, directly or indirectly, interpret the bytes-on-
disk in file names as UTF-8, because that's what the locale tells them to 
do.
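
To see what a particular system is configured to use, Python can report the 
encodings it picked up from the environment. This is only a sketch, and the 
values printed depend entirely on the machine it runs on:

```python
import codecs
import locale
import sys

# Encoding Python uses to convert file names to and from bytes.
fs_encoding = sys.getfilesystemencoding()

# Encoding suggested by the current locale settings.
locale_encoding = locale.getpreferredencoding(False)

print("file system encoding:", fs_encoding)
print("locale encoding:     ", locale_encoding)

# codecs.lookup() normalises the name, e.g. 'UTF-8' and 'utf8' give 'utf-8'.
print("normalised:          ", codecs.lookup(fs_encoding).name)
```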

On Windows, there is a completely different setting for doing this, 
probably in the Registry.


> b) WHAT does it mean when a terminal client is set to use utf-8? 

Terminals need to accept bytes from the keyboard, and display them as 
text to the user. So they need to know what encoding to use to change 
bytes like 0x4d79204d757369632e6d7033 into something that is readable to 
a human being, "My Music.mp3". That is the encoding.


> c) WHAT happens when the two of them try to work together?

If they are set to the same encoding, everything just works.

If they are set to different encodings, you will probably have problems, 
just as you are having problems.
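
A toy simulation of both cases. The helper shown_in_terminal is 
hypothetical, and a real terminal is more complicated than a single decode 
call, but the principle is the same:

```python
def shown_in_terminal(name, system_encoding, terminal_encoding):
    """Encode a name the way the system stores it, then decode it the way
    the terminal displays it. Hypothetical helper for illustration only."""
    raw = name.encode(system_encoding)
    return raw.decode(terminal_encoding, errors="replace")

name = "Ιησού"

# Both sides agree on UTF-8: the name survives the trip.
same = shown_in_terminal(name, "utf-8", "utf-8")
print(same)

# System writes UTF-8, terminal reads ISO-8859-7: mojibake.
mixed = shown_in_terminal(name, "utf-8", "iso-8859-7")
print(repr(mixed))
```

The bytes are identical in both calls. Only the interpretation differs.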


> nikos at superhost.gr [~/www/cgi-bin]# echo $LS_OPTIONS 
> --color=tty -F -a -b -T 0
> 
> Is this okay? Is the '-b' option for displaying a filename in binary
> mode?

That's fine.


> Indeed I have changed PuTTY to use 'utf-8' and 'ls -l' now displays the
> file in correct Greek letters. When I switch PuTTY's encoding back to
> 'greek-iso', the *displayed* filenames show up as mojibake.
> 
> WHAT is being displayed and what is actually stored as bytes are two
> different things, right?

Correct.

The bytes 0x20 0x40 mean " @" (space at-sign) in ASCII or UTF-8 (and 
also many other encodings), but they mean CJK UNIFIED IDEOGRAPH-4020 in 
little-endian UTF-16, they are invalid in UTF-32, and they mean the 
number 16416 as a little-endian 16-bit integer. Bytes are just sequences 
of 8-bit values. The *meaning* of those 8-bit values depends on you, not 
the bytes themselves.
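
The two bytes 0x20 0x40 can be checked directly. In this sketch the byte 
order is an explicit choice, since a 16-bit reading depends on it:

```python
import struct
import unicodedata

raw = bytes([0x20, 0x40])

# ASCII (and UTF-8): two characters, space and at-sign.
as_ascii = raw.decode("ascii")
print(repr(as_ascii))

# UTF-16, little-endian: one character, code point U+4020.
as_utf16_le = raw.decode("utf-16-le")
print(unicodedata.name(as_utf16_le))

# UTF-16, big-endian: a different character, code point U+2040.
as_utf16_be = raw.decode("utf-16-be")
print(hex(ord(as_utf16_be)))

# UTF-32 needs four bytes per character, so two bytes cannot be decoded.
try:
    raw.decode("utf-32-le")
    utf32_ok = True
except UnicodeDecodeError:
    utf32_ok = False
print("utf-32 ok:", utf32_ok)

# Unsigned 16-bit integer, little-endian and big-endian.
int_le = struct.unpack("<H", raw)[0]
int_be = struct.unpack(">H", raw)[0]
print(int_le, int_be)
```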


> So the way the filename is displayed in the terminal depends on the
> encoding the terminal uses, correct? But no matter *how* it's being
> displayed, those two are the same file?

That's a hard question to answer. Sometimes yes, but not necessarily. It 
will depend on how the terminal works, and how confused it gets.



-- 
Steven


