Changing filenames from Greeklish => Greek (subprocess complain)

Cameron Simpson cs at zip.com.au
Fri Jun 7 04:53:04 EDT 2013


On 07Jun2013 09:56, Νίκος Γκρ33κ <nikos.gr33k at gmail.com> wrote:
| On 7/6/2013 4:01 am, Cameron Simpson wrote:
| >On 06Jun2013 11:46, Νίκος Γκρ33κ <nikos.gr33k at gmail.com> wrote:
| >| On Thursday, 6 June 2013 3:44:52 PM UTC+3, Steven D'Aprano wrote:
| >| > py> s = '999-Eυχή-του-Ιησού'
| >| > py> bytes_as_utf8 = s.encode('utf-8')
| >| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| >| > py> print(t)
| >| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| >|
| >| errors='replace' means don't break in case of error?
| >
| >Yes. The result will be correct for correct iso-8859-7 and slightly mangled
| >for something that would not decode smoothly.
|
| How can it be correct? We have encoded our string in utf-8 and then
| we tried to decode it as greek-iso? How can this possibly be
| correct?

Ok, not correct. But consistent. Safe to call.

If it is a valid iso-8859-7 sequence (which covers almost everything,
since iso-8859-7 is an 8-bit mapping from byte values to a set of
codepoints, much like iso-8859-1, though it leaves a few byte values
such as 0xAE unassigned) then it may decode to the "wrong" characters,
but the reverse process (characters encoded as bytes) should produce
the original bytes.  Only the unassigned bytes are decoding errors,
and errors='replace' turns each of those into U+FFFD, the "�" in
Steven's output above; that replacement does not encode back to the
original byte.  Of course another string may hit more such bytes, or
none at all.
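
A minimal sketch of that round trip in Python 3, reusing Steven's
example string. The short name 'του' and all the variable names are my
own additions for illustration:

```python
s = '999-Eυχή-του-Ιησού'
utf8_bytes = s.encode('utf-8')

# A strict decode shows whether every byte is assigned in iso-8859-7.
try:
    utf8_bytes.decode('iso-8859-7')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' never raises; each unassigned byte becomes U+FFFD.
mangled = utf8_bytes.decode('iso-8859-7', errors='replace')
# U+FFFD cannot be encoded back to the byte it replaced, so this is lossy.
lossy = mangled.encode('iso-8859-7', errors='replace') != utf8_bytes

# A name whose UTF-8 bytes are all assigned in iso-8859-7 round-trips intact.
short_bytes = 'του'.encode('utf-8')
short_back = short_bytes.decode('iso-8859-7').encode('iso-8859-7')

print(strict_ok, lossy, short_back == short_bytes)
```

So "safe to call" and "reversible" are separate questions: the
errors='replace' decode always succeeds, but it only reverses cleanly
when no byte needed replacing.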

| >| You took the unicode 's' string and utf-8 bytestringed it.
| >| Then how is it possible to ask for the utf8-bytestring to decode
| >| back to a unicode string with the use of a different charset than
| >| the one used for encoding, and this actually printed the filename
| >| in greek-iso?
| >
| >It is easily possible, as shown above. Does it make sense? Normally
| >not, but Steven is demonstrating how your "mv" exercises have
| >behaved: a rename using utf-8, then a _display_ using iso-8859-7.
|
| Same as above, i don't understand it at all, since different
| charsets(encodings) used in the encode/decode process.

Visually, the names will be garbage. And if you go:

  mv '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' '999-Eυχή-του-Ιησού.mp3'

while using the iso-8859-7 locale, the wrong thing will occur
(assuming it even works, though I think it should because all these
characters are represented in iso-8859-7, yes?)

Why?

In the iso-8859-7 locale, your (currently named under a utf-8
regime) file looks like '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ.mp3' (because the
UTF-8 byte sequence maps to those characters in iso-8859-7). When
you issue the above "mv" command, the new name will be the _iso-8859-7_
bytes encoding for '999-Eυχή-του-Ιησού.mp3'.  Which, under a utf-8
regime, will decode to _other_ characters.
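
To make that concrete, here is a small sketch (Python 3, variable names
are illustrative only) comparing the bytes the two regimes would put on
disc for the same visible name:

```python
name = '999-Eυχή-του-Ιησού.mp3'

# What "mv" would store when the name is typed in an iso-8859-7 locale.
iso_bytes = name.encode('iso-8859-7')
# What a utf-8 regime expects to find on disc for the same characters.
utf8_bytes = name.encode('utf-8')

same_bytes = iso_bytes == utf8_bytes

# How a utf-8 terminal or tool would interpret the iso-8859-7 bytes.
seen_as_utf8 = iso_bytes.decode('utf-8', errors='replace')

print(same_bytes, len(iso_bytes), len(utf8_bytes))
print(seen_as_utf8 == name)
```

The two byte strings differ, and the iso-8859-7 one is not even valid
UTF-8, so a utf-8 listing cannot show the name you typed.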

If you want to repair filenames, by which I mean, cause them to be correctly
encoded for utf-8, you are best to work in utf-8 (using "mv" or python).

Of course, the badly named files will then look wrong in your listing.

If you _know_ the filenames were written using iso-8859-7 encoding,
and that the names are "right" under that encoding, you can write
python code to rename them to utf-8.

Totally untested example code:

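  import os
  import sys
  from binascii import hexlify

  # A bytes directory name makes os.listdir() return raw bytes names.
  for bytename in os.listdir(b'.'):
    # Interpret the existing name as iso-8859-7, then re-encode as utf-8.
    unicode_name = bytename.decode('iso-8859-7')
    new_bytename = unicode_name.encode('utf-8')
    print("%s: %s => %s" % (unicode_name, hexlify(bytename), hexlify(new_bytename)), file=sys.stderr)
    # Pure ASCII names are identical in both encodings; leave them alone.
    if new_bytename != bytename:
      os.rename(bytename, new_bytename)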

That code should not care what locale you are using because it uses
bytes for the file calls and is explicit about the encoding/decoding
steps.
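
One hazard with a loop like the one above is running it twice, or over
a directory where some names are already utf-8. A hedged sketch of a
guard follows; the helper name repair_name and the "already valid
UTF-8 means leave it alone" rule are my assumptions, not something the
filesystem tells you, and a short iso-8859-7 name could in principle
also happen to be valid UTF-8:

```python
def repair_name(bytename, source_encoding='iso-8859-7'):
    """Return the utf-8 bytes for a filename stored in source_encoding.

    Heuristic (an assumption): a name that already decodes as UTF-8,
    which includes pure ASCII, is taken to be correct and is returned
    unchanged.
    """
    try:
        bytename.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: reinterpret as the legacy encoding and re-encode.
        return bytename.decode(source_encoding).encode('utf-8')
    return bytename


legacy = 'του.mp3'.encode('iso-8859-7')
print(repair_name(legacy) == 'του.mp3'.encode('utf-8'))
print(repair_name(repair_name(legacy)) == repair_name(legacy))
print(repair_name(b'plain.txt'))
```

Inside the rename loop you would compare repair_name(bytename) with
bytename and call os.rename() only when they differ.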

| >| a) WHAT does it mean when a linux system is set to use utf-8?
| >
| >It means the locale settings _for the current process_ are set for
| >UTF-8. The "locale" command will show you the current state.
|
| That means that, when a linux application needs to save a filename
| to the linux filesystem, the app checks the filesystem's 'locale', so
| to encode the filename using the utf-8 charset ?

At the command line, many will not. They'll just read and write bytes.

Some will decode/encode. Those that do, should by default use the
current locale.

But broadly, it is GUI apps that care about this because they must
translate byte sequences to glyphs: images of characters. So plenty
of command line tools do not need to care; the terminal application
is the one that presents the names to you; _it_ will decode them
for display. And it is the terminal app that translates your
keystrokes into bytes to feed to the command line.

NOTE: it is NOT the filesystem's locale. It is the current process'
locale, which is deduced from environment variables (which have
defaults if they are not set).
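
A minimal sketch of how a Python 3 process can inspect this for itself.
The output depends entirely on the environment the script is started
from, so none is shown:

```python
import locale
import os
import sys

# Environment variables consulted for the character encoding, highest
# priority first. Unset ones show as None.
for var in ('LC_ALL', 'LC_CTYPE', 'LANG'):
    print('%s=%r' % (var, os.environ.get(var)))

# Adopt whatever locale the environment describes for this process.
try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    # The environment names a locale that is not installed on this host.
    pass

# Encoding implied by the process locale, and the one Python will use
# when converting filenames between str and bytes.
process_encoding = locale.getpreferredencoding(False)
fs_encoding = sys.getfilesystemencoding()
print(process_encoding, fs_encoding)
```

Running it once as-is and once with LC_ALL set differently on the
command line should show the values following the process, not the
disc.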

Under Windows I believe filesystems have locales; this can prevent
you storing some files on some filesystems on Windows, because the
filesystem doesn't cope. UNIX just takes bytes.

| And likewise when a linux application wants to decode a filename is
| also checking the filesystem's 'locale' setting so to know what
| charset must use to decode the filename correctly back to the
| original string?

Again, NOT the filesystem's locale. The process' locale. The
filesystem filenames are just bytes.

| So locale is used for filesystem itself and linux apps to know how
| to read(decode) and write(encode) filenames from/into the system's
| hdd?

NOT THE FILESYSTEM LOCALE. There is no filesystem locale.

If you look at:

  http://docs.python.org/3/library/sys.html#sys.getfilesystemencoding

you'll see it does not talk about a property of the filesystem, but
the behaviour that will be used when storing filenames.
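
As a hedged illustration of "just bytes" from the Python 3 side,
assuming a POSIX system: os.fsdecode() and os.fsencode() apply that
filesystem encoding, and bytes that do not decode are carried through
as lone surrogates rather than lost. The sample name is my own:

```python
import os
import sys

# Bytes as they might sit on disc after an iso-8859-7 upload.
raw_name = 'του.mp3'.encode('iso-8859-7')

# Convert to str the way Python does for filenames, then back again.
as_text = os.fsdecode(raw_name)
back = os.fsencode(as_text)

# ascii() avoids trying to print characters the terminal may not handle.
print(sys.getfilesystemencoding(), ascii(as_text), back == raw_name)
```

The text form may look like nonsense, but the bytes survive the trip,
which is why a badly named file can still be opened and renamed.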

| >| c) WHAT happens when the two of them try to work together?
| >
| >If everything matches, it is all good. If the locales do not match,
| >the mismatch will result in an undesired bytes<->characters
| >encode/decode step somewhere, and something will display incorrectly
| >or be entered as input incorrectly.
| 
| Cant quite grasp the idea:
| 
| local end: Win8,  locale = greek-iso
| remote end: CentOS 6.4,  locale = utf-8

What makes you think the remote end is utf-8?
When you say "locale = utf-8", _exactly_ what does that mean to you?

| FileZilla by default uses "do not know what charset" to upload filenames

Then at a guess it uploaded the filenames as greek-iso byte sequences.
The filenames on disc will be greek-iso byte sequences.

| Putty by default uses greek-iso to display filenames

Then it will look ok, superficially, I would expect.

| WHAT someone can expect to happen when all of the above work together?
| Mess of course, but i want to hear in detail each step of the mess
| as it emerges.

There are several steps, for example:

  FileZilla will pass filenames to the remote end (FTP, SFTP, maybe)
  as bytes.  What those bytes will be will depend on FileZilla.
  The UNIX end probably accepts them as-is and uses them directly.
  So the filenames on disc would probably be greek-iso byte sequences.

  Running a /bin/ls ("ls" without the alias, with no special options)
  should present these byte sequences to the Terminal, which will
  decode them using its locale (greek-iso?)

  Running a "/bin/ls -b" (using the -b option from the ls alias)
  will "print octal escapes for nongraphic characters". So "ls"
  must decide what are nongraphic characters. It does this by
  decoding the filenames using the _remote_ locale (its own locale).
  So it will decode the greek-iso byte sequences as though they
  were utf-8.  Anything in the ASCII range (1-127, which will
  represent the same characters in utf-8, iso-8859-1 or iso-8859-7),
  the boring Roman alphabet range, will be treated the same. But
  outside that range the byte sequence will be taken to mean different
  characters depending on the locale.
  So "ls -b" will decide some of the greek-iso byte sequences do not
  represent printable characters, and will decide to print octal.

  Experiment:

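    LC_ALL=C ls -b
    LC_ALL=en_US.UTF-8 ls -b
    LC_ALL=el_GR.ISO-8859-7 ls -b

  (Locale names vary between systems, and a name the system does not
  recognise is not honoured. "locale -a" lists the ones installed, so
  substitute whichever UTF-8 and iso-8859-7 locales your host offers.)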

  And the Terminal itself is decoding the output for display, and
  encoding your input keystrokes to feed as input to the command
  line.
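
The "ls -b" step can be imitated in Python 3 to see the effect without
touching any files. This is only a rough sketch of the idea, not GNU
ls's actual algorithm (for one thing, real "ls -b" also escapes spaces
and uses short forms like \n); the function name escape_nongraphic is
hypothetical:

```python
def escape_nongraphic(raw, encoding):
    """Show raw filename bytes the way a tool using `encoding` might.

    Bytes that do not decode, and characters that are not printable,
    come out as backslash octal escapes.
    """
    # surrogateescape smuggles each undecodable byte 0xNN through as
    # the lone surrogate U+DCNN instead of raising.
    text = raw.decode(encoding, errors='surrogateescape')
    out = []
    for ch in text:
        if '\udc80' <= ch <= '\udcff':
            out.append('\\%03o' % (ord(ch) - 0xDC00))
        elif ch.isprintable():
            out.append(ch)
        else:
            # Decoded fine but not graphic: escape its original bytes.
            out.extend('\\%03o' % b for b in ch.encode(encoding))
    return ''.join(out)


greek_iso_name = 'του'.encode('iso-8859-7')
for enc in ('ascii', 'utf-8', 'iso-8859-7'):
    print(enc, escape_nongraphic(greek_iso_name, enc))
```

Only the iso-8859-7 pass shows Greek letters; the other two fall back
to octal for every byte, which is the pattern to expect from the
remote "ls -b" when the names on disc are greek-iso.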

You would be best setting your Windows box to UTF-8, matching how
you intend to work on the remote UNIX host. I do not know what
ramifications that may have for your local filesystems or text
files.

Cheers,
-- 
Cameron Simpson <cs at zip.com.au>

Humans are incapable of securely storing high quality cryptographic
keys and they have unacceptable speed and accuracy when performing
cryptographic operations.  (They are also large, expensive to maintain,
difficult to manage and they pollute the environment.) It is astonishing
that these devices continue to be manufactured and deployed. But they
are sufficiently pervasive that we must design our protocols around
their limitations.      - C Kaufman, R Perlman, M Speciner
                          _Network Security: PRIVATE Communication in a
                           PUBLIC World_, Prentice Hall, 1995, pp. 205.


