Changing filenames from Greeklish => Greek (subprocess complain)

Cameron Simpson cs at zip.com.au
Thu Jun 6 21:01:22 EDT 2013


On 06Jun2013 11:46, Νίκος Γκρ33κ <nikos.gr33k at gmail.com> wrote:
| On Thursday, 6 June 2013 3:44:52 PM UTC+3, user Steven D'Aprano wrote:
| > py> s = '999-Eυχή-του-Ιησού'
| > py> bytes_as_utf8 = s.encode('utf-8')
| > py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
| > py> print(t) 
| > 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| 
| errors='replace' means don't break in case of error?

Yes. Bytes that are valid iso-8859-7 decode normally; any byte that
has no mapping in that charset comes out as the replacement character
(U+FFFD) instead of raising an exception.
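
As a small sketch of the difference (my own illustration, not part of
Steven's session; the variable names are made up), here is the UTF-8
encoding of the letter "ή" decoded as iso-8859-7 both ways. Its second
byte, 0xAE, has no mapping in Python's iso-8859-7 codec, which is where
the "�" in the output above comes from:

```python
# UTF-8 bytes for GREEK SMALL LETTER ETA WITH TONOS ("ή").
data = '\u03ae'.encode('utf-8')          # b'\xce\xae'

# Default (strict) error handling: an unmappable byte raises.
try:
    data.decode('iso-8859-7')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace': the unmappable byte becomes U+FFFD and decoding goes on.
replaced = data.decode('iso-8859-7', errors='replace')

print(strict_failed)
print(ascii(replaced))
```

The first byte, 0xCE, is a perfectly good iso-8859-7 "Ξ", so only the
second character is mangled.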

| You took the unicode string 's' and encoded it to a utf-8 bytestring.
| Then how is it possible to ask for the utf-8 bytestring to decode
| back to a unicode string using a different charset than the
| one used for encoding, and this actually printed the filename in
| greek-iso?

It is easily possible, as shown above. Does it make sense? Normally
not, but Steven is demonstrating how your "mv" exercises have
behaved: a rename using utf-8, then a _display_ using iso-8859-7.

| > So that demonstrates part of your problem: even though your Linux system  
| > is using UTF-8, your terminal is probably set to ISO-8859-7. The  
| > interaction between these will lead to strange and disturbing Unicode 
| > errors.
| 
| Yes, I feel this is the problem too.
| It's a wonder to me why putty used greek-iso by default instead of utf-8 !!

Putty will get its terminal setting from the system you came from.
I suppose Windows of some kind. If you look at Putty's settings you
may be able to specify UTF-8 explicitly; not sure. If you can, do
that. At least there will be one less layer of confusion to debug.

| Please explain this to me, because now that I begin to understand
| these encode/decode things I begin to like them!
| 
| a) WHAT does it mean when a linux system is set to use utf-8?

It means the locale settings _for the current process_ are set for
UTF-8. The "locale" command will show you the current state. There
will also be some system settings with defaults for stuff started
up by the system. On CentOS and RedHat that is probably the file:

  /etc/sysconfig/i18n

_However_, when you ssh in to the system using Putty or another ssh
client, the settings at your local end are passed to the remote ssh
session. In this way different people using different locales can
ssh in and get the locales they expect to use.

Of course, if the locale settings differ and these people are working
on the same files and text, madness will ensue.
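
If you would rather check from inside Python than with the "locale"
command, something like the following sketch shows what the current
process thinks. The function name locale_report is just something I
made up for this example; the calls it uses are from the standard
library:

```python
import locale
import os
import sys


def locale_report():
    """Collect the encoding-related settings visible to this process."""
    return {
        # Environment variables that select the locale, in priority order.
        'env': {k: os.environ.get(k) for k in ('LC_ALL', 'LC_CTYPE', 'LANG')},
        # Encoding Python infers from the locale settings.
        'preferred': locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal (None if not a text stream).
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding used to convert between str and bytes filenames.
        'filesystem': sys.getfilesystemencoding(),
    }


for key, value in locale_report().items():
    print(key, value)
```

Run it once from an ssh session and once from a CGI script and compare;
the two processes may well have different settings.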

| b) WHAT does it mean when a terminal client is set to use utf-8?

It means the _display_ end of the terminal will render characters
using UTF-8. Data comes from the remote system as a sequence of
bytes. The terminal receives these bytes and _decodes_ them using
utf-8 (or whatever) in order to decide what characters to display.

| c) WHAT happens when the two of them try to work together?

If everything matches, it is all good. If the locales do not match,
the mismatch will result in an undesired bytes<->characters
encode/decode step somewhere, and something will display incorrectly
or be entered as input incorrectly.
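
A rough sketch of both directions, assuming a system that works in
UTF-8 and a terminal set to iso-8859-7 (the variable names are only
for illustration):

```python
text = '\u0395\u03c5\u03c7\u03ae'        # the Greek word "Ευχή"

# Output direction: the system writes UTF-8 bytes, the terminal decodes
# them as iso-8859-7 and shows different characters.
shown = text.encode('utf-8').decode('iso-8859-7', errors='replace')

# Input direction: the terminal sends iso-8859-7 bytes for what you type,
# and a program expecting UTF-8 cannot decode them.
typed = text.encode('iso-8859-7')
try:
    typed.decode('utf-8')
    input_ok = True
except UnicodeDecodeError:
    input_ok = False

print(ascii(shown))
print(typed, input_ok)
```

In the output direction you get mangled text on screen; in the input
direction the program receives bytes that are not valid UTF-8 at all.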

| > So I believe I understand how your file name has become garbage. To fix 
| > it, make sure that your terminal is set to use UTF-8, and then rename it. 
| > Do the same with every file in the directory until the problem goes away.
| 
| nikos at superhost.gr [~/www/cgi-bin]# echo $LS_OPTIONS
| --color=tty -F -a -b -T 0
| 
| Is this okay? Is the '-b' option for displaying a filename in binary mode?

Not quite. In GNU ls, -b (--escape) prints C-style escapes for
nongraphic characters. "man ls" will tell you.

Personally, I "unalias ls" on RedHat systems (and any other system
where an alias has been set up). I want ls to do what I say, not
what someone else thought was a good idea.

| Indeed I have changed putty to use 'utf-8' and 'ls -l' now displays
| the file in correct Greek letters. Switching putty's encoding back
| to 'greek-iso', the *displayed* filenames show as mojibake.

Exactly so.

| WHAT is being displayed and what is actually stored as bytes are two different things, right?

Yes. Display requires the byte stream to be decoded. The wrong decoding
displays the wrong characters/glyphs.

| Ευχη του Ιησου.mp3
| EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ
| 
| is the way the filename is displayed in the terminal depending
| on the encoding the terminal uses, correct? But no matter *how* it's
| being displayed, those two are the same file?

In principle, yes. Nothing has changed on the filesystem itself.
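
You can see this for yourself by asking Python for the raw bytes of a
filename. The following is a sketch under the assumption of a Linux
filesystem, which stores names as plain byte strings; the temporary
directory and the names raw_name, stored and views are only for
illustration:

```python
import os
import tempfile

# The UTF-8 bytes of the filename "Ευχή.mp3".
raw_name = '\u0395\u03c5\u03c7\u03ae.mp3'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create an empty file whose name is exactly those bytes.
    with open(os.path.join(tmp_bytes, raw_name), 'wb'):
        pass
    # Passing a bytes path makes os.listdir return undecoded bytes names.
    stored = os.listdir(tmp_bytes)

# One stored name, two ways of decoding it for display.
views = {cs: stored[0].decode(cs, errors='replace')
         for cs in ('utf-8', 'iso-8859-7')}

print(stored)
for charset, view in views.items():
    print(charset, ascii(view))
```

The stored bytes are the same whichever charset you pick for display;
only the decoded view differs.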

Cheers,
-- 
Cameron Simpson <cs at zip.com.au>

in rec.moto, jsh wrote:
> Dan Nitschke wrote:
> > Ged Martin wrote:
> > > On Sat, 17 May 1997 16:53:33 +0000, Dan Nitschke scribbled:
> > > >(And you stay *out* of my dreams, you deviant little
> > > >weirdo.)
> > > Yeah, yeah, that's what you're saying in _public_....
> > Feh. You know nothing of my dreams. I dream entirely in text (New Century
> > Schoolbook bold oblique 14 point), and never in color. I once dreamed I
> > was walking down a flowchart of my own code, and a waterfall of semicolons
> > was chasing me. (I hid behind a global variable until they went by.)
> You write code in a proportional serif? No wonder you got extra
> semicolons falling all over the place.
No, I *dream* about writing code in a proportional serif font.
It's much more exciting than my real life.
/* dan: THE Anti-Ged -- Ignorant Yank (tm) #1, none-%er #7 */
Dan Nitschke  peDANtic at best.com  nitschke at redbrick.com


