Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Jun 7 11:33:31 EDT 2013


On Fri, 07 Jun 2013 04:53:42 -0700, Νικόλαος Κούρας wrote:

> Do you mean that utf-8, latin-iso, greek-iso and ASCII have the 1st
> 0-127 codepoints similar?

You can answer this yourself. Open a terminal window and start a Python 
interactive session. Then try it and see what happens:


s = ''.join(chr(i) for i in range(128))
bytes_as_utf8 = s.encode('utf-8')
bytes_as_latin1 = s.encode('latin-1')
bytes_as_greek_iso = s.encode('ISO-8859-7')
bytes_as_ascii = s.encode('ascii')

bytes_as_utf8 == bytes_as_latin1 == bytes_as_greek_iso == bytes_as_ascii


What result do you get? True or False?

And now you know the answer, without having to ask.


> For example char 'a' has the value of '65' for all of those character
> sets? Is hat what you mean?

You can answer that question yourself.

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(c.encode(encoding))


By the way, I believe that Python has made a strategic mistake in the way 
that bytes are printed. I think it leads to more confusion, not less. 
Better would be something like this:

c = 'a'
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    print(hex(c.encode(encoding)[0]))
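
A similar view is available through the hex() method of bytes objects. This is only a sketch, and it assumes Python 3.5 or later, where that method was added; show_hex is a made-up helper name, not anything built in:

```python
# Assumes Python 3.5+, where bytes.hex() is available.
# show_hex is a hypothetical helper name used only for this illustration.
def show_hex(text, encoding):
    """Return the encoded bytes of text as a string of hex digits."""
    return text.encode(encoding).hex()

for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    # Print the encoding name next to the hex digits of the encoded 'a'.
    print(encoding, show_hex('a', encoding))
```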


For historical reasons, most (but not all) charsets are supersets of 
ASCII. That is, the first 128 characters in the charset are the same as 
the 128 characters in ASCII.
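
To see the "but not all" part for yourself, something like the following sketch should do. The helper name is_ascii_superset is invented for this example, and cp500 (an EBCDIC codepage) and UTF-16 are just two convenient encodings that are not ASCII-compatible:

```python
# The 128 ASCII characters, and their ASCII bytes, as the reference.
ascii_text = ''.join(chr(i) for i in range(128))
ascii_bytes = ascii_text.encode('ascii')

# is_ascii_superset is a hypothetical helper name for this illustration.
def is_ascii_superset(encoding):
    """True if the encoding turns the 128 ASCII characters into the same bytes as ASCII."""
    try:
        return ascii_text.encode(encoding) == ascii_bytes
    except UnicodeEncodeError:
        # The encoding cannot even represent some ASCII character.
        return False

for encoding in ('latin-1', 'ISO-8859-7', 'utf-8', 'cp500', 'utf-16-be'):
    print(encoding, is_ascii_superset(encoding))
```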


> s = 'a'  (This is unicode right?  Why when we assign a string to a
> variable that string's type is always unicode 

Strings in Python 3 are Unicode strings. That's just the way Python 
works. Unicode was chosen because Unicode includes over a million 
different characters (well, potentially over a million, most of them are 
currently unused), and is a strict superset of *all* common legacy 
codepages from the old DOS and Windows 95 days.


> and does not automatically
> become utf-8 which includes all available world-wide characters? Unicode
> is something different that a character set? )

Unicode is a character set. It is an enormous set of over one million 
characters (technically "code points", but don't worry about the 
difference right now) which can be collected in strings.

UTF-8 is an encoding that goes from a string using the Unicode character 
set into bytes, and back again. Sometimes, people are lazy and say 
"UTF-8" when they mean "Unicode", or vice versa.

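UTF-16 and UTF-32 are two different encodings for the same purpose, but 
UTF-8 is usually better for files: it is compatible with ASCII, it has no 
big-endian and little-endian variants to mix up, and it is compact for 
text that is mostly ASCII.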
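
Here is a rough sketch of a few of the usual technical reasons, using a short ASCII-only string. The variable names are only for illustration:

```python
text = 'abc'

# Number of bytes needed for the same three characters in each encoding.
# The -be variants are used so that no byte order mark is added.
sizes = {enc: len(text.encode(enc)) for enc in ('utf-8', 'utf-16-be', 'utf-32-be')}
print(sizes)

# For ASCII-only text, the UTF-8 bytes are identical to the ASCII bytes.
same_as_ascii = text.encode('utf-8') == text.encode('ascii')
print(same_as_ascii)

# UTF-16 comes in two byte orders, which give different bytes for the same text.
byte_order_matters = text.encode('utf-16-be') != text.encode('utf-16-le')
print(byte_order_matters)
```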

'λ' is a character which exists in some charsets but not others. It is 
not in the ASCII charset, nor is it in Latin-1, nor ISO-8859-5. It is in 
the ISO-8859-7 charset, and of course it is in Unicode.
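
One way to check whether a charset contains a character is to try encoding it and see whether Python complains. A minimal sketch, where can_encode is a name made up for this example:

```python
# can_encode is a hypothetical helper name for this illustration.
def can_encode(char, encoding):
    """True if char can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for encoding in ('ascii', 'latin-1', 'ISO-8859-7', 'utf-8'):
    print(encoding, can_encode('λ', encoding))
```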

In ISO-8859-7, the character 'λ' is stored as byte 0xEB (decimal 235), 
just as the character 'a' is stored as byte 0x61 (decimal 97).

In UTF-8, the character λ is stored as two bytes 0xCE 0xBB.

In UTF-16 (big-endian), the character λ is stored as two bytes 0x03 0xBB.

In UTF-32 (big-endian), the character λ is stored as four bytes 0x00 0x00 
0x03 0xBB.

That's four different ways of "spelling" the same character as bytes, 
just as "three", "trois", "drei", "τρία", "três" are all different ways 
of spelling the same number 3.


> utf8_byte = s.encode('utf-8')
> 
> Now if we are to decode this back to utf8 we will receive the char 'a'.
> I beleive same thing will happen with latin, greek, ascii isos. Correct?

Why don't you try it for yourself and see?
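
For instance, a round trip along these lines would do. It is only a sketch, and the results dictionary is just a convenient place to collect the answers:

```python
s = 'a'
results = {}
for encoding in ('utf-8', 'latin-1', 'ISO-8859-7', 'ascii'):
    data = s.encode(encoding)         # str -> bytes
    roundtrip = data.decode(encoding)  # bytes -> str, using the same encoding
    results[encoding] = (roundtrip == s)
    print(encoding, data, results[encoding])
```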



> The characters that will not decode correctly are those that their
> codepoints are greater that > 127 ?

Maybe, maybe not. It depends on which codepoint, and which encodings. 
Some encodings use the same bytes for the same characters. Some encodings 
use different bytes. It all depends on the encoding, just like American 
and English both spell 3 "three", while French spells it "trois".
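
As a small illustration, take 'é', which is above 127. This sketch assumes nothing beyond the standard codecs; cp1252 is the Windows Western European codepage:

```python
c = 'é'

# Latin-1 and cp1252 happen to use the same single byte for this character.
latin1 = c.encode('latin-1')
cp1252 = c.encode('cp1252')

# UTF-8 uses two different bytes for the same character.
utf8 = c.encode('utf-8')
print(latin1, cp1252, utf8)

# Decoding with a different encoding from the one used to encode
# does not raise an error here, it just gives the wrong characters.
wrong = utf8.decode('latin-1')
print(wrong)
```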


> for example if s = 'α' (greek character equivalent to english 'a')

In Latin-1, 'α' does not exist:

py> 'α'.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in position 0: ordinal not in range(256)


In the old Greek charset, ISO-8859-7, 'α' is stored as byte 0xE1:

py> 'α'.encode('ISO-8859-7')
b'\xe1'


But in the old *Russian* (Cyrillic) charset, ISO-8859-5, the byte 0xE1 
means a completely different character, CYRILLIC SMALL LETTER ES:

py> b'\xE1'.decode('ISO-8859-5')
'с'

(Don't be fooled because this looks like the English c; it is not the 
same character.)


In Unicode, 'α' is always codepoint 0x3B1 (decimal 945):

py> ord('α')
945

but before you can store that on a disk, or as a file name, it needs to 
be converted to bytes, and which bytes you get depends on which encoding 
you use:

py> 'α'.encode('utf-8')
b'\xce\xb1'

py> 'α'.encode('utf-16be')
b'\x03\xb1'

py> 'α'.encode('utf-32be')
b'\x00\x00\x03\xb1'


-- 
Steven


