Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Jun 8 15:01:57 EDT 2013


On Sat, 08 Jun 2013 21:01:23 +0300, Νικόλαος Κούρας wrote:

> In the beginning there was ASCII with 0-127 values 

No, there were encoding systems before ASCII, such as EBCDIC. But we 
can ignore those, and just start with ASCII.


> and then there was
> Unicode with 0-127 of ASCII's + i dont know how much many more?

No, you have missed the utter chaos of dozens and dozens of Windows 
codepages and charsets. We still have to live with the pain of that.

But now we have Unicode, with 1,114,112 code points, numbered from 0 to 
0x10FFFF (decimal 1,114,111). You can consider a code point to be the 
same as a character, at least for now.
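
If you are curious, Python itself will tell you the highest code point. 
(A sketch at the interactive prompt, assuming Python 3.3 or later, 
where every build is "wide".)

>>> import sys
>>> sys.maxunicode  # the highest code point
1114111
>>> hex(sys.maxunicode)
'0x10ffff'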


> Now ASCIII needs 1 byte to store a single character 

ASCII actually needs 7 bits to store a character. Since computers are 
optimized to work with bytes, not bits, normally ASCII characters are 
stored in a single byte, with one bit wasted.


> while Unicode needs 2 bytes to store a character 

No. Since there are 1,114,112 different Unicode "characters" (really 
code points, but ignore the difference), two bytes is not enough. 
Unicode needs 21 bits to store a character. Since that is more than 2 
bytes, but less than 3, there are a few different ways for Unicode to 
be stored in memory, including:

"Wide" Unicode uses four bytes per character. Why four instead of three? 
Because computers are more efficient when working with chunks of memory 
that is a multiple of four.

"Narrow" Unicode uses two bytes per character. Since two bytes is only 
enough for about 65,000 characters, not 1,000,000+, the rest of the 
characters are stored as pairs of two-byte "surrogates".
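
You can make the difference concrete at the interactive prompt. (A 
minimal sketch, assuming Python 3; 'utf-32-le' and 'utf-16-le' are the 
little-endian forms of wide and narrow storage, without a byte order 
mark.)

>>> len('A'.encode('utf-32-le')), len('λ'.encode('utf-32-le'))
(4, 4)
>>> 'λ'.encode('utf-16-le')  # narrow: two bytes is enough for λ
b'\xbb\x03'
>>> '𐍈'.encode('utf-16-le')  # beyond 65,535: a four-byte surrogate pair
b'\x00\xd8H\xdf'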



> and that is because it has > 256 characters
> to store > 2^8bits ?

Correct.



> Now UTF-8, latin-iso, greek-iso e.t.c are WAYS of storing characters
> into the hard drive?

Your computer cannot carve a tiny little "A" into the hard drive when it 
stores that letter in a file. It has to write some bytes. So you need to 
know:

- what byte, or bytes, represents the letter "A"?

- what byte, or bytes, represents the letter "B"?

- what byte, or bytes, represents the letter "λ"?

and so on. This set of rules, "byte XXXX means letter YYYY", is called an 
encoding. If you don't know what encoding to use, you cannot tell what 
the byte means.
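
For example, in Python 3 you can ask for the bytes that represent a 
character under various encodings. (A sketch; the encodings chosen 
here are just illustrations.)

>>> 'A'.encode('ascii')
b'A'
>>> 'λ'.encode('iso-8859-7')  # Greek ISO: one byte
b'\xeb'
>>> 'λ'.encode('utf-8')  # UTF-8: two bytes
b'\xce\xbb'
>>> 'λ'.encode('ascii')  # ASCII has no rule for λ
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u03bb' in position 0: ordinal not in range(128)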

 
> Because in some post i have read that 'UTF-8 encoding of Unicode'. Can
> you please explain to me whats the difference of ASCII-Unicode
> themselves aand then of them compared to 'Charsets' . I'm still confused
> about this.

A charset is an ordered set of characters. For example, ASCII has 128 
characters, at positions 0 through 127, starting with NUL:

NUL ... A B C D E ... Z [ \ ] ^ ... a b c ... z ... 


where NUL is at position 0, 'A' is at position 65, 'B' at position 66, 
and so on.

Latin-1 is similar, except there are 256 positions. Greek ISO-8859-7 is 
also similar, also 256 positions, but the characters are different. And 
so on, with dozens of charsets.
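
You can see two charsets disagree by decoding the same byte under 
each. (A sketch at the Python 3 prompt: byte 0xE1 is 'á' in Latin-1 
but 'α' in Greek ISO-8859-7.)

>>> b'\xe1'.decode('latin-1')
'á'
>>> b'\xe1'.decode('iso-8859-7')
'α'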

And then there is Unicode, which includes *every* character in all of 
those dozens of charsets. It has 1,114,112 positions (most are 
currently unfilled).


An encoding is simply a set of rules, usually implemented as a program, 
that takes a character and returns a byte or bytes, or vice versa. For 
instance, character 'A' is found at position 65, which is 0x41 in 
hexadecimal, so the ASCII encoding turns character 'A' into byte 0x41, 
and vice versa.
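
In Python 3 that round trip looks like this. (A minimal sketch; ord() 
gives the position of a character.)

>>> ord('A'), hex(ord('A'))
(65, '0x41')
>>> 'A'.encode('ascii')  # character -> byte 0x41
b'A'
>>> b'\x41'.decode('ascii')  # byte 0x41 -> character
'A'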


> Is it like we said in C++:
> ' int a',     a variable with name 'a' of type integer. 'char a',   a
> variable with name 'a' of type char
> 
> So taken form above example(the closest i could think of), the way i
> understand them is:
> 
> A 'string' can be of (unicode's or ascii's) type and that type needs a
> way (thats a charset) to store this string into the hdd as a sequense of
> bytes?


Correct.



-- 
Steven


