Changing filenames from Greeklish => Greek (subprocess complain)

Larry Hudson orgnut at yahoo.com
Tue Jun 11 03:20:19 EDT 2013


On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:
> On Monday, 10 June 2013 10:51:34 AM UTC+3, Larry Hudson wrote:
>
>>> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>
>> 0 - 127, yes.
>> 128 - 255 -> one byte of a multibyte code.
>
> you mean that in utf-8 for 1 character to be stored, we need 2 bytes?
> I'm still having trouble understanding this.
>
UTF-8 characters are encoded in different sizes, NOT a single fixed number of bytes.
The high _bits_ of the first byte define the number of bytes in that character's code.

(I'm copying this from Wikipedia...)
0xxxxxxx -> 1 byte
110xxxxx -> 2 bytes
1110xxxx -> 3 bytes
11110xxx -> 4 bytes
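111110xx -> 5 bytes (original design only; no longer valid since RFC 3629)
1111110x -> 6 bytes (original design only; no longer valid since RFC 3629)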

Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for 
the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.
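
If it helps to see the rule in action, here is a rough sketch, assuming Python 3 (where
indexing a bytes object gives an int). The function name utf8_sequence_length is just
something I made up for illustration, and it only accepts the 1- to 4-byte forms that
current UTF-8 (RFC 3629) allows:

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if first_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; anything else is not a valid first byte.
    raise ValueError("not a valid first byte: {:08b}".format(first_byte))


# Latin "A", Greek alpha, the euro sign, and a character outside the BMP.
for ch in ["A", "\u03b1", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), utf8_sequence_length(encoded[0]))
```

For each character the two numbers printed should agree: the length Python reports for
the encoded bytes and the length predicted from the first byte alone.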

> Since 2^8 = 256, utf-8 would need 1 byte to store the 1st 256 characters but instead it's using 1 byte up to the first 127 value and then 2 bytes for anything above.  Why?
>
As I indicated above, one bit is reserved as a flag to indicate that the code is a one-byte code
and not a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
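
You can check this yourself at the boundary between 127 and 128. This is only a small
sketch, assuming Python 3; utf8_bits is a helper name I invented for the example:

```python
def utf8_bits(code_point):
    """Return the UTF-8 bytes of a code point as 8-bit binary strings."""
    return ["{:08b}".format(b) for b in chr(code_point).encode("utf-8")]


for cp in (65, 127, 128, 255):
    print(cp, utf8_bits(cp))

# 65  ['01000001']
# 127 ['01111111']
# 128 ['11000010', '10000000']
# 255 ['11000011', '10111111']
```

Up to 127 the single byte starts with a 0 bit. From 128 on, the first byte starts with
110 (announcing a 2-byte code) and the second byte starts with 10, so neither byte can
be mistaken for a 1-byte ASCII code.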



