A few questions about encoding

Νικόλαος Κούρας support at superhost.gr
Thu Jun 13 02:21:28 EDT 2013


On 12/6/2013 11:30 PM, Nobody wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 use to store code points > 127?
>
> U+0000..U+007F  1 byte
> U+0080..U+07FF  2 bytes
> U+0800..U+FFFF  3 bytes
> >=U+10000       4 bytes

'U' stands for a Unicode code point, which means a character, right?

How can you tell up to which character UTF-8 needs 1 byte, 2 bytes,
or 3?
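
One way to check the ranges quoted above from the interpreter is to encode the code points on each side of every boundary and count the bytes. This is only a sketch, assuming Python 3; the helper name utf8_length is made up for illustration.

```python
# Code points on each side of the boundaries in the quoted table.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

for cp in boundaries:
    print("U+%04X -> %d byte(s)" % (cp, utf8_length(cp)))
```

The byte count changes exactly at U+0080, U+0800 and U+10000, which matches the table.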


And some of the bytes' bits are used to tell where a code point's
representation stops, right? I mean, if we have a code point that needs
2 bytes to be stored, then the high bit must be set to 1 to signify that
this character's encoding stops at 2 bytes.

I just know that 2^8 = 256, so at first glance a byte has 256 places,
which means 256 positions to hold a code point, which in turn means a
character.

We take the high bit out and then we have 2^7 = 128, which is enough
positions for 0-127 standard ASCII. The high bit is set to '0' to
signify that the char is encoded in 1 byte.
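
The 1-byte case can be looked at directly by printing the bits of a few ASCII characters after encoding. A small sketch, assuming Python 3; the helper name bits is only for illustration.

```python
def bits(data):
    """Format each byte of a bytes object as 8 binary digits."""
    return " ".join(format(b, "08b") for b in data)

for ch in ("A", "z", "~"):
    print(repr(ch), bits(ch.encode("utf-8")))

# Every code point 0-127 encodes to a single byte below 0x80,
# that is, a byte whose high bit is 0.
all_ascii = "".join(chr(cp) for cp in range(128))
print(all(b < 0x80 for b in all_ascii.encode("utf-8")))
```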

Please tell me that I have understood correctly so far.

But how about for 2 or 3 or 4 bytes?
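
For the longer cases, printing the bits of a few multi-byte characters shows the pattern. In UTF-8 the first byte starts with 110, 1110 or 11110 for 2, 3 or 4 bytes, and every following byte starts with 10. The sketch below assumes Python 3; the sample characters and the helper name bits are chosen only for illustration.

```python
def bits(data):
    """Format each byte of a bytes object as 8 binary digits."""
    return " ".join(format(b, "08b") for b in data)

samples = [
    "\u03b1",      # GREEK SMALL LETTER ALPHA, 2 bytes
    "\u20ac",      # EURO SIGN, 3 bytes
    "\U0001F600",  # a code point above U+FFFF, 4 bytes
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X  %d bytes  %s" % (ord(ch), len(encoded), bits(encoded)))
```

So the number of leading 1 bits in the first byte gives the length of the sequence, and the remaining bits carry the code point itself.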

Am I saying it correctly?
