A few questions about encoding

Nick the Gr33k support at superhost.gr
Fri Jun 14 01:34:37 EDT 2013


On 14/6/2013 1:46 AM, Dennis Lee Bieber wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Nick the Gr33k
> <support at superhost.gr> declaimed the following:
>
>>>> (*) in fact UTF-8 also indicates the end of each character
>>
>>> Up to a point.  The initial byte encodes the length and the top few
>>> bits, but the subsequent octets aren’t distinguishable as final in
>>> isolation.  0x80-0xBF can all be either medial or final.
>>
>>
>> So, the first high bits are a directive that UTF-8 uses to know how many
>> bytes each character is represented as.
>>
>> 0-127 codepoints (characters) use 1 bit to signify they need 1 byte for
>> storage and the rest 7 bits to actually store the character?
>>
> 	Not quite... The leading bit is a 0 -> which means 0..127 are sent
> as-is, no manipulation.

So, in UTF-8, the leading bit being a zero is actually a flag telling us 
that the code-point needs only 1 byte of storage, and the remaining 7 bits 
hold the actual value of code-points 0-127?
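
For what it's worth, a quick check in Python 3 seems to confirm this 
(just a sketch, with a couple of arbitrary sample characters):

# Code points 0-127 encode to a single UTF-8 byte whose value
# is the code point itself (leading bit 0).
for ch in ("A", "z", "\x7f"):
    data = ch.encode("utf-8")
    print(ch, ord(ch), list(data))        # e.g. A 65 [65]
    assert len(data) == 1 and data[0] == ord(ch)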

>> 128-255 codepoints (characters) use 2 bits to signify they need 2 bytes for
>> storage and the rest 14 bits to actually store the character?
>>
> 	128..255 -- in what encoding? These all have the leading bit with a
> value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
> inherent in the specified encoding and they are sent as-is.

So, in Latin-ISO or Greek-ISO the leading bit is not a flag like it is in 
UTF-8, because Latin-ISO, Greek-ISO and all the other *-ISO encodings use 
all 8 bits for storage?
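
If I understand it right, a small Python 3 sketch illustrates this, using 
the Greek capital omega as an arbitrary example; the whole byte simply is 
the character value, with no flag bits reserved:

# In an 8-bit ISO encoding the single byte *is* the character value.
omega = "\u03a9"                       # Greek capital omega
raw = omega.encode("iso-8859-7")
print(len(raw), raw[0])                # 1 217 -- one byte, value above 127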

But, in UTF-8, the leading bit being a 1 is there to tell that the 
code-point needs 2 bytes to be stored, with the remaining 7 bits holding 
the actual value of code-points 128-255?

But why 2 bytes? The leading 1 is a flag, and the remaining 7 bits could 
hold the encoded value.

But that is not the case, since we know that UTF-8 needs 2 bytes to store 
code-points 128-255.
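
A quick Python 3 check (just a sketch, with 'é', code point 233, picked 
arbitrarily) shows the two bytes and their leading bits:

# Code point 233 needs two UTF-8 bytes: 110xxxxx 10xxxxxx.
ch = chr(233)                                  # 'é'
data = ch.encode("utf-8")
print([format(b, "08b") for b in data])        # ['11000011', '10101001']
print(list(ch.encode("latin-1")))              # [233] -- one raw byte in an 8-bit encoding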


> 	1110 starts a three byte sequence, 11110 starts a four byte sequence...
> Basically, count the number of leading 1-bits before a 0 bit, and that
> tells you how many bytes are in the multi-byte sequence -- and all bytes
> that start with 10 are supposed to be the continuations of a multibyte set
> (and not a signal that this is a 1-byte entry -- those only have a leading
> 0)
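
If I follow that rule, it could be sketched in Python 3 roughly like this 
(the function name is mine, and it is only an illustration of the counting 
rule, not a full validator -- it does not reject a stray continuation byte):

def utf8_sequence_length(first_byte):
    """Count the leading 1-bits of the lead byte; none means a 1-byte (ASCII) character."""
    if first_byte < 0x80:             # leading bit 0 -> single byte
        return 1
    count = 0
    mask = 0x80
    while first_byte & mask:          # count the leading 1-bits
        count += 1
        mask >>= 1
    return count                      # 2, 3 or 4 for a valid lead byte

print(utf8_sequence_length("\u20ac".encode("utf-8")[0]))   # 3 -- the euro sign is a 3-byte sequence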

Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?


> Original UTF-8 allowed for 31-bits to specify a character in the Unicode
> set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
> the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
> continuation.

UTF-8's 6-byte form = 48 bits - 7 bits (from the first byte) - 2 bits (for 
each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual 
code-point. But 2^31 is still a huge number, enough to store any kind of 
character, isn't it?
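
Checking the arithmetic in Python, with the numbers from the paragraph above:

# 6-byte form: 48 bits total, minus 7 flag bits in the lead byte
# and 2 flag bits in each of the 5 continuation bytes.
payload_bits = 6 * 8 - 7 - 2 * 5
print(payload_bits)         # 31
print(2 ** payload_bits)    # 2147483648 possible values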





-- 
What is now proved was at first only imagined!


