A few questions about encoding

Nick the Gr33k support at superhost.gr
Fri Jun 14 03:28:32 EDT 2013


On 14/6/2013 9:00 am, Zero Piraeus wrote:
> :
>
> On 14 June 2013 01:34, Nick the Gr33k <support at superhost.gr> wrote:
>> Why doesn't it work like this?
>>
>> leading 0 = 1 byte flag
>> leading 1 = 2 bytes flag
>> leading 00 = 3 bytes flag
>> leading 01 = 4 bytes flag
>> leading 10 = 5 bytes flag
>> leading 11 = 6 bytes flag
>>
>> Wouldn't it be more logical?
>
> Think about it. Let's say that, as per your scheme, a leading 0
> indicates "1 byte" (as is indeed the case in UTF8). What things could
> follow that leading 0? How does that impact your choice of a leading
> 00 or 01 for other numbers of bytes?
>
> ... okay, you're obviously going to need to be spoon-fed a little more
> than that. Here's a byte:
>
>    01010101
>
> Is that a single byte representing a code point in the 0-127 range, or
> the first of 4 bytes representing something else, in your proposed
> scheme? How can you tell?

Indeed.

You cannot tell whether it stands for a 1-byte or a 4-byte sequence:

0 + 1010101 = leading 0 stands for a 1-byte representation of a code point

01 + 010101 = leading 01 stands for a 4-byte representation of a code point

The problem with my scheme is that you cannot tell whether the flag is
'0' or '01'.

The same happens with leading '1' and '11': you cannot tell what the flag
is, so you cannot know whether the Unicode code point is represented as a
2-byte sequence or a 6-byte sequence.

Understood
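
The ambiguity can be checked mechanically. The following is only a minimal
sketch in Python 3: the names PROPOSED, UTF8, matching_flags and
is_prefix_free are made up for this illustration and come from no library.
It lists every flag a byte could start with under each scheme, and tests
whether any flag is a prefix of another one.

```python
# Flags from the proposed scheme, mapped to the sequence length they announce.
PROPOSED = {"0": 1, "1": 2, "00": 3, "01": 4, "10": 5, "11": 6}

# Flags that UTF-8 uses for sequences of 1 to 4 bytes, plus the flag "10"
# that marks a continuation byte (hypothetical table for this sketch).
UTF8 = {"0": 1, "110": 2, "1110": 3, "11110": 4, "10": "continuation"}


def matching_flags(bits, flags):
    """Return every flag that the 8-bit string `bits` starts with."""
    return sorted(flag for flag in flags if bits.startswith(flag))


def is_prefix_free(flags):
    """Return True if no flag is a prefix of a different flag."""
    return not any(a != b and b.startswith(a) for a in flags for b in flags)


byte = format(0b01010101, "08b")
print(matching_flags(byte, PROPOSED))  # ['0', '01'] -> two readings, ambiguous
print(matching_flags(byte, UTF8))      # ['0'] -> exactly one reading
print(is_prefix_free(PROPOSED))        # False
print(is_prefix_free(UTF8))            # True
```

With the proposed flags the byte 01010101 matches both '0' and '01', while
with the UTF-8 flags it matches only '0'.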


> Now look at the way UTF8 does it:
> <http://en.wikipedia.org/wiki/Utf-8#Description>
>
> Really, follow the link and study the table carefully. Don't continue
> reading this until you believe you understand the choices that the
> designers of UTF8 made, and why they made them.
>
> Pay particular attention to the possible values for byte 1. Do you
> notice the difference between that scheme, and yours:
>
>    0xxxxxxx
>    1xxxxxxx
>    00xxxxxx
>    01xxxxxx
>    10xxxxxx
>    11xxxxxx
>
> If you don't see it, keep looking until you do ... this email gives
> you more than enough hints to work it out. Don't ask someone here to
> explain it to you. If you want to become competent, you must use your
> brain.

0xxxxxxx
110xxxxx	10xxxxxx
1110xxxx	10xxxxxx	10xxxxxx
11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

I did read the link, but I still cannot see:

1. why '110' is the flag for a 2-byte code point
2. why the 2nd byte and every subsequent byte has to start with the
   flag '10'
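
One way to explore both questions is to look at real encoded bytes. In the
table above, a lead byte starts with as many 1 bits as the sequence has
bytes, followed by a 0, and '10' appears only on continuation bytes, so '10'
is not available as a lead flag and the 2-byte flag becomes '110'. The
sketch below is a minimal illustration in Python 3, not a decoder; the
helper names bits, classify and next_char_start are made up here. It prints
the bit patterns produced by str.encode and shows that a reader dropped
into the middle of a character can find the start of the next one.

```python
def bits(data):
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in data)


def classify(byte):
    """Classify one byte by its leading bits, following the table above."""
    if byte >> 7 == 0b0:
        return "single"        # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"  # 10xxxxxx
    if byte >> 5 == 0b110:
        return "lead-2"        # 110xxxxx
    if byte >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx
    if byte >> 3 == 0b11110:
        return "lead-4"        # 11110xxx
    return "invalid"


def next_char_start(data, pos):
    """Skip continuation bytes from `pos` until a character start is found."""
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos


# "A", Greek small pi, the euro sign, and an emoji: 1, 2, 3 and 4 bytes.
for ch in ("A", "\u03c0", "\u20ac", "\U0001F600"):
    print(repr(ch), bits(ch.encode("utf-8")))

data = "A\u03c0\u20ac".encode("utf-8")
print([classify(b) for b in data])
# Position 2 is the second byte of pi; the next character starts at 3.
print(next_char_start(data, 2))  # 3
```

Because no lead byte begins with '10', a byte can never be read as both the
start of a character and the middle of one.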

-- 
What is now proved was at first only imagined!


