A few questions about encoding

Νικόλαος Κούρας support at superhost.gr
Thu Jun 13 03:42:40 EDT 2013


On 13/6/2013 10:11 AM, Steven D'Aprano wrote:

>>   >>> chr(16474)
>> '䁚'
>>
>> Some Chinese symbol.
>> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
>
> Correct.
>
>
>> where in after encoding this glyph's ordinal value to binary gives us
>> the following bytes:
>>
>>   >>> bin(16474).encode('utf-8')
>> b'0b100000001011010'

An observation here that I would like you to confirm as valid.

1. A code-point and the code-point's ordinal value are associated in the
Unicode character set. They have a so-called 1:1 mapping.

So, I was under the impression that encoding the code-point into
utf-8 was the same as encoding the code-point's ordinal value into utf-8.

That is why I tried:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')

So now I believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its 
ordinal value.
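
A minimal sketch of the difference, assuming Python 3 (the variable
names are only illustrative). It encodes the character itself and then
the text that bin() produces, so the two results can be compared:

```python
code_point = 16474

# Encoding the character itself gives its UTF-8 byte sequence (3 bytes).
char_bytes = chr(code_point).encode('utf-8')
print(char_bytes)                          # b'\xe4\x81\x9a'

# bin() produces text made of '0b' plus binary digits; encoding that
# text only yields the ASCII bytes of those characters.
text_bytes = bin(code_point).encode('utf-8')
print(text_bytes)                          # b'0b100000001011010'

# Decoding the 3 bytes and taking ord() recovers the ordinal value.
round_trip = ord(char_bytes.decode('utf-8'))
print(round_trip)                          # 16474
```

If this sketch is right, only the first result is the UTF-8 form of the
character; the second is the UTF-8 form of 17 ASCII characters.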


> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

But bytes objects are displayed using '\x' rather than the aforementioned
'0x'. Why is that?
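
A small sketch that may help frame this, assuming Python 3. As far as
the interpreter shows, '\x' is the escape used inside a bytes (or str)
literal for one byte, while '0x' is the prefix for an integer written
in base 16:

```python
data = chr(16474).encode('utf-8')

# repr() of a bytes object shows each non-printable byte as a \xNN escape.
print(repr(data))                  # b'\xe4\x81\x9a'

# Each element of a bytes object is an int between 0 and 255.
values = list(data)
print(values)                      # [228, 129, 154]

# hex() formats a single int as text with the '0x' prefix.
hex_texts = [hex(v) for v in values]
print(hex_texts)                   # ['0xe4', '0x81', '0x9a']

# Bytes that are printable ASCII are shown as the character instead.
printable = b'\x30\x62'
print(printable)                   # b'0b'
```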


 > No! That creates a string from 16474 in base two:
 > '0b100000001011010'

I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary
representation of the number 16474, not a string.
Why do you say we receive a string when Python presents a binary number?
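
One way to check this directly, sketched here for Python 3, is to ask
the interpreter for the type of what bin() returns and then convert it
back to a number:

```python
result = bin(16474)

# The quotes in the interactive output are the hint: bin() returns a str.
print(type(result))        # <class 'str'>
print(repr(result))        # '0b100000001011010'

# int() with base 2 parses that text back into the number.
number = int(result, 2)
print(number)              # 16474
```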


> Then you encode the string '0b100000001011010' into UTF-8. There are 17
> characters in this string, and they are all ASCII characters so they take
> up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).

To me, 0b100000001011010 stands for a number in base 2, not a string.
Have I misunderstood something?
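
A short sketch of the two readings side by side, assuming Python 3.
Without quotes the digits are an integer literal; with quotes they are
text, and only text has an encode() method:

```python
number = 0b100000001011010      # int literal written in base 2
text = '0b100000001011010'      # str of 17 characters

print(number == 16474)                               # True
print(type(number).__name__, type(text).__name__)    # int str

# Encoding the text gives one byte per ASCII character.
encoded = text.encode('utf-8')
print(len(text), len(encoded))                       # 17 17

# An int has no encode() method at all.
print(hasattr(number, 'encode'))                     # False
```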




