A few questions about encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jun 13 03:11:08 EDT 2013


On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:

> On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:

>> Open an interactive Python session, and run this code:
>>
>> c = ord(16474)
>> len(c.encode('utf-8'))
>>
>>
>> That will tell you how many bytes are used for that example.
> This is actually wrong.
> 
> ord()'s argument must be a character whose ordinal value we want.

Gah! 

That's twice I've screwed that up. Sorry about that!


>  >>> chr(16474)
> '䁚'
> 
> Some Chinese symbol.
> So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Correct.

 
> Then encoding this glyph's ordinal value to binary gives us
> the following bytes:
> 
>  >>> bin(16474).encode('utf-8')
> b'0b100000001011010'

No! That creates a string from 16474 in base two:

'0b100000001011010'

The leading 0b is just syntax to tell you "this is base 2, not base 8 
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

Then you encode the string '0b100000001011010' into UTF-8. There are 17 
characters in this string, and they are all ASCII characters, so they take 
up 1 byte each, giving you the bytes b'0b100000001011010' (in ASCII form). 
In hex form, they are:

b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

which takes up a lot more room, which is why Python prefers to show ASCII 
characters as characters rather than as hex.
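
If you want to check this yourself, here is a small sketch for the 
interactive interpreter (Python 3 assumed; the variable names are just 
for illustration). It shows that bin() returns ordinary text and that 
encoding that text gives one byte per character:

```python
n = 16474

s = bin(n)                 # a str made of ASCII characters, not raw bits
data = s.encode('utf-8')   # each ASCII character becomes exactly one byte

print(s)          # 0b100000001011010
print(len(s))     # 17 characters
print(data)       # b'0b100000001011010'
print(len(data))  # 17 bytes

# The hex-escaped form is the same bytes object, only written differently.
hex_form = b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'
print(data == hex_form)   # True
```
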

What you want is:

chr(16474).encode('utf-8')


[...]
> Thus, we count 15 bits left.
> So it says 15 bits, which is 1 bit less than 2 bytes. Are the above
> statements correct, please?

No. There are 17 BYTES there. The string "0" doesn't get turned into a 
single bit. It still takes up a full byte, 0x30, which is 8 bits.
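
A quick way to see the difference between the character "0" and a zero 
bit, again as a sketch for Python 3 with made-up variable names:

```python
zero_char = '0'.encode('utf-8')

print(zero_char)                    # b'0'
print(len(zero_char))               # 1 byte
print(hex(zero_char[0]))            # 0x30
print(format(zero_char[0], '08b'))  # 00110000, a full 8 bits

# The number 16474 itself needs 15 bits...
print((16474).bit_length())         # 15

# ...but its 15 binary digits written out as text take 15 bytes.
digits = format(16474, 'b')
print(len(digits.encode('utf-8')))  # 15
```
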


> but thinking this through more and more:
> 
>  >>> chr(16474).encode('utf-8')
> b'\xe4\x81\x9a'
>  >>> len(b'\xe4\x81\x9a')
> 3
> 
> it seems that the bytestring the encode process produces is of length 3.

Correct! Now you have got the right idea.
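
As for why it is 3 bytes and not 2: UTF-8 spends some bits of every byte 
on markers, so a code point in the range U+0800 to U+FFFF is spread over 
the pattern 1110xxxx 10xxxxxx 10xxxxxx. The sketch below packs 16474 
(U+405A) by hand and compares it with what encode() gives. The hand 
packing is only an illustration for this one range, not a general 
encoder, and the names are my own:

```python
cp = 16474   # code point U+405A

# Three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
b1 = 0b11100000 | (cp >> 12)            # marker 1110 + top 4 bits
b2 = 0b10000000 | ((cp >> 6) & 0x3F)    # marker 10 + middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)           # marker 10 + low 6 bits
manual = bytes([b1, b2, b3])

encoded = chr(cp).encode('utf-8')

print(encoded)             # b'\xe4\x81\x9a'
print(manual == encoded)   # True
print(len(encoded))        # 3
```
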




-- 
Steven


