[Edu-sig] round trip with unicode (the Latin-1 neighborhood)

kirby urner kirby.urner at gmail.com
Sun Mar 16 03:16:24 CET 2014


Round Trip in UTF-8:  Two Byte Encodings

Explanation:

Latin-1 includes all 128 characters of 7-bit ASCII
(code points 0-127).  Its upper half (code points
128-255) is where the 2-byte encodings in UTF-8 begin.
Below, the Latin-1 character at code point 200 is encoded
as bytes, and the bytes are then broken down into bits.

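When UTF-8 needs two bytes (the original design allowed
up to six; RFC 3629 now caps it at four), the leading byte
begins 110 to signify that, and the followup byte begins 10.
Stripping those markers from 11000011 and 10001000 leaves
payload bits 00011 and 001000.  Joined, the significant
bits are just 11001000, which Python shows is 200,
where we started.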
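
The marker-stripping step described above can be sketched in a few
lines of Python.  This is only an illustration: decode_two_byte is a
made-up helper name, not something from the standard library, and it
does not reject overlong forms the way a real decoder would.

```python
def decode_two_byte(raw: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence (sketch)."""
    lead, cont = raw  # unpacking fails unless there are exactly two bytes
    # Check the marker bits: 110xxxxx for the lead, 10xxxxxx for the followup.
    if lead >> 5 != 0b110 or cont >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep 5 payload bits from the lead and 6 from the followup, then join them.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)


encoded = chr(200).encode("utf-8")   # the two bytes 0xc3 0x88
print(decode_two_byte(encoded))
```

Shifting the lead byte's payload left by six makes room for the six
payload bits carried by the followup byte.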

http://youtu.be/vLBtrd9Ar28  (see chart @ 10:30)

Leading byte begins
0xxxxxxx - this is the only byte
110xxxxx - another byte after this
1110xxxx - two more bytes
11110xxx - three more bytes
111110xx - four more bytes (pre-RFC 3629 only)
1111110x - five more bytes (pre-RFC 3629 only)

x means 'room for payload'
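
As a rough sketch, the chart can be turned into a lookup on the
leading byte.  The function name sequence_length is hypothetical, and
the sketch assumes current UTF-8 (RFC 3629), so it only accepts the
one- to four-byte patterns and treats everything else as an error.

```python
def sequence_length(lead: int) -> int:
    """Return the total byte count announced by a UTF-8 leading byte (sketch)."""
    if lead >> 7 == 0b0:        # 0xxxxxxx: this is the only byte
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx: one more byte follows
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx: two more bytes follow
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx: three more bytes follow
        return 4
    # 10xxxxxx is a followup byte; longer patterns are not valid today.
    raise ValueError("not a valid UTF-8 leading byte: " + bin(lead))


for code_point in (65, 200, 0x20AC, 0x1F600):
    first = chr(code_point).encode("utf-8")[0]
    print(hex(code_point), bin(first), sequence_length(first))
```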

Console session:

Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:09:56)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
sys.path.extend(['/Users/kurner/Documents'])

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> chr(200)
'È'
>>> bytes(chr(200), encoding='utf-8')
b'\xc3\x88'
>>> bin(0xc3)  # left byte in bits
'0b11000011'
>>> bin(0x88)  # right byte in bits
'0b10001000'
>>> 0b11001000  # payload bits 00011 + 001000 joined: the encoded number
200
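
The same round trip can be checked for the whole upper half of
Latin-1 rather than just code point 200.  A minimal sketch, assuming
Python 3; round_trip is an illustrative helper name only.

```python
def round_trip(code_point: int):
    """Encode a code point to UTF-8, decode it again, and report both."""
    encoded = chr(code_point).encode("utf-8")
    decoded = encoded.decode("utf-8")
    return encoded, ord(decoded)


# Code points 128-255 are the part of Latin-1 that lies beyond ASCII.
results = {cp: round_trip(cp) for cp in range(128, 256)}

# Every one of them should take two bytes and come back unchanged.
print(all(len(enc) == 2 and back == cp for cp, (enc, back) in results.items()))
print(results[200])
```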