A few questions about encoding

Chris Angelico rosuav at gmail.com
Thu Jun 20 13:21:39 EDT 2013


On Fri, Jun 21, 2013 at 3:17 AM, MRAB <python at mrabarnett.plus.com> wrote:
> On 20/06/2013 17:37, Chris Angelico wrote:
>>
>> On Fri, Jun 21, 2013 at 2:27 AM,  <wxjmfauth at gmail.com> wrote:
>>>
>>> And all these coding schemes have something in common,
>>> they work all with a unique set of code points, more
>>> precisely a unique set of encoded code points (not
>>> the set of implemented code points (byte)).
>>>
>>> Just what the flexible string representation is not
>>> doing, it artificially divides unicode in subsets and tries
>>> to handle each subset differently.
>>>
>>
>>
>> UTF-16 divides Unicode into two subsets: BMP characters (encoded using
>> one 16-bit unit) and astral characters (encoded using two 16-bit units
>> in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
>> builds are guilty of exactly the same crime as the hated 3.3.
>>
> UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
> bytes, and those who previously used ASCII still need only 1 byte per
> codepoint!

Yes, but there's never (AFAIK) been a Python implementation that
represents strings in UTF-8; UTF-16 was one of two options for Python
2.2 through 3.2, and is the one that jmf always seems to be measuring
against.
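
For anyone who wants to see those subsets for themselves, here's a rough
sketch (Python 3). The sample characters and the helper names utf8_len
and utf16_units are just my own picks for illustration, not anything
from upthread:

```python
# Illustrative sketch: count the code units each encoding needs per code point.
# The sample characters and helper names are assumptions chosen for the demo.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_len(ch):
    """Number of bytes UTF-8 uses for a single code point."""
    return len(ch.encode("utf-8"))


def utf16_units(ch):
    """Number of 16-bit units UTF-16 uses for a single code point."""
    # Use an explicit byte order so no BOM is added to the output.
    return len(ch.encode("utf-16-le")) // 2


for ch in samples:
    print("U+%04X  utf-8 bytes: %d  utf-16 units: %d"
          % (ord(ch), utf8_len(ch), utf16_units(ch)))
```

The BMP characters should each come out as one UTF-16 unit and the astral
one as two (a surrogate pair), while the UTF-8 lengths climb from 1 to 4.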

ChrisA


