A few questions about encoding

MRAB python at mrabarnett.plus.com
Thu Jun 20 13:17:12 EDT 2013


On 20/06/2013 17:37, Chris Angelico wrote:
> On Fri, Jun 21, 2013 at 2:27 AM,  <wxjmfauth at gmail.com> wrote:
>> And all these coding schemes have something in common,
>> they all work with a unique set of code points, more
>> precisely a unique set of encoded code points (not
>> the set of implemented code points (byte)).
>>
>> Just what the flexible string representation is not
>> doing, it artificially divides Unicode into subsets and tries
>> to handle each subset differently.
>>
>
>
> UTF-16 divides Unicode into two subsets: BMP characters (encoded using
> one 16-bit unit) and astral characters (encoded using two 16-bit units
> in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
> builds are guilty of exactly the same crime as the hated 3.3.
>
UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!
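
For anyone who wants to check this at the interactive prompt, here is a
rough sketch (Python 3). The sample characters and the helper names
utf8_len and utf16_units are my own picks for illustration, not anything
from the thread; it just counts how many bytes UTF-8 and how many 16-bit
units UTF-16 need for one codepoint from each range.

```python
# One sample codepoint from each UTF-8 length class (arbitrary choices):
# ASCII letter, Latin-1 letter, BMP symbol, astral (non-BMP) character.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_len(ch):
    """Number of bytes UTF-8 uses for a single codepoint."""
    return len(ch.encode("utf-8"))


def utf16_units(ch):
    """Number of 16-bit code units UTF-16 uses for a single codepoint."""
    # "utf-16-le" avoids the BOM that plain "utf-16" would prepend.
    return len(ch.encode("utf-16-le")) // 2


for ch in samples:
    print("U+%04X  utf-8 bytes: %d  utf-16 units: %d"
          % (ord(ch), utf8_len(ch), utf16_units(ch)))
```

The ASCII character stays at 1 byte in UTF-8, and only the astral
character needs two units (a surrogate pair) in UTF-16.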



