Micro Python -- a lean and efficient implementation of Python 3

Wed Jun 4 03:10:34 EDT 2014

On Wed, Jun 4, 2014 at 5:00 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 6/4/2014 1:55 AM, Ian Kelly wrote:
>>
>>
>> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve at pearwood.info> wrote:
>>  > For technical reasons which I don't fully understand, Unicode only
>>  > uses 21 of those 32 bits, giving a total of 1114112 available code
>>  > points.
>>
>> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
>> sufficient to encode up to 16 supplementary planes, so if Unicode were
>> allowed to grow any larger than that, UTF-16 would no longer be able to
>> encode all codepoints.
>
>
> I believe the original utf-8 used up to 6 bytes per char to encode 2**32
> potential chars. Just 4 bytes limits to 2**21 and for whatever reason
> (easier decoding?), utf-8 was revised down (unusual ;-).

I understood it to be UTF-16's fault, per Ian's statement. That is to
say, the entire Unicode standard was warped around the problem that
some people were going around thinking "a character is 16 bits", even
though that's just as fallacious as "a character is 8 bits".
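
For what it's worth, the arithmetic behind Ian's point can be checked
in a few lines of Python. This is only a rough sketch: the constant
names and the to_surrogates() helper are made up for illustration and
are not part of any library. It counts how many code points a surrogate
pair can address, and compares that with the 21 payload bits of a
4-byte UTF-8 sequence.

```python
# Surrogate ranges reserved by UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogates

high_count = HIGH_END - HIGH_START + 1  # 1024 lead values
low_count = LOW_END - LOW_START + 1     # 1024 trail values

# Every (high, low) pair addresses one supplementary code point.
supplementary = high_count * low_count  # 2**20, i.e. 16 planes of 65536
bmp = 0x10000                           # plane 0, one 16-bit unit each
total = bmp + supplementary             # 17 planes in all

# A 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 payload bits.
utf8_4byte_bits = 3 + 3 * 6


def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000               # 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


print(total, hex(total - 1))            # code point count, highest one
print(utf8_4byte_bits, 2 ** utf8_4byte_bits)
print([hex(u) for u in to_surrogates(0x10FFFF)])
```

So the top of the range, U+10FFFF, is exactly the last code point a
surrogate pair can reach, while 4-byte UTF-8 on its own would have room
for nearly twice as many.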

ChrisA