Micro Python -- a lean and efficient implementation of Python 3

Chris Angelico rosuav at gmail.com
Wed Jun 4 03:20:51 EDT 2014


On Wed, Jun 4, 2014 at 3:02 PM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
> On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody <rustompmody at gmail.com> wrote:
>>> 1) Most or all Chinese and Japanese characters
>>
>> Don't know how you count 'most'
>>
>> | One possible rationale is the desire to limit the size of the full
>> | Unicode character set, where CJK characters as represented by discrete
>> | ideograms may approach or exceed 100,000 (while those required for
>> | ordinary literacy in any language are probably under 3,000). Version 1
>> | of Unicode was designed to fit into 16 bits and only 20,940 characters
>> | (32%) out of the possible 65,536 were reserved for these CJK Unified
>> | Ideographs. Later Unicode has been extended to 21 bits allowing many
>> | more CJK characters (75,960 are assigned, with room for more).
>>
>> | From http://en.wikipedia.org/wiki/Han_unification
>
> So there are 20,940 CJK characters in the BMP, and approximately
> 55,000 more in the SIP.  I'd count 55,000 out of 75,960 as "most".

And I said "or all" because I have a vague notion that either NFC or
NFD pushes some characters out of the BMP, although I may be wrong on
that. But certainly 55K/75K "with room for more" is the "most" that I
was talking about. (Maybe it isn't "most" by usage. After all,
hypertext documents are usually smaller in UTF-8 than in UTF-16, even
though "most characters" (counting purely by the 21-bit codepoint
space) are no larger in UTF-16 than in UTF-8; by usage, ASCII
dominates, because hypertext involves a lot of punctuation and such.
But still, there are a lot of CJK characters that aren't in the BMP.)
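
For anyone who wants to check both points, here is a rough sketch
using only the standard library. The helper names (encoded_sizes,
bmp_chars_leaving_bmp) and the sample strings are made up for
illustration, and the normalization result depends on the Unicode
data shipped with your Python, so treat it as an experiment rather
than a reference.

```python
import unicodedata

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def bmp_chars_leaving_bmp(form):
    """List BMP characters whose normalized form has a non-BMP code point."""
    found = []
    for cp in range(BMP_MAX + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        ch = chr(cp)
        if any(ord(c) > BMP_MAX for c in unicodedata.normalize(form, ch)):
            found.append(ch)
    return found


# Hypothetical samples: markup-like ASCII, one BMP ideograph, one SIP ideograph.
samples = {
    "ASCII-heavy markup": '<p class="x">Hello</p>',
    "BMP ideograph U+4E2D": "\u4e2d",
    "SIP ideograph U+20000": "\U00020000",
}
for label, text in samples.items():
    print(label, encoded_sizes(text))

# Count BMP characters that each normalization form maps outside the BMP.
for form in ("NFC", "NFD"):
    print(form, len(bmp_chars_leaving_bmp(form)))
```

The first loop prints the byte counts per encoding for each sample;
the second prints how many BMP characters each normalization form
maps to something outside the BMP.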

ChrisA


