[Python-Dev] Re: Re: Alternative Implementation for
PEP 292: Simple String Substitutions
M.-A. Lemburg
mal at egenix.com
Tue Sep 14 15:56:09 CEST 2004
Terry Reedy wrote:
> "Fredrik Lundh" <fredrik at pythonware.com> wrote in message
> news:ci3g2d$m3g$1 at sea.gmane.org...
>
>>usually shorter in languages with many ideographs (my non-scientific
>>tests indicate that chinese text uses about 4 times less symbols than
>>english; I'm sure someone can dig up better figures).
>
> This is why I am not especially enamored of Unicode and the prospect of
> Python becoming married to it. It is heavily weighted in favor of
> efficiently representing Chinese and inefficiently representing English.
Hmm, the Asian world has a very different view on these things.
Representing English ASCII text in UTF-8 is very efficient (one byte per
character), while typical Asian texts take about 1.5 to 2 times as much
space in UTF-8 as in one of the respective Asian encodings. Take, for
example, this Japanese translation of the Bible (only parts of the
New Testament):
http://www.cozoh.org/denmo/
>>> bible = unicode(open('denmo.txt', 'rb').read(), 'shift-jis')
>>> len(bible)
386980
>>> len(bible.encode('utf-8'))
1008272
>>> len(bible.encode('shift-jis'))
697626
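To put the three lengths above into ratios, they can simply be divided out. This is a small sketch in Python 3 syntax (the session above is Python 2); the variable names are illustrative and not part of the original session:

```python
# Lengths copied from the interactive session above.
code_points = 386980      # len(bible)
utf8_bytes = 1008272      # len(bible.encode('utf-8'))
sjis_bytes = 697626       # len(bible.encode('shift-jis'))

# Size of the UTF-8 form relative to the Shift-JIS form.
utf8_vs_sjis = utf8_bytes / sjis_bytes

# Average number of bytes per code point in each encoding.
utf8_per_cp = utf8_bytes / code_points
sjis_per_cp = sjis_bytes / code_points

print("UTF-8 / Shift-JIS size ratio:   %.2f" % utf8_vs_sjis)
print("UTF-8 bytes per code point:     %.2f" % utf8_per_cp)
print("Shift-JIS bytes per code point: %.2f" % sjis_per_cp)
```

For this particular file the UTF-8 form comes out at a little under 1.5 times the Shift-JIS size. The averages stay below 3 and 2 bytes per code point because the text also contains ASCII characters (spaces, line ends, digits, braces), which take a single byte in both encodings.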
Some stats:
-----------
Number of unique code points: 1512
Code point frequency (truncated):
u'\u305f' : =================================
u' ' : =============================
u'\u306e' : ===========================
u'\uff0c' : ==========================
u'\r' : ========================
u'\n' : ========================
u'\u306b' : =====================
u'\u3044' : =================
u'\u3066' : =================
u'\u3057' : ================
u'\u3002' : ================
u'\u306f' : ================
u'\u306a' : ===============
u'\u3092' : ==============
u'\u3068' : ============
u'\u308b' : ============
u'\u3089' : ===========
u'\u3063' : ===========
u':' : ===========
u'}' : ===========
u'{' : ===========
u'\u304c' : ==========
u'\u308c' : ==========
u'\u304b' : =========
u'\u3067' : =========
u'1' : =========
u'\u5f7c' : ========
u'\u3053' : ========
u'\u3042' : =======
u'\u3061' : =======
u'\u3046' : =======
u'2' : =======
...
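The post does not show how these stats were produced. A minimal sketch of one way to get a similar listing follows, in Python 3 syntax. The function name `code_point_histogram`, the sample string, and the bar scaling (longest bar for the most frequent code point) are all assumptions made for illustration:

```python
from collections import Counter

def code_point_histogram(text, top=10, width=33):
    """Return (number of unique code points, [(char, bar), ...]).

    Bars are scaled so that the most frequent code point gets
    `width` characters; this scaling is an assumption.
    """
    counts = Counter(text)          # iterating a str yields code points
    if not counts:
        return 0, []
    most_common = counts.most_common(top)
    peak = most_common[0][1]
    rows = []
    for char, count in most_common:
        bar = "=" * max(1, round(width * count / peak))
        rows.append((char, bar))
    return len(counts), rows

# Hypothetical sample text; the original used the full denmo.txt file.
sample = "彼はこのことを彼らに話した。\r\n彼らはこれを聞いた。\r\n"
unique, rows = code_point_histogram(sample, top=5)
print("Number of unique code points:", unique)
for char, bar in rows:
    # ascii() prints non-ASCII characters as \uXXXX escapes.
    print("%-10s : %s" % (ascii(char), bar))
```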
As you can see, most code points live in the 0x3000 area (hiragana and
CJK punctuation). These code points require 3 bytes in UTF-8, but only
2 bytes in UTF-16.
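The per-code-point sizes are easy to check directly. The following is a small sketch in Python 3 syntax; the helper name `encoded_sizes` is made up for this example, and the sample characters are taken from the frequency list above plus an ASCII letter for comparison:

```python
def encoded_sizes(ch):
    """Return the byte length of ch in (UTF-8, UTF-16, Shift-JIS)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # -le variant: no BOM is added
        len(ch.encode("shift-jis")),
    )

# An ASCII letter, then code points from the frequency list above.
samples = ["A", "\u305f", "\u306e", "\u3002", "\uff0c", "\u5f7c"]

print("%-10s %6s %6s %9s" % ("char", "UTF-8", "UTF-16", "Shift-JIS"))
for ch in samples:
    utf8, utf16, sjis = encoded_sizes(ch)
    print("%-10s %6d %6d %9d" % (ascii(ch), utf8, utf16, sjis))
```

The ASCII letter takes 1 byte in UTF-8 and Shift-JIS but 2 bytes in UTF-16, while the Japanese code points take 3 bytes in UTF-8 and 2 bytes in both UTF-16 and Shift-JIS.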
> To give English equivalent treatment, the 20,000 or so most common words,
> roots, prefixes, and suffixes would each get its own codepoint.
I suggest you take this one up with the Unicode Consortium :-)
--
Marc-Andre Lemburg
eGenix.com