[Python-Dev] Re: Re: Alternative Implementation for PEP 292: Simple String Substitutions

M.-A. Lemburg mal at egenix.com
Tue Sep 14 15:56:09 CEST 2004


Terry Reedy wrote:
> "Fredrik Lundh" <fredrik at pythonware.com> wrote in message 
> news:ci3g2d$m3g$1 at sea.gmane.org...
> 
>>usually shorter in languages with many ideographs (my non-scientific
>>tests indicate that chinese text uses about 4 times less symbols than
>>english; I'm sure someone can dig up better figures).
> 
> This is why I am not especially enamored of Unicode and the prospect of 
> Python becoming married to it.  It is heavily weighted in favor of 
> efficiently representing Chinese and inefficiently representing English. 

Hmm, the Asian world has a very different view on these things.

Representing English ASCII text in UTF-8 is very efficient (one byte per
character), while typical Asian texts need between 1.5 and 2 times as much
space in UTF-8 as in one of the respective native Asian encodings. Take, for
example, the Japanese translation of the Bible (only parts of the New
Testament) from:

	http://www.cozoh.org/denmo/

 >>> bible = unicode(open('denmo.txt', 'rb').read(), 'shift-jis')
 >>> len(bible)
386980
 >>> len(bible.encode('utf-8'))
1008272
 >>> len(bible.encode('shift-jis'))
697626
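
The session above is Python 2 and depends on having denmo.txt locally. A
minimal Python 3 sketch of the same comparison follows; the sample sentence
and the helper name encoded_sizes are made up for illustration and are not
taken from the Bible text.

```python
# Python 3 sketch. The sample sentence is a hypothetical stand-in for
# denmo.txt, and encoded_sizes() is an illustrative helper.
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "shift-jis")):
    """Return a dict mapping each encoding name to the encoded byte length."""
    return {enc: len(text.encode(enc)) for enc in encodings}

sample = "彼はこれを見た。"  # 8 code points, all kana, kanji or CJK punctuation
sizes = encoded_sizes(sample)
for enc, size in sizes.items():
    # Show total bytes and the average number of bytes per code point.
    print(f"{enc:10s} {size:3d} bytes ({size / len(sample):.2f} per code point)")
```

For the figures in the session above, 1008272 / 697626 is roughly 1.45, and
the UTF-8 form averages about 2.6 bytes per code point against about 1.8 for
Shift-JIS.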

Some stats:
-----------

Number of unique code points: 1512

Code point frequency (truncated):

u'\u305f' : =================================
u' '      : =============================
u'\u306e' : ===========================
u'\uff0c' : ==========================
u'\r'     : ========================
u'\n'     : ========================
u'\u306b' : =====================
u'\u3044' : =================
u'\u3066' : =================
u'\u3057' : ================
u'\u3002' : ================
u'\u306f' : ================
u'\u306a' : ===============
u'\u3092' : ==============
u'\u3068' : ============
u'\u308b' : ============
u'\u3089' : ===========
u'\u3063' : ===========
u':'      : ===========
u'}'      : ===========
u'{'      : ===========
u'\u304c' : ==========
u'\u308c' : ==========
u'\u304b' : =========
u'\u3067' : =========
u'1'      : =========
u'\u5f7c' : ========
u'\u3053' : ========
u'\u3042' : =======
u'\u3061' : =======
u'\u3046' : =======
u'2'      : =======
...
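
A chart like the one above can be produced along these lines. This is only a
sketch: the function name, the U+XXXX label format and the scaling of the
bars (most frequent code point gets `width` marks) are assumptions, not the
script that generated the original output.

```python
from collections import Counter

def codepoint_histogram(text, top=10, width=33):
    """Return bar lines for the `top` most frequent code points in `text`.

    The bar length is scaled so that the most frequent code point gets
    `width` '=' marks; every listed code point gets at least one mark.
    """
    counts = Counter(text).most_common(top)
    if not counts:
        return []
    peak = counts[0][1]
    return [
        f"U+{ord(ch):04X} : {'=' * max(1, round(n * width / peak))}"
        for ch, n in counts
    ]

def unique_codepoints(text):
    """Number of distinct code points in `text`."""
    return len(set(text))

if __name__ == "__main__":
    demo = "たたたののに"  # tiny made-up sample
    print("Number of unique code points:", unique_codepoints(demo))
    print("\n".join(codepoint_histogram(demo, width=9)))
```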

As you can see, most of the frequent code points lie in the U+30xx range
(CJK punctuation and hiragana). Each of these takes 3 bytes in UTF-8, but
only 2 bytes in UTF-16.
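
The byte counts follow directly from the code point value. A small Python 3
sketch (the helper names utf8_len and utf16_len are illustrative, and
surrogate code points are not handled):

```python
def utf8_len(cp):
    """Bytes needed to encode code point `cp` in UTF-8."""
    if cp < 0x80:
        return 1      # ASCII
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3      # rest of the BMP, including U+3000..U+30FF
    return 4          # supplementary planes

def utf16_len(cp):
    """Bytes needed to encode code point `cp` in UTF-16 (without a BOM)."""
    return 2 if cp < 0x10000 else 4

if __name__ == "__main__":
    for ch in ("a", "\u305f", "\u5f7c", "\uff0c"):
        cp = ord(ch)
        print(f"U+{cp:04X}: UTF-8 {utf8_len(cp)} bytes, UTF-16 {utf16_len(cp)} bytes")
```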

> To give English equivalent treatment, the 20,000 or so most common words, 
> roots, prefixes, and suffixes would each get its own codepoint.

I suggest you take this one up with the Unicode Consortium :-)

-- 
Marc-Andre Lemburg
eGenix.com


