printing list containing unicode string

Sion Arrowsmith siona at chiark.greenend.org.uk
Tue Sep 11 12:46:13 EDT 2007


Xah Lee  <xah at xahlee.org> wrote:
> "  It's very wasteful of space. In most texts, the majority of the
>code points are less than 127, or less than 255, so a lot of space is
>occupied by zero bytes. "
>
>Not true. In Asia, most chars has unicode number above 255. Considered
>globally, *possibly* today there are more computer files in Chinese
>than in all latin-alphabet based lang.

This doesn't hold water. There are many good reasons for preferring
UTF16 over UTF8, but unless you know you're only ever going to be
handling scripts from Unicode blocks above Arabic, it's reasonable
to assume that UTF8 will be at least as compact. Consider that
transcoding a Chinese file from UTF16 to UTF8 will probably increase
its size by 50% (the CJK ideograph blocks encode to 3 bytes in UTF8,
against 2 in UTF16), while transcoding a document in a Western
European language the other way can be expected to increase its size
by up to 100% (every single-byte character is doubled). You'd have to
be talking about double the volume of CJK data before switching from
UTF8 to UTF16 becomes even a break-even proposition space-wise.
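
For anyone who wants to check the per-character figures, here is a rough
sketch in Python 3 syntax. The helper name encoded_sizes and the two sample
strings are made up for illustration, and "utf-16-le" is used on the
assumption that you don't want the 2-byte byte-order mark counted:

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string.

    "utf-16-le" is an assumption made here so that no byte-order
    mark is added to the UTF-16 total.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples: plain ASCII text and four CJK ideographs.
western = "The quick brown fox"
chinese = "\u4e2d\u6587\u6587\u672c"

print(encoded_sizes(western))  # (19, 38): UTF-16 doubles the size
print(encoded_sizes(chinese))  # (12, 8): UTF-8 is 50% larger
```

Real documents mix scripts, markup and whitespace, so the ratios for an
actual file will fall somewhere between these two extremes.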

(It's curious to note that the average word length in English is
often taken to be 6 letters. Similarly, in UTF8-encoded Chinese the
average word length is 6 bytes....)

-- 
\S -- siona at chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
   "Frankly I have no feelings towards penguins one way or the other"
        -- Arthur C. Clarke
   her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump


