printing list containing unicode string

J. Cliff Dyer jcd at sdf.lonestar.org
Mon Sep 10 22:49:31 EDT 2007


Xah Lee wrote:
> This post is about some notes and corrections to a online article
> regarding unicod and python.
>
> --------------
>
> by happenstance i was reading:
>
> Unicode HOWTO
> http://www.amk.ca/python/howto/unicode
>
> Here's some problems i see:
>
> ・ No conspicuous authorship. (however, oddly, it has a conspicuous
> acknowledgement of names listing.) (This problem is a indirect
> consequence of communism fanatism ushered by OpenSource movement)
> (Originally i was just going to write to the author on some
> corrections.)
>
> ・ It's very wasteful of space. In most texts, the majority of the
> code points are less than 127, or less than 255, so a lot of space is
> occupied by zero bytes.
>
> Not true. In Asia, most chars has unicode number above 255. Considered
> globally, *possibly* today there are more computer files in Chinese
> than in all latin-alphabet based lang.
>
That's an interesting point. I'd be interested to see numbers on
that, and how those numbers have changed over the past five years.
Sadly, such data is most likely impossible to obtain.

However, it should be pointed out that most *code*, whether written in
the United States, New Zealand, India, China, or Botswana, is written
in English. This is partly because English has become a standard of
sorts, much as Italian was a standard for musical notation, partly
because of the US's former (and perhaps current, but certainly fading)
dominance in the field, and partly because of the lack of solid
support for Unicode among many programming languages and compilers.
Thus the author's bias, while inaccurate, is still understandable.
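
To put rough numbers on the space argument, here is a minimal sketch
that compares the encoded size of a short English string and a short
Chinese string. The sample strings and the helper name encoded_sizes
are my own, picked only for illustration, and the sketch assumes
Python 3 (where str is Unicode). The "-le" codecs are used so that no
byte order mark is counted.

```python
# Sample strings are illustrative assumptions, not taken from the HOWTO.
SAMPLES = {
    "english": "Hello, world",
    "chinese": "你好，世界",
}

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for name, text in SAMPLES.items():
    sizes = encoded_sizes(text)
    # Count the zero bytes a 4-byte fixed-width encoding produces.
    zero_bytes = text.encode("utf-32-le").count(b"\x00")
    print(name, len(text), "chars:", sizes, "zero bytes in UTF-32:", zero_bytes)
```

For these two samples, UTF-8 is the smallest for the English text and
UTF-16 is the smallest for the Chinese text, while the 4-byte encoding
is the largest for both and is full of zero bytes.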

> ・ Many Internet standards are defined in terms of textual data, and
> can't handle content with embedded zero bytes.
>
> Not sure what he mean by "can't handle content with embedded zero
> bytes". Overall i think this sentence is silly, and he's probably
> thinking in unix/linux.
>
> ・ Encodings don't have to handle every possible Unicode
> character, ....
>
> This is inane. A encoding, by definition, turns numbers into binary
> numbers (in our context, it means a encoding handles all unicode chars
> by definition). What he really meant to say is something like this:
> "Practically speaking, most computer languages in western society
> don't need to support unicode with respect to the language's source
> file"
>
>> UTF-8 has several convenient properties:
> 1. It can handle any Unicode code point.
> ...
>
>
> As mentioned before, by definition, any Unicode encoding encodes all
> unicode char set. The mentioning of above as a "convenient property"
> is inane.
>
No, it's not inane. UCS-2, for example, is a fixed-width, 2-byte
encoding that can handle any Unicode code point up to 0xFFFF, but
cannot handle code points beyond that (the supplementary planes).
UCS-2 was developed for applications in which having fixed-width
characters is essential, but it has the limitation of not being able
to handle every Unicode code point. IIRC, when it was developed, it
did handle every code point, and then Unicode grew. There is also a
UCS-4, a fixed-width, 4-byte encoding, to address this limitation.
UTF-16 is based on a two-byte unit but is variable width, like UTF-8:
code points above 0xFFFF take two units (a surrogate pair). That makes
it flexible enough to handle any code point, but harder to process,
and a bear to seek through to a certain point.
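
A small sketch of the difference, assuming Python 3. Python has no
UCS-2 codec, so fits_in_ucs2 below is a hypothetical helper that only
checks the 0xFFFF limit; utf16_units is likewise my own name. The two
sample characters are arbitrary picks, one inside and one outside the
16-bit range.

```python
def fits_in_ucs2(ch):
    """Hypothetical check: can a single 16-bit unit represent this code point?"""
    return ord(ch) <= 0xFFFF

def utf16_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

euro = "\u20ac"        # EURO SIGN, within the 16-bit range
clef = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, beyond 0xFFFF

for ch in (euro, clef):
    print(
        "U+%04X" % ord(ch),
        "fits in UCS-2:", fits_in_ucs2(ch),
        "UTF-16 units:", [hex(u) for u in utf16_units(ch)],
        "UTF-8 bytes:", len(ch.encode("utf-8")),
        "UTF-32 bytes:", len(ch.encode("utf-32-be")),
    )
```

The second character needs two 16-bit units in UTF-16, which is why
you cannot find the Nth character of a UTF-16 string by multiplying N
by two.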

(I'm politely ignoring your ill-reasoned attacks on non-Microsoft OSes).

Cheers,
Cliff


