printing list containing unicode string

Xah Lee xah at xahlee.org
Mon Sep 10 22:26:20 EDT 2007


This post contains some notes on, and corrections to, an online
article about Unicode and Python.

--------------

by happenstance i was reading:

Unicode HOWTO
http://www.amk.ca/python/howto/unicode

Here are some problems i see:

・ No conspicuous authorship. (However, oddly, it has a conspicuous
list of acknowledged names.) (This problem is an indirect
consequence of the communist fanaticism ushered in by the OpenSource
movement.) (Originally i was just going to write to the author with
some corrections.)

・  It's very wasteful of space. In most texts, the majority of the
code points are less than 127, or less than 255, so a lot of space is
occupied by zero bytes.  

Not true. In Asia, most chars have Unicode code points above 255.
Considered globally, *possibly* today there are more computer files in
Chinese than in all latin-alphabet based languages.

・  Many Internet standards are defined in terms of textual data, and
can't handle content with embedded zero bytes. 

Not sure what he means by "can't handle content with embedded zero
bytes". Overall i think this sentence is silly, and he's probably
thinking in terms of unix/linux.
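One possible reading of "embedded zero bytes": in a 16-bit or 32-bit
encoding, every code point below 256 carries at least one 0x00 byte,
and C-style string handling treats that byte as the end of the string.
A minimal sketch, assuming Python 3 (the helper name zero_byte_count
and the sample strings are made up for illustration):

```python
def zero_byte_count(text, encoding):
    """Count the 0x00 bytes in text after encoding it."""
    return text.encode(encoding).count(b"\x00")

# An ASCII sample, then two Chinese chars written as escapes.
for sample in ("abc", "\u4e2d\u6587"):
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(repr(sample), enc, zero_byte_count(sample, enc))
```

In this sketch the ASCII sample picks up zero bytes in utf-16 and
utf-32 but none in utf-8, while the Chinese sample has none in utf-16
either.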

・  Encodings don't have to handle every possible Unicode
character, .... 

This is inane. An encoding, by definition, turns numbers into binary
numbers (in our context, it means a Unicode encoding such as utf-8 or
utf-16 handles all unicode chars by definition). What he really meant
to say is something like this: "Practically speaking, most computer
languages in western society don't need to support unicode with
respect to the language's source file."

・  
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
...
 

As mentioned before, by definition, any Unicode encoding encodes the
entire unicode char set. Mentioning the above as a "convenient
property" is inane.
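The point can be checked with a round trip. This is a minimal sketch,
assuming Python 3; the helper name roundtrips and the sample chars are
made up for illustration:

```python
UTF_ENCODINGS = ("utf-8", "utf-16", "utf-32")

def roundtrips(ch, encoding):
    """Return True if ch survives encode followed by decode unchanged."""
    try:
        return ch.encode(encoding).decode(encoding) == ch
    except UnicodeEncodeError:
        return False

# ASCII letter, accented letter, Chinese char, a char outside the BMP.
samples = ["a", "\u00e9", "\u4e2d", "\U0001d11e"]
for ch in samples:
    print("U+%04X" % ord(ch), [roundtrips(ch, enc) for enc in UTF_ENCODINGS])
```

A legacy charset codec such as latin-1, which is not one of the Unicode
encoding forms, fails this check for the Chinese sample; that may be
the kind of encoding the quoted sentence had in mind.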

・  4.UTF-8 is fairly compact; the majority of code points are turned
into two bytes, and values less than 128 occupy only a single byte. 

Note here that utf-8 is relatively compact only if most of your text
is in the latin alphabet. If you are not an occidental man and you
write Chinese, utf-8 is comparatively inefficient. (utf-8, as one of
the Unicode encodings, is probably comparatively inefficient for
Japanese, Korean, Arabic, or any non-latin-alphabet based languages.)
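The byte counts behind this can be checked directly. A minimal sketch,
assuming Python 3 (the helper name utf8_len is made up for
illustration). Only U+0080 through U+07FF take two bytes in utf-8;
Chinese, Japanese and Korean chars in the Basic Multilingual Plane take
three, and anything above U+FFFF takes four:

```python
def utf8_len(code_point):
    """Number of bytes utf-8 uses for one code point (surrogates excluded)."""
    return len(chr(code_point).encode("utf-8"))

# Boundaries of the 1, 2, 3 and 4 byte ranges, plus one Chinese char.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x4E2D, 0xFFFF, 0x10000, 0x10FFFF):
    print("U+%06X -> %d byte(s)" % (cp, utf8_len(cp)))
```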

Also note, the article overly focuses on utf-8. Microsoft's Windows NT
is probably the first major operating system that supported unicode
thoroughly, and it uses utf-16. For much of America and Europe, which
are currently roughly the leaders in computing, utf-8 is more efficient
in some sense (e.g. at least in disk space requirements). But
considering global computing, in particular Chinese & Japanese, utf-16
is overall superior to utf-8.
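The disk space side of this comparison can be sketched as follows,
assuming Python 3. The helper name encoded_sizes and both sample
strings are made up for illustration, and the byte order mark is left
out so that only the text itself is counted:

```python
def encoded_sizes(text):
    """Return the byte length of text in utf-8 and in utf-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

latin = "Hello, world"
chinese = "\u4f60\u597d\u4e16\u754c"  # four Chinese chars, written as escapes

print(encoded_sizes(latin))    # utf-8 is the smaller one here
print(encoded_sizes(chinese))  # utf-16 is the smaller one here
```

Real documents also contain spaces, digits and ASCII markup, so the
actual ratio depends on the content.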

Part of the reason that utf-8 is favored in this article has to do
with Linux (and unix). The reason unixes in general have chosen utf-8
instead of utf-16 is largely because unix is one motherfucking bag of
shit, such that it is impossible to support utf-16 without scrapping a
large chunk of unix things.

PS I did not read the article in detail, but only roughly, to see how
Python handles unicode, because i was often confused by python's
encode/decode/unicode methods and functions.
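For reference, a minimal sketch of which direction each method goes,
written for Python 3 (an assumption; in the Python 2 of this post's
time the text type was called unicode and the byte type str): encode
goes from text to bytes, decode goes from bytes back to text.

```python
text = "\u4e2d\u6587"         # a text string of two Chinese chars
data = text.encode("utf-8")   # text -> bytes
print(data)                   # the raw utf-8 bytes, 3 per char here
back = data.decode("utf-8")   # bytes -> text
print(back == text)           # the round trip gives the same text
```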

... am gonna continue reading that article about Python specific
issues...

Also note, this post is posted thru groups.google.com, and it contains
the double angled quotation mark chars. As of 2 weeks ago, these
quotation marks seem to be deleted in the process of posting, i.e.
unicode names: "LEFT-POINTING DOUBLE ANGLE QUOTATION MARK" and
"RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK". Here, i enclose the
double-angled quotation marks inside double curly quotes: "  ". If
inside the double curly quotes you see spaces, then that means google
groups fucked up.
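A received copy of a post can be checked for these two chars with a few
lines of Python. This is a minimal sketch, assuming Python 3; the
helper name has_angle_quotes is made up for illustration, and the chars
are written as escapes so the code itself survives any transport:

```python
import unicodedata

LEFT = "\u00ab"   # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
RIGHT = "\u00bb"  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

def has_angle_quotes(text):
    """Return True if both double angle quotation marks occur in text."""
    return LEFT in text and RIGHT in text

print(unicodedata.name(LEFT))
print(unicodedata.name(RIGHT))
print(has_angle_quotes('"' + LEFT + RIGHT + '"'))  # chars intact
print(has_angle_quotes('"  "'))                    # chars replaced by spaces
```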

References and Further readings:

・ Unicode in Perl & Python
http://xahlee.org/perl-python/unicode.html

・ the Journey of a Foreign Character thru Internet
http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

・ Unicode Characters Example
http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

・ Python's unicodedata module
http://xahlee.org/perl-python/unicodedata_module.html

・ Emacs and Unicode Tips
http://xahlee.org/emacs/emacs_n_unicode.html

・ Java Tutorial: Unicode in Java
http://xahlee.org/java-a-day/unicode_in_java.html

・ Character Sets and Encoding in HTML
http://xahlee.org/js/html_chars.html

  Xah
  xah at xahlee.org
  http://xahlee.org/



