[I18n-sig] Re: [Python-Dev] Unicode debate

Sin Hang Kin kentsin@poboxes.com
Thu, 4 May 2000 12:04:14 +0800


> > I don't think I've heard a good *argument* for this rule though.  "A
> > character is a character is a character" sounds like an axiom to me --
> > something you can't prove or disprove rationally.
>
> I don't see it as an axiom, but rather as a design decision you make to
> keep your language simple. Along the lines of "all values are objects"
> and (now) all integer values are representable with a single type. Are
> you happy with this?

No. A character is not just a character.

Go to Google and run a search; the result page may well be an example of
mixed-encoding text:

Search engines index pages in their original encoding and present the
results as is, so the search result page will contain whatever encodings
were mixed in. If you see JIS, ISO 8859, Hebrew, Thai, UTF-8, Big-5, GB2312,
EUC and Shift-JIS, you should not be very surprised. So, if you argue that a
character is a character is a character, how would you handle such a
mixed-encoding text mess?

No one can write an automatic conversion program for such text; only if you
can treat it as 8-bit bytes can you make use of it. Otherwise it is a mess.
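
To make the point concrete, here is a minimal sketch in present-day Python
(bytes and str as separate types). The snippets, their encodings, and the
names `snippets` and `page` are made up for illustration; a real result page
would not tell you which encoding each piece uses. It shows that the mixed
page cannot be decoded with one codec, yet can still be split and searched
as plain 8-bit bytes:

```python
# Hypothetical search-result snippets, each kept in its source encoding.
snippets = [
    ("shift_jis", "日本語"),
    ("big5", "中文"),
    ("iso8859_8", "שלום"),
    ("utf-8", "ภาษาไทย"),
]

# The "page" is just the raw bytes of every snippet, one per line.
page = b"\n".join(text.encode(codec) for codec, text in snippets)

# Treating the whole page as one character encoding fails.
try:
    page.decode("utf-8")
    decoded_as_one = True
except UnicodeDecodeError:
    decoded_as_one = False

# Treating it as 8-bit bytes still works: split it, search it, pass it on.
lines = page.split(b"\n")
has_big5_snippet = "中文".encode("big5") in page

# Decoding is possible only per piece, and only if the encoding of each
# piece is known from somewhere else (assumed here; not true in general).
recovered = [raw.decode(codec) for raw, (codec, _) in zip(lines, snippets)]
```

The last step is the part no program can do automatically for arbitrary
input, because the encoding labels are exactly what the mixed page lacks.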

Backward compatibility is a must, not an extra feature we would like. At
least provide a way to handle such data in Python efficiently.

Being able to handle text on a character basis is very convenient for
everyone, especially for those who do not care about i18n. People who do
i18n text processing can build their own logic into the code and will not be
killed by surprise text. For applications that are not well prepared, the
sudden arrival of an ugly, unexpected encoding will certainly be fatal. Look
out at the net: you are well connected, and your world is polluted by things
from your connections. Isn't it beautiful?

Rgs,

Kent Sin