[I18n-sig] Random thoughts on Unicode and Python

Andy Robinson andy@reportlab.com
Sat, 10 Feb 2001 23:43:06 -0000


> Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use
> them on systems that are 8-bit clean and things "just work". You don't
> need to worry about embedded nulls or any other such noise. While you
> can't use len() to get the number of *characters* in a
> Shift-JIS/EUC-JP encoded string, you can find out how many "octets"
> are in it so you can loop over it and calculate the character length.
>
> In essence the Japanese (and Chinese and Koreans) are using the
> existing Python string type as a raw-byte string, and imposing the
> semantics over that.

That's my concern, and the thing I want to poll people on.
If Python "just works" for these users, and if we already offer
Unicode strings and a good codec library for people to use when they
want to, is there really a need to go further?
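
To make the len() point in the quoted text concrete, here is a minimal
sketch of counting characters by walking the octets. It assumes a modern
Python 3 bytes object standing in for the raw 8-bit string, and the usual
Shift-JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC). The function name
sjis_char_count is made up for illustration; it is not part of any library.

```python
def sjis_char_count(data: bytes) -> int:
    """Count characters in Shift-JIS data by stepping over lead bytes.

    Assumption: a lead byte in 0x81-0x9F or 0xE0-0xFC starts a two-octet
    character; every other octet is a single-octet character.
    """
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC:
            i += 2   # double-byte character: skip the trail byte too
        else:
            i += 1   # ASCII or half-width katakana
        count += 1
    return count


raw = "日本語 Python".encode("shift_jis")
print(len(raw))              # number of octets
print(sjis_char_count(raw))  # number of characters
```

No Unicode conversion happens in the counting loop; the semantics are
imposed on the raw octets by the application, as the quoted text describes.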

> Japanese and Chinese arguments against Unicode are often ideological:
> "It doesn't contain all of the characters we need." Of course they
> forget to mention that the character sets in regular use in these
> locales, JIS X 0201-1990, JIS X 0212-1990, GB 2312-80, and Big Five,
> are all represented in Unicode. The same is true for Korean: all of
> the hanja in KS C 5601 et al. are available in Unicode, as are the
> precomposed han'gul.

That's interesting. I have never heard that objection voiced before
and agree that it is unfounded.  I have seen objections based
on two specific families of problems:

(1) user-defined characters:  the big three Japanese encodings
use the Kuten space of 94x94 characters. There are lots of slight
vendor variations on the basic JIS0208 character set, as well
as people adding new Gaiji in their office workgroups. Generic
conversion routines from, say, EUC to Shift-JIS only do arithmetic
on the Kuten position, so they still work perfectly whether you
use Shift-JIS, cp932, or cp932 plus ten extra in-house characters.
Conversions to Unicode involve selecting new codecs, or even making
new ones, for all these situations.
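
A rough sketch of such a generic routine, assuming modern Python 3 bytes
and the standard EUC-JP to Shift-JIS byte arithmetic. The function name
euc_to_sjis is hypothetical, and the three-byte JIS X 0212 form of EUC-JP
is deliberately left out because Shift-JIS has no place for it.

```python
def euc_to_sjis(data: bytes) -> bytes:
    """Convert EUC-JP octets to Shift-JIS by arithmetic alone.

    No character table is consulted, so any cell of the 94x94 grid is
    converted, whether or not a standard assigns a character to it.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        if b1 < 0x80:                          # ASCII passes through
            out.append(b1)
            i += 1
        elif b1 == 0x8E and i + 1 < n:         # half-width katakana
            out.append(data[i + 1])
            i += 2
        elif (0xA1 <= b1 <= 0xFE and i + 1 < n
              and 0xA1 <= data[i + 1] <= 0xFE):
            j1 = b1 - 0x80                     # 7-bit JIS row byte
            j2 = data[i + 1] - 0x80            # 7-bit JIS cell byte
            s1 = ((j1 - 0x21) >> 1) + 0x81     # two rows share a lead byte
            if s1 > 0x9F:
                s1 += 0x40                     # skip the katakana range
            if j1 & 1:                         # odd row: lower trail range
                s2 = j2 + 0x1F
                if s2 >= 0x7F:
                    s2 += 1                    # 0x7F is never a trail byte
            else:                              # even row: upper trail range
                s2 = j2 + 0x7E
            out += bytes((s1, s2))
            i += 2
        else:
            raise ValueError(f"unsupported or truncated sequence at offset {i}")
    return bytes(out)


text = "日本語の漢字 abc"
print(euc_to_sjis(text.encode("euc_jp")) == text.encode("shift_jis"))
# An in-house character at an unassigned cell (row 85, cell 1) converts too:
print(euc_to_sjis(b"\xf5\xa1").hex())
```

The last call is the interesting one: the arithmetic does not care that
the cell is unassigned, whereas a table-driven Unicode codec has no mapping
for it unless someone writes one.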

(2) slightly corrupt data: Let's say you are dealing with files
or database fields containing some truncated kanji.  If you
use 8-bit-clean strings and no conversion, the data will not
be corrupted or changed; if you try to magically convert
it to Unicode you will get error messages or possibly even
more corruption.  Maybe you're writing an app whose job is
to get text from machine A to machine B without changing it;
suddenly it will stop working.  I know people who spent
weeks debugging a VB print spooler which was cutting up
PostScript files containing kanji.
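
A small illustration of the truncated-kanji case, again assuming modern
Python 3 (bytes as the raw 8-bit string) and its bundled shift_jis codec.
The helper names copy_raw and copy_via_unicode are invented for this
sketch.

```python
original = "日本語".encode("shift_jis")   # three kanji, two octets each
truncated = original[:-1]                 # last kanji cut in half


def copy_raw(data: bytes) -> bytes:
    """Move the octets from A to B without interpreting them."""
    return bytes(data)


def copy_via_unicode(data: bytes, errors: str = "strict") -> bytes:
    """Decode to Unicode and re-encode, as an implicit conversion would."""
    return data.decode("shift_jis", errors).encode("shift_jis", errors)


# The 8-bit-clean path hands the damaged data over unchanged.
print(copy_raw(truncated) == truncated)

# The strict Unicode path refuses the data outright.
try:
    copy_via_unicode(truncated)
except UnicodeDecodeError as exc:
    print("conversion failed:", exc)

# A lenient Unicode path "succeeds" but alters the octets.
print(copy_via_unicode(truncated, errors="replace") == truncated)
```

Neither Unicode path gets the text from machine A to machine B unchanged,
which is the failure mode described above.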

Suddenly upgrading to a new version of Python where all
your data undergoes invisible transformations to Unicode
and back is going to cause trouble for quite a few people.
Arguably, it is GOOD trouble which will force them to
standardise their character sets, document their extensions
and clean their data - but it is still going to be trouble.
It's a bit different in a language like Java which was
defined to be Unicode-based from day one.


- Andy