[I18n-sig] Re: I18n

Andy Robinson andy@robanal.demon.co.uk
Wed, 09 Feb 2000 02:10:40 GMT


On Tue, 08 Feb 2000 21:11:37 +0100, Jean-Claude Wippler wrote:
>I have an unrelated question: on developer day, someone in your i18n
>session, described why Unicode would not be acceptable in countries such
>as Japan.  I mentioned this to Cameron, who wants to know more.  But I
>lost the name/url of that person, can you help out?
>
Jean-Claude,  I have taken the liberty of forwarding this paragraph to
the new i18n-sig, which contains the people who made that remark!
Please join in to hear more...

Here is a very naive oversimplification, without most of the real
world mess:

There is a standard Japanese character set (Japanese Industrial
Standard 0208, or JIS-0208 for short) with 6,879 characters, which has
been more or less unchanged since 1978.  They are defined in a logical
94x94 space (the 'kuten table'), with some holes in it.  This
character set is commonly encoded in three different ways, all of
which aim for backward-compatibility with ASCII:

1. Shift-JIS is the native encoding on Windows and the Mac, and for
about half of the Japanese HTML on the internet.  It basically says
'if the first byte you read is less than 128, it is ASCII; if it falls
in certain ranges above 128, it is the first byte of a two-byte
kanji'.  There is also a phonetic syllabary called "half-width
katakana" encoded in the top half of the code page.  (A rough
byte-classification sketch follows this list.)

2. EUC-JP (Extended Unix Code, Japanese) is the encoding on Unix, and
covers the other half of the web pages on the Internet :-)  It does
something similar: bytes below 128 are ASCII, and higher values are
usually the first byte of a two-byte kanji.

3. JIS is an older encoding designed for mail and news.  It uses
escape sequences to indicate switching from double-byte to single-byte
mode and vice versa.
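
To make the byte ranges in (1) concrete, here is a rough sketch of how
one might classify a single Shift-JIS byte, written in Python.  The
lead-byte ranges shown (0x81-0x9F and 0xE0-0xEF) cover plain JIS-0208
only, and ignore the vendor extensions I mention further down:

    def classify_sjis_byte(b):
        # b is an integer byte value in the range 0-255
        if b < 0x80:
            return 'ascii'
        if 0xA1 <= b <= 0xDF:
            return 'half-width katakana'
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            return 'kanji lead byte'
        return 'extension or invalid'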

None of the three contains null bytes or control characters, so most
8-bit-safe software works fine with data in these encodings - you
might not be able to see the Japanese in your English word processor,
but it will be preserved intact.  All three are very widely used, and
are the de facto encodings we have to deal with.

(those of us in the IBM world also have to cope with the DBCS-Host
encoding, which is a can of worms I won't afflict you with).

Because they all derive from the 'kuten table', there are neat
algorithmic conversions between them which run very fast and need no
lookup tables.  It is a very common requirement in Japanese IT to
convert between these - for example, to convert a directory of HTML
files from EUC to Shift-JIS.  If such a direct routine exists, we
don't want the overhead of going through Unicode.
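
As an illustration, here is a minimal sketch of that table-free
EUC-JP to Shift-JIS conversion, written in Python with bytes in and
bytes out.  It covers only ASCII plus the plain JIS-0208 range - no
half-width katakana, no JIS X 0212, no vendor extensions - so treat
it as an illustration rather than a production converter:

    def euc_to_sjis(data):
        out = bytearray()
        i = 0
        while i < len(data):
            if data[i] < 0x80:            # ASCII passes through untouched
                out.append(data[i])
                i += 1
                continue
            # Strip the high bit of each EUC byte to get the JIS row/cell.
            j1, j2 = data[i] - 0x80, data[i + 1] - 0x80
            # Arithmetic JIS -> Shift-JIS mapping: two JIS rows share one
            # Shift-JIS lead byte.
            s1 = ((j1 - 0x21) >> 1) + 0x81
            if s1 > 0x9F:                 # hop over the half-width katakana block
                s1 += 0x40
            if j1 & 1:                    # odd row
                s2 = j2 + (0x1F if j2 <= 0x5F else 0x20)
            else:                         # even row
                s2 = j2 + 0x7E
            out.append(s1)
            out.append(s2)
            i += 2
        return bytes(out)

For example, euc_to_sjis(b'\xb0\xa1') returns b'\x88\x9f' - the same
kanji re-encoded in Shift-JIS, with no lookup table anywhere.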

Imagine we had a few higher-level functions on top of our encodings
API, such as convertString(data, input_encoding, output_encoding).
The default behaviour of such a function would be to go through
Unicode as a central point.  All we need for Japan is to say that if a
filter exists on your system which can go direct from EUC-JP to
Shift-JIS, use it rather than going through Unicode.  I am sure we can
accommodate this; MAL's spec defines a good API, and I think what we
need is a higher level on top of it.
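
Something along these lines is all it would take; the names below
(convert_string, register_filter) are my own invention for
illustration, not part of MAL's spec:

    _direct_filters = {}   # (input_encoding, output_encoding) -> callable

    def register_filter(input_encoding, output_encoding, func):
        _direct_filters[(input_encoding, output_encoding)] = func

    def convert_string(data, input_encoding, output_encoding):
        # Prefer a registered direct filter, e.g. EUC-JP -> Shift-JIS...
        direct = _direct_filters.get((input_encoding, output_encoding))
        if direct is not None:
            return direct(data)
        # ...otherwise go through Unicode as the central point.
        return data.decode(input_encoding).encode(output_encoding)

On a system with a direct converter installed, a single
register_filter('euc-jp', 'shift_jis', euc_to_sjis) call would let
every conversion between that pair skip Unicode entirely.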

The real world is messier than I have indicated, and there are
actually many corporate variations on the JIS-0208 character set - IBM
and Microsoft add an extra 360 characters, NEC adds about 94, and
companies always define their own 'user-defined characters'.  This is
where Unicode breaks down badly.  These additions are in well-known
locations in the 'kuten table', but the mappings to Unicode are not
standard.  So if you need to go outside the strict JIS-0208 character
set, you cannot trust Unicode to work as a 'central point'.  That's
when the direct filters are needed.  As an example of this, I worked
all last year on a project where we used the Microsoft character set
(360 characters bigger than JIS-0208) plus a small set of user-defined
characters, but it all broke when we had to serve web pages through
Java's encoding libraries, which would not handle the extras.

As a more general point, the business requirements of someone working
in this field are usually to "move data from A to B", where A and B
are not Unicode.  Unicode is a very useful tool which can sit in the
middle most of the time, and Unicode character properties solve many
problems in the CJKV world - but not all of them.

There are also some common cleanup operations one can perform on
Japanese - equivalent to capitalisation, but messier; folding
half-width katakana up to full-width is a typical example - which can
be done either in Unicode with character properties, or directly.
Sometimes they have to be done directly.
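
As a sketch of the Unicode route (assuming a Python with the
unicodedata module's normalize() available), that katakana folding is
roughly one compatibility-normalisation call:

    import unicodedata

    def tidy_japanese(text):
        # NFKC normalisation folds half-width katakana up to full-width
        # and folds full-width ASCII back down to ordinary ASCII.
        return unicodedata.normalize('NFKC', text)

so tidy_japanese(u'\uff83\uff9e\uff70\uff80') comes back as
u'\u30c7\u30fc\u30bf' ('data' written in katakana).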

That is why we poor double-byte people want to be able to take a look
at the API when it comes out, and maybe add a tweak or two - hopefully
in a separate layer over the top - and the right convenience functions
to make life easier.

Confused yet?  I could go on...  I will try to write up some decent
background documents over the course of this month.

By the way, if anyone has similar issues with other locales, let's
hear them!


- Andy