[Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences?

Fredrik Lundh fredrik@pythonware.com
Fri, 28 Apr 2000 14:15:06 +0200


Christopher Petrilli wrote:
>
> Paul Prescod [paul@prescod.net] wrote:
> > > Even working with exotic languages, there is always a native
> > > 8-bit encoding.
> >
> > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use
> > 8-bit encodings of Unicode if you want.
>
> Um, if you go:
>
>     JIS -> Unicode -> JIS
>
> you don't get the same thing out that you put in (at least this is
> what I've been told by a lot of Japanese developers), and therefore
> it's not terribly popular because of the nature of the Japanese (and
> Chinese) language.
>
> My experience with Unicode is that a lot of Western people think it's
> the answer to every problem asked, while most Asian language people
> disagree vehemently.  This says the problem isn't solved yet, even if
> people wish to deny it.

this is partly true, partly caused by a confusion over what unicode
really is.  there are at least two issues involved here:

* the unicode character repertoire is not complete

unicode contains all characters from the basic JIS X character
sets (please correct me if I'm wrong), but it doesn't include all
characters in common use in Japan.

as far as I've understood, the missing characters are mostly ones
used in personal names and trade names.  however, different vendors
tend to use different sets, with different encodings, and there has
been no consensus on which to add, and how.

so in other words, if you're "transcoding" from one encoding to
another (when converting data, or printing or displaying on a
device assuming a different encoding), unicode isn't good enough
as the intermediate form: a character that has no unicode code
point cannot survive the round trip.

as MAL pointed out, you can work around this by using custom
codecs, mapping the vendor specific characters that you happen
to use to private regions in the unicode code space.  but afaik,
there is no standard way to do that at this time.
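
a minimal sketch of that workaround, in python.  everything specific
here is made up for illustration: the codec name "xvendordemo", the
vendor bytes 0x80 and 0x81, and the choice of U+E000 and U+E001 in the
private use area.  a real vendor character set would need a much larger
table, and both ends would have to agree on it privately.

```python
import codecs

# Hypothetical single-byte vendor encoding: plain ASCII plus two
# vendor-only characters at 0x80 and 0x81 that have no Unicode code
# point of their own.  They are parked in the Private Use Area; this
# assignment is a private agreement, not a standard.
DECODING_MAP = {byte: byte for byte in range(0x80)}
DECODING_MAP[0x80] = 0xE000
DECODING_MAP[0x81] = 0xE001

# Reverse table: code point -> byte.
ENCODING_MAP = {code_point: byte for byte, code_point in DECODING_MAP.items()}


def vendor_decode(data, errors="strict"):
    # Returns (text, number_of_bytes_consumed), as the codec API expects.
    return codecs.charmap_decode(data, errors, DECODING_MAP)


def vendor_encode(text, errors="strict"):
    # Returns (bytes, number_of_characters_consumed).
    return codecs.charmap_encode(text, errors, ENCODING_MAP)


def search_vendor_codec(name):
    # Codec search function: answer only for our made-up codec name.
    if name == "xvendordemo":
        return codecs.CodecInfo(vendor_encode, vendor_decode, name=name)
    return None


codecs.register(search_vendor_codec)

raw = b"ACME \x80\x81"
text = raw.decode("xvendordemo")
round_tripped = text.encode("xvendordemo")
```

with a table like this the vendor bytes survive a trip through unicode,
but only between programs that share the same private mapping.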

(this probably applies to other "CJK languages" too.  if anyone
could verify that, I'd be grateful).

* unicode is about characters, not languages

if you have a unicode string, you still don't know how to display
it.  the string tells you what characters to use, not what language
the text is written in.

and while using one standard "glyph" per unicode character works
pretty well for latin characters (no, it's not perfect, but it's not
much of a problem in real life), it doesn't work for asian languages.
you need extra language/locale information to pick the right glyph
for any given unicode character.

and the crux is that before unicode, this wasn't really a problem
-- if you knew the encoding, you knew what language to use.  when
using unicode, you need to put that information somewhere else
(in an XML attribute, for example).
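
a small sketch of the XML attribute idea, using python's standard
xml.etree.ElementTree module.  the element name "text" and the helper
tagged() are made up for this example; U+76F4 is used because it is
often cited as a character that is drawn differently in japanese and
chinese typography.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the xml:lang attribute with the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tagged(text, lang):
    # Hypothetical helper: wrap a string in an element that records the
    # language, since the string itself carries only characters.
    elem = ET.Element("text")
    elem.set(XML_LANG, lang)
    elem.text = text
    return elem


# The same code point, once as Japanese text and once as Chinese text.
ja = tagged("\u76f4", "ja")
zh = tagged("\u76f4", "zh")

same_characters = ja.text == zh.text
serialized_ja = ET.tostring(ja, encoding="unicode")
serialized_zh = ET.tostring(zh, encoding="unicode")
```

the two strings compare equal, so only the attribute can tell a
renderer which glyph variant to pick.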

* corrections and additions are welcome, of course.

</F>