[Python-3000] Character Set Independence

Wed Sep 13 06:10:38 CEST 2006

Paul Prescod wrote:
> I think that the gist of it is that Unicode will be "just one character
> set" supported by Ruby. This idea has been kicked around for Python
> before but you quickly run into questions about how you compare
> character strings from multiple character sets, to say nothing of the
> complexity of a character encoding and character set agnostic regular
> expression engine.

As Guido says, the arguments for "CSI (character set independence)"
are hardly convincing. Yes, there are cases where Unicode doesn't
"round-trip", but they are so obscure that they (IMO) can be ignored
safely.

There are two problems in this respect with Unicode:
- in some cases, a character set may contain characters that are
  not included in Unicode. This was a serious problem for Chinese
  for quite some time, but I believe this is now fixed with the
  plane-2 additions. If just round-tripping is
  the goal, then it is always possible for a codec to map characters
  to the private-use areas of Unicode. This is not optimal,
  since a different codec may give a different meaning to the
  same PUA characters, but there should rarely be a need to
  use them in the first place.

- in some cases, the input encoding has multiple representations
  for what becomes the same character in Unicode. For example,
  in ISO-2022-JP, there are three ways to encode the Latin
  letters (either in ASCII, or in the romaji part of
  either JIS X 0208-1978 or JIS X 0208-1983). You can switch
  between these in a single string; if you go back and forth
  through Unicode, you do not get the original bytes back but
  the normalized version that .encode gives you. While I have
  seen people bring it up now and then, I don't recall anybody
  claiming that this
  is a real, practical problem.
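
A minimal Python sketch of both cases, under some loudly stated
assumptions: the "pua-escape" error handler and PUA_BASE are
hypothetical names, ASCII stands in for a legacy character set, and
the byte 0xFF plays the character that Unicode lacks. The second
case uses the two JIS X 0208 designations of one kanji rather than
Latin letters, since that is easy to show with the stock iso2022_jp
codec.

```python
import codecs

PUA_BASE = 0xE000  # first code point of the BMP private-use area


def pua_escape(exc):
    """Hypothetical error handler: park unmappable bytes in the PUA, and back."""
    if isinstance(exc, UnicodeDecodeError):
        raw = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in raw), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            offset = ord(ch) - PUA_BASE
            if not 0 <= offset <= 0xFF:
                raise exc  # not one of our parked bytes
            out.append(offset)
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua-escape", pua_escape)

# Case 1: ASCII stands in for a legacy codec; 0xFF plays the unmapped character.
legacy = b"abc\xff"
parked = legacy.decode("ascii", errors="pua-escape")
restored = parked.encode("ascii", errors="pua-escape")

# Case 2: the same kanji, designated once as JIS X 0208-1978 (ESC $ @)
# and once as JIS X 0208-1983 (ESC $ B).
raw_1978 = b"\x1b$@0!\x1b(B"
raw_1983 = b"\x1b$B0!\x1b(B"
text_1978 = raw_1978.decode("iso2022_jp")
text_1983 = raw_1983.decode("iso2022_jp")
reencoded = text_1978.encode("iso2022_jp")
```

In the first case, restored equals legacy, so the round trip is
lossless as long as both directions agree on the PUA assignment. In
the second, both inputs decode to the same string, and reencoded
differs from raw_1978: the original escape sequence is normalized
away.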

There is a third problem that people often associate with
Unicode: due to the Han unification, you don't know whether
a certain Han character originates from Chinese, Japanese,
or Korean. This is a problem when rendering Unicode: you
don't know what glyphs to use (as you should use different
glyphs depending on the natural language). With CSI, you
can use a "language-aware encoding": you use a Japanese
encoding for Japanese text, and so on, then use the encoding
to determine what the language is.
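
As a rough illustration of that idea in Python: the table below and
the function name language_from_encoding are hypothetical, and the
choice of encodings is only an assumption about what a CSI system
might treat as language-specific.

```python
import codecs

# Hypothetical table: which natural language each legacy encoding implies.
# Keys go through codecs.lookup() so that aliases resolve to the same entry.
_ENCODING_LANGUAGE = {
    codecs.lookup(name).name: lang
    for name, lang in [
        ("iso-2022-jp", "ja"),
        ("euc-jp", "ja"),
        ("shift_jis", "ja"),
        ("gb2312", "zh"),
        ("big5", "zh"),
        ("euc-kr", "ko"),
    ]
}


def language_from_encoding(encoding):
    """Return the language implied by a language-specific encoding, else None."""
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        return None  # unknown codec name
    return _ENCODING_LANGUAGE.get(canonical)
```

A language-neutral encoding such as UTF-8 yields None here, which
is exactly the information that is lost once everything is Unicode.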

For Unicode, there are several ways to deal with it:
- you could carry language information along with the
  original text. This is what is commonly done on the
  web: you put language information into the HTML,
  and then use that to render the text correctly.
- you could embed language information into the Unicode
  string, using the plane-14 tag characters. This
  should work fairly nicely, since you only need
  a single piece of information, but has some drawbacks:
  * you need four-byte Unicode, or surrogates
  * if you slice such a string, the slices won't
    carry the language tag
  * applications today typically don't know how to
    deal with tag characters
- you could guess the language from the content, based
  on the frequency of characters (e.g. presence
  of katakana/hiragana would indicate that it is
  Japanese). As with all guessing, there are
  cases where it fails. I believe that web browsers
  commonly apply that approach, anyway.
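
The second and third options can be sketched in Python as follows.
The function names are hypothetical, the tag format is the plane-14
one (U+E0001 followed by tag characters that mirror ASCII at
U+E0020..U+E007E), and the guesser is deliberately naive: it only
checks for the presence of kana, nothing more.

```python
LANGUAGE_TAG = "\U000E0001"  # plane-14 LANGUAGE TAG; needs a surrogate pair in UTF-16
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def add_language_tag(text, lang):
    """Prefix text with a plane-14 language tag such as 'ja' (ASCII only)."""
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang) + text


def read_language_tag(text):
    """Return the language from a leading plane-14 tag, or None if absent."""
    if not text.startswith(LANGUAGE_TAG):
        return None
    chars = []
    for ch in text[1:]:
        cp = ord(ch)
        if not 0xE0020 <= cp <= 0xE007E:
            break  # end of the tag, start of the real text
        chars.append(chr(cp - TAG_OFFSET))
    return "".join(chars) or None


def guess_japanese(text):
    """Naive guess: any hiragana or katakana means Japanese, else unknown."""
    for ch in text:
        if 0x3040 <= ord(ch) <= 0x30FF:  # Hiragana and Katakana blocks
            return "ja"
    return None


tagged = add_language_tag("漢字", "ja")
tail = tagged[3:]  # slicing past the three tag code points drops the language
```

Here read_language_tag(tagged) gives "ja", while
read_language_tag(tail) gives None, which is the slicing drawback
mentioned above. The guesser fails in the same way on pure-kanji
text: guess_japanese("漢字") returns None even if the text is
Japanese.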

Regards,
Martin