unicode(s, enc).encode(enc) == s ?

"Martin v. Löwis" martin at v.loewis.de
Fri Dec 28 21:09:47 EST 2007


> Wow, it's not easy to see why anyone would ever want that. Is there
> any logic behind this?

It's the pre-Unicode solution to the "we want to have many characters
encoded in a single file" problem.

Suppose you have pre-defined character sets A, B, and C, and you want
text to contain characters from all three sets. One possible encoding is

<switch-to-A>CharactersInA<switch-to-B>CharactersFromB<and-so-on>

Now also suppose that A, B, and C are not completely different, but
have slight overlap - and you get ambiguous encodings.

ISO-2022 works that way. IPSJ maintains a registry of character
sets for ISO, and assigns escape codes to them. There are currently
about 200 character sets registered.

Somebody decoding this would have to know all the character sets
(remember it's a growing registry), hence iso-2022-jp restricts
the character sets that you can use for that particular encoding.
(Likewise, iso-2022-kr also restricts it, but to a different set
of sets).
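The ambiguity is easy to demonstrate with Python's iso-2022-jp codec:
two different byte sequences decode to the same text, so the
decode/encode round trip from the subject line cannot restore the
original bytes. A sketch in Python 3 syntax (the thread uses Python 2's
unicode()):

```python
# Two distinct ISO-2022-JP byte strings: ESC ( B designates the USA
# variant of ISO 646, ESC ( J the Japanese (roman) variant.
usa = b"\x1b(BHallo"
jp = b"\x1b(JHallo"

# Both decode to the same Unicode string ...
assert usa.decode("iso-2022-jp") == jp.decode("iso-2022-jp") == "Hallo"

# ... so decoding and re-encoding cannot round-trip the bytes: the
# codec re-encodes plain ASCII without emitting any escape sequence.
assert usa.decode("iso-2022-jp").encode("iso-2022-jp") == b"Hallo"
```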

It's a mess, sure, and it was one of the primary driving forces behind
Unicode (which even has the unification - i.e. lack of ambiguity - in
its name).

> In your samples both of unicode("\x1b(BHallo","iso-2022-jp") and
> unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo" -- does this mean
> that the ignored/lost bytes in the original strings are not illegal
> but *represent nothing* in this encoding?

See above, and Marc's explanation.
ESC ( B switches to "ISO 646, USA Version X3.4 - 1968";
ESC ( J to "ISO 646, Japanese Version for Roman Characters JIS C6220-1969"

These are identical, except for the following differences:
- The USA version has "reverse solidus" at 5/12; the Japanese
  version has "Yen sign".
- The USA version has "Tilde (overline; general accent)" at
  7/14 (depicted as a tilde); the Japanese version has "Overline"
  (depicted as a straight overline).
- The Japanese version specifies that you can switch between roman
  and katakana mode: shift-out (SO, '\x0e') switches to the
  "JIS KATAKANA character set", and shift-in (SI, '\x0f') switches
  back to roman.
(source:
http://www.itscj.ipsj.or.jp/ISO-IR/006.pdf
http://www.itscj.ipsj.or.jp/ISO-IR/014.pdf
)

> I.e. in practice (in a context limited to the encoding in question)
> should this be considered as a data loss, or should these strings be
> considered "equivalent"?

These particular differences should be considered irrelevant. There
are some cases where Unicode has introduced particular compatibility
characters to accommodate such encodings (specifically, the "full-width"
latin (*) and "half-width" Japanese characters). Good codecs are
supposed to round-trip the relevant differences to Unicode, and generate
the appropriate compatibility characters.
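For instance, the half-width katakana do round-trip through codecs that
support them. An illustrative sketch (using shift_jis rather than
iso-2022-jp, since the basic iso-2022-jp codec does not cover the
katakana set):

```python
# U+FF71 HALFWIDTH KATAKANA LETTER A is a Unicode compatibility
# character introduced precisely so that legacy single-byte
# katakana can round-trip.
halfwidth_a = "\uFF71"  # 'ｱ'

# In Shift_JIS, half-width katakana occupy the single bytes
# 0xA1-0xDF; the codec round-trips them losslessly.
encoded = halfwidth_a.encode("shift_jis")
assert encoded == b"\xb1"
assert encoded.decode("shift_jis") == halfwidth_a

# The "ordinary" katakana A (U+30A2) encodes differently, so the
# distinction is preserved rather than lost.
assert "\u30A2".encode("shift_jis") != encoded
```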

Bad codecs might not, and in some cases, users might complain that
certain compatibility characters are lacking in Unicode so that correct
round-tripping is not possible. I believe the Unicode consortium has
resolved all these complaints by adding the missing characters; but
I'm not sure.

Regards,
Martin

(*) As an example for full-width characters, consider these
two strings:
Ｈｅｌｌｏ
Hello
Should they be equivalent, or not? They are under NFKD, but not
under NFD.
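The difference is easy to check with unicodedata (a sketch; NFKD
applies compatibility decompositions, NFD does not):

```python
import unicodedata

fullwidth = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F"  # full-width "Hello"
plain = "Hello"                               # ordinary ASCII

# NFD applies only canonical decomposition: the full-width letters
# stay full-width, so the strings remain distinct.
assert unicodedata.normalize("NFD", fullwidth) != unicodedata.normalize("NFD", plain)

# NFKD also applies compatibility decomposition, which maps the
# full-width compatibility characters to their ASCII equivalents.
assert unicodedata.normalize("NFKD", fullwidth) == unicodedata.normalize("NFKD", plain) == "Hello"
```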
