"%s" vs unicode

Thu Jan 9 05:51:24 EST 2003

Gerd Woetzel <woetzel at gmd.de> writes:

> Unfortunately the "general principle" is wrong.
> There is a canonical embedding of Unicode strings into byte strings (which
> is UTF-8) but no canonical embedding of byte strings into Unicode strings.
> Hence it should be vice versa.

In the original Unicode proposal, there was no notion of a settable
default encoding (and this feature is still experimental); the default
encoding, at that time, was UTF-8.

Then people requested that conversion between byte strings and Unicode
strings should be able to use other encodings, and it was pointed out
that UTF-8 might be confusing for existing applications. So the default
encoding is now administrator-settable, and defaults to ASCII.

With ASCII being the default encoding, there is *no* canonical
embedding of Unicode strings into byte strings: some Unicode strings
("most") cannot be converted to a byte string automatically.
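
To illustrate, here is a minimal sketch of that failure mode. It assumes
a current Python, where the conversion has to be spelled out with
.encode(), so the explicit "ascii" codec stands in for the implicit
default encoding discussed here. The helper name to_bytes_default is
made up for this example.

```python
# Sketch only: an explicit codec stands in for the implicit default encoding.
# to_bytes_default is a hypothetical helper, not part of any library.

def to_bytes_default(text, encoding="ascii"):
    """Return text encoded as bytes, or None if the codec cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# A pure-ASCII Unicode string converts without trouble.
ascii_only = to_bytes_default("abc")

# A string with a non-ASCII character has no ASCII representation.
non_ascii = to_bytes_default("L\u00f6wis")

# With UTF-8 as the encoding, every Unicode string can be converted.
as_utf8 = to_bytes_default("L\u00f6wis", "utf-8")
```

Here ascii_only is a byte string, non_ascii is None, and as_utf8 is the
UTF-8 encoded form of the same text.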

The same is of course true in the other direction: not all byte strings
can be converted to Unicode strings. For practical purposes,
converting byte strings works "more often", since in many cases the
byte string being combined with a Unicode string is an ASCII string
(literal).
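
A rough sketch of this direction, under the same assumption as before:
a current Python with an explicit "ascii" decode standing in for the
implicit conversion. The names to_text_default and combine are
hypothetical and exist only for this example.

```python
# Sketch only: decode explicitly with ASCII to mimic the implicit conversion.
# to_text_default and combine are hypothetical helpers.

def to_text_default(data, encoding="ascii"):
    """Return data decoded to a Unicode string, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

def combine(data, text):
    """Concatenate a byte string and a Unicode string by converting the bytes."""
    converted = to_text_default(data)
    if converted is None:
        raise ValueError("byte string is not ASCII; name an encoding explicitly")
    return converted + text

# The common case: an ASCII literal combined with arbitrary Unicode text.
greeting = combine(b"Dear ", "Martin v. L\u00f6wis")

# The failing case: a byte above 0x7F cannot be interpreted as ASCII.
undecodable = to_text_default(b"L\xf6wis")
```

The first call succeeds because the byte string is a plain ASCII
literal; the second yields None because the byte 0xF6 has no meaning in
ASCII, and guessing an encoding such as Latin-1 would be exactly the
non-canonical choice discussed above.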

Regards,
Martin