Unicode questions

Tue Oct 19 16:14:25 EDT 2010

On Tue, Oct 19, 2010 at 12:02 PM, Tobiah <toby at rcsreg.com> wrote:
> I've been reading about the Unicode today.
> I'm only vaguely understanding what it is
> and how it works.

Petite Abeille already pointed to Joel's excellent primer on the
subject; I can only second their endorsement of his article.

> Please correct my understanding where it is lacking.
<snip>
> Now for the mysterious encodings.  There is the UTF-{8,16,32}
> which only seem to indicate what the binary representation
> of the unicode character points is going to be.  Then there
> are 100 or so other encodings, many of which are language
> specific.  ASCII encoding happens to be a 1-1 mapping up
> to 127, but then there are others for various languages etc.
> I was thinking maybe this special case and the others were lookup
> mappings, where a particular language user could work with
> characters perhaps in the range of 0-255 like we do for ASCII,
> but then when decoding, to share with others, the plain unicode
> representation would be shared?

There is no such thing as "plain Unicode representation". The closest
thing would be an abstract sequence of Unicode codepoints (à la
Python's `unicode` type), but this is way too abstract to be used for
sharing/interchange: storing anything in a file or sending it over a
network ultimately involves serialization to binary, which is not
directly defined for such an abstract representation. Indeed, this is
exactly what encodings are: mappings between abstract codepoints and
concrete binary. The problem is, there's more than one of them.
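
To make the codepoints-versus-bytes distinction concrete, here is a
minimal sketch. It assumes Python 3, where `str` plays the role of the
`unicode` type mentioned above; the sample string is only an
illustration.

```python
# Python 3 sketch: `str` here corresponds to Python 2's `unicode` type.
text = "caf\u00e9"  # four codepoints: c, a, f, U+00E9

# The abstract view: a sequence of codepoints, with no bytes involved yet.
codepoints = [ord(ch) for ch in text]

# Three concrete serializations of those same four codepoints.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
as_utf32 = text.encode("utf-32-le")

print(codepoints)     # [99, 97, 102, 233]
print(as_utf8)        # b'caf\xc3\xa9'
print(len(as_utf8), len(as_utf16), len(as_utf32))  # 5 8 16
```

Same text, three different byte strings; none of them is more "plain"
than the others.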

Python's `unicode` type (and analogous types in other languages) is a
nice abstraction, but at the C level it's actually using some
(implementation-defined, IIRC) encoding to represent itself in memory;
and so when you leave Python, you also leave this implicit, hidden
choice of encoding behind and must instead be quite explicit.
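
As a rough illustration of being explicit at the boundary, the sketch
below writes text to a file and reads it back, naming the encoding both
times. It assumes Python 3 (`str` standing in for `unicode`); the
scratch file and sample string are made up for the example.

```python
import os
import tempfile

text = "na\u00efve \u20ac5"  # contains U+00EF and U+20AC (euro sign)

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Leaving Python: the encoding is stated, not left to a default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # On disk there are only bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Coming back: the same encoding must be named to recover the text.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(raw)                    # b'na\xc3\xafve \xe2\x82\xac5'
print(round_tripped == text)  # True
```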

>  Why can't we just say "unicode is unicode"
> and just share files the way ASCII users do.

Because just "Unicode" itself is not a scheme for encoding characters
as a stream of binary. Unicode /does/ define several encodings, and these
encodings are such schemes; /but/ none of them is *THE* One True
Unambiguous Canonical "Unicode" encoding scheme. Hence, one must be
specific and specify "UTF-8", or "UTF-32", or whatever.
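
A small sketch of why the name matters, again assuming Python 3: the
same two bytes (chosen arbitrarily for the example) decode to different
text depending on which encoding the reader assumes.

```python
raw = b"\xc3\xa9"  # two bytes from "somewhere", with no encoding stated

as_utf8 = raw.decode("utf-8")         # one codepoint: U+00E9
as_latin1 = raw.decode("latin-1")     # two codepoints: U+00C3, U+00A9
as_utf16 = raw.decode("utf-16-le")    # one codepoint: U+A9C3

for label, s in [("utf-8", as_utf8), ("latin-1", as_latin1), ("utf-16-le", as_utf16)]:
    # Show each interpretation as a list of codepoints.
    print(label, ["U+%04X" % ord(ch) for ch in s])
```

All three decodes succeed, so nothing in the bytes themselves says which
reading was intended.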

Cheers,
Chris
--
http://blog.rebertia.com