Unicode questions

Chris Rebert clp2 at rebertia.com
Tue Oct 19 17:09:32 EDT 2010


On Tue, Oct 19, 2010 at 1:31 PM, Tobiah <toby at rcsreg.com> wrote:
>> There is no such thing as "plain Unicode representation". The closest
>> thing would be an abstract sequence of Unicode codepoints (à la Python's
>> `unicode` type), but this is way too abstract to be used for
>> sharing/interchange, because storing anything in a file or sending it
>> over a network ultimately involves serialization to binary, which is not
>> directly defined for such an abstract representation (Indeed, this is
>> exactly what encodings are: mappings between abstract codepoints and
>> concrete binary; the problem is, there's more than one of them).
>
> Ok, so the encoding is just the binary representation scheme for
> a conceptual list of unicode points.  So why so many?  I get that
> someone might want big-endian, and I see the various virtues of
> the UTF strains, but why isn't a handful of these representations
> enough?  Languages may vary widely but as far as I know, computers
> really don't that much.  big/little endian is the only problem I
> can think of.  A byte is a byte.  So why so many encoding schemes?
> Do some provide advantages to certain human languages?

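UTF-8 has the virtue of being backward-compatible with ASCII: any
pure-ASCII text is already valid UTF-8, byte for byte. All other
codepoints take 2 to 4 bytes each.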
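
To make the ASCII compatibility concrete, here is a minimal sketch,
assuming Python 3 (where `str` is what `unicode` was in Python 2). The
sample characters are just illustrative choices:

```python
# ASCII-only text produces identical bytes under ASCII and UTF-8.
text = "hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Codepoints outside ASCII need more than one byte in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]  # A, e-acute, euro sign, G clef
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(same_bytes)    # True
print(utf8_lengths)  # [1, 2, 3, 4]
```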

UTF-16 encodes every codepoint in the Basic Multilingual Plane (BMP)
in exactly 2 bytes; all others take up 4 bytes. The Unicode people
originally thought they would only include modern scripts, so 2 bytes
(65,536 values) would be enough to encode all characters. However,
they later broadened their scope, and the complication of "surrogate
pairs" was introduced to reach codepoints beyond the BMP.
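
A rough sketch of what that looks like in practice, again assuming
Python 3. The helper name `surrogate_pair` is my own, not a standard
library function; it just spells out the arithmetic the codec performs
for codepoints above U+FFFF:

```python
def surrogate_pair(code_point):
    """Split a codepoint above U+FFFF into (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

# The "-be" variant is used so that no byte order mark is prepended.
bmp_utf16 = bmp_char.encode("utf-16-be")
astral_utf16 = astral_char.encode("utf-16-be")

print(len(bmp_utf16), len(astral_utf16))             # 2 4
print([hex(u) for u in surrogate_pair(0x1D11E)])     # ['0xd834', '0xdd1e']
```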

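UTF-32 has *all* Unicode codepoints take up exactly 4 bytes. This
slightly simplifies processing (the Nth codepoint always starts at
byte offset 4*N), but wastes a lot of space: plain English text, for
example, takes four times as many bytes as it does in UTF-8.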
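
To illustrate the trade-off, a small sketch assuming Python 3; the
helper `code_point_at` is a hypothetical name of mine that shows the
fixed-width indexing, not something from the standard library:

```python
def code_point_at(utf32_le_bytes, index):
    """Return the codepoint at position `index` in little-endian UTF-32 data."""
    chunk = utf32_le_bytes[4 * index : 4 * index + 4]
    return int.from_bytes(chunk, "little")

sample = "Hello, world"
# The "-le" variants are used so that no byte order mark is counted.
sizes = {codec: len(sample.encode(codec))
         for codec in ("utf-8", "utf-16-le", "utf-32-le")}

mixed = "a\U0001d11eb".encode("utf-32-le")

print(sizes)                         # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
print(hex(code_point_at(mixed, 1)))  # 0x1d11e
```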

And then there are a whole bunch of national and legacy encodings
(Latin-1, KOI8-R, Shift JIS, etc.) that predate Unicode and are kept
around for backward compatibility, but they typically only encode a
small portion of all the Unicode codepoints.
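
A quick way to see the "only a portion" problem, sketched under the
assumption of Python 3 and its bundled `latin-1` and `koi8-r` codecs.
`can_encode` is a helper name I made up for the example:

```python
def can_encode(text, codec):
    """Return True if every character in `text` exists in `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

french = "caf\u00e9"       # "cafe" with an acute accent on the e
russian = "\u0434\u0430"   # Russian "da" in Cyrillic

print(can_encode(french, "latin-1"))   # True
print(can_encode(russian, "latin-1"))  # False
print(can_encode(russian, "koi8-r"))   # True
print(can_encode(french, "utf-8"), can_encode(russian, "utf-8"))  # True True
```

The UTF encodings never fail this way for ordinary text, since they
cover every codepoint.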

More info: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Cheers,
Chris
--
Essentially, blame backward compatibility and finite storage space.
http://blog.rebertia.com


