why isn't Unicode the default encoding?

and-google at doxdesk.com
Mon Mar 20 17:11:54 EST 2006


John Salerno wrote:

> So as it turns out, Unicode and UTF-8 are not the same thing?

Right, they are not the same thing. Unicode is the character
repertoire itself; UTF-8 is one scheme in which the whole Unicode
character repertoire can be represented as bytes.

Confusion arises because Windows uses the name 'Unicode' in character
encoding lists, to mean UTF-16_LE, which is another encoding that can
store the whole Unicode character repertoire as bytes. However,
UTF-16_LE is no more definitively 'Unicode' than UTF-8 is.
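
A quick sketch of the point, assuming a Python 3 interpreter (where str
holds Unicode text and encode() returns a byte string); the sample text
is only an illustration:

```python
# One piece of Unicode text, two different byte representations.
text = "h\u00e9llo"  # five Unicode characters, one of them non-ASCII

utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")

print(utf8_bytes)     # b'h\xc3\xa9llo'
print(utf16le_bytes)  # b'h\x00\xe9\x00l\x00l\x00o\x00'

# Both byte strings decode back to exactly the same text, so neither
# encoding has a better claim to "being" Unicode than the other.
assert utf8_bytes.decode("utf-8") == text
assert utf16le_bytes.decode("utf-16-le") == text
```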

Further confusion arises because the encoding 'UTF-16' can actually
mean two things that are deceptively different:

  - Unicode characters stored natively in 16-bit units (using two
16-bit code units, a surrogate pair, to represent each character
outside of the Basic Multilingual Plane)

  - Either of the 8-bit encodings UTF-16_LE and UTF-16_BE, detected
automatically using a Byte Order Mark when loaded, or chosen
arbitrarily when saving
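
Both meanings can be seen in a short sketch, assuming Python 3 and its
standard 'utf-16', 'utf-16-le' and 'utf-16-be' codecs; the sample
character is an arbitrary one from outside the Basic Multilingual Plane:

```python
import codecs

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Meaning 1: 16-bit units. The single character needs two of them
# (a surrogate pair), whichever byte order is used to write them out.
le = ch.encode("utf-16-le")
be = ch.encode("utf-16-be")
units = len(le) // 2
print(units)  # 2
print(le)     # b'4\xd8\x1e\xdd'
print(be)     # b'\xd84\xdd\x1e'

# Meaning 2: the plain 'utf-16' codec. On saving it picks a byte order
# (the platform's native one) and records it with a Byte Order Mark.
with_bom = ch.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# On loading, the BOM tells the codec which of the two orders to use.
assert (codecs.BOM_UTF16_LE + le).decode("utf-16") == ch
assert (codecs.BOM_UTF16_BE + be).decode("utf-16") == ch
```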

Yet more confusion arises because UTF-32 (which can reference any
Unicode character directly) has the same problem. And though
wide-unicode builds of Python understand the first meaning (unicode()
strings are stored natively as UTF-32), they don't support the 8-bit
encodings UTF-32_LE and UTF-32_BE. Phew!

To summarise: confusion.

> Am I right to say that UTF-8 stores the first 128 Unicode code points
> in a single byte, and then stores higher code points in however many
> bytes they may need?

That is correct.
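
For the curious, "however many bytes they may need" works out to
between two and four. A minimal check, assuming Python 3; utf8_len is
just a helper name made up for this sketch:

```python
def utf8_len(code_point):
    """Return how many bytes UTF-8 uses for one Unicode code point."""
    return len(chr(code_point).encode("utf-8"))

# Code points on either side of each length boundary.
samples = (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF)
lengths = [utf8_len(cp) for cp in samples]

for cp, n in zip(samples, lengths):
    print("U+%04X -> %d byte(s)" % (cp, n))

# The first 128 code points (U+0000 to U+007F) take one byte each;
# everything above takes two, three or four.
assert lengths == [1, 1, 2, 2, 3, 3, 4, 4]
```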

To answer the original question, we're always going to need byte
strings. They're a fundamental part of computing and the need to
process them isn't going to go away. However, as Unicode text
manipulation becomes more common than byte string processing, it
makes sense to change the default kind of string you get when you
type a literal.

Personally I would like to see byte strings available under an easy
syntax like b'...' and UTF-32 strings available as w'...', or something
like that - currently having u'...' mean either UTF-16 or UTF-32
depending on compile-time options is very very annoying to the few
kinds of programs that really do need to know the difference. But
whatever is chosen, it's all tasty Python 3000 future-soup and not
worth worrying about for the moment.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/



