why isn't Unicode the default encoding?
and-google at doxdesk.com
Mon Mar 20 17:11:54 EST 2006
John Salerno wrote:
> So as it turns out, Unicode and UTF-8 are not the same thing?
Well yes. UTF-8 is one scheme in which the whole Unicode character
repertoire can be represented as bytes.
Confusion arises because Windows uses the name 'Unicode' in character
encoding lists to mean UTF-16_LE, which is another encoding that can
store the whole Unicode character repertoire as bytes. However,
UTF-16_LE is no more definitively 'Unicode' than UTF-8 is.
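As a rough illustration of the point above, here is a minimal sketch in present-day Python 3 syntax (which postdates this message). The sample string is made up for the example; 'utf-8' and 'utf-16-le' are Python's codec names for the two encodings.

```python
# Illustrative sample string (not from the original message).
text = "h\u00e9llo"  # U+00E9 (e with acute accent) is outside ASCII

# The same Unicode text, serialised to bytes in two different ways.
utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")

print(utf8_bytes)     # b'h\xc3\xa9llo'
print(utf16le_bytes)  # b'h\x00\xe9\x00l\x00l\x00o\x00'

# Neither byte form is "the" Unicode; both decode back to the same text.
same_text = utf8_bytes.decode("utf-8") == utf16le_bytes.decode("utf-16-le")
print(same_text)      # True
```

The byte sequences differ in both content and length, yet both are complete representations of the same Unicode string.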
Further confusion arises because the encoding 'UTF-16' can actually
mean two things that are deceptively different:
- Unicode characters stored natively in 16-bit units (using two
  16-bit code units, a surrogate pair, to represent each character
  outside the Basic Multilingual Plane)
- Either of the byte-oriented encodings UTF-16_LE and UTF-16_BE,
  detected automatically using a Byte Order Mark when loaded, or
  chosen arbitrarily when saving
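The two meanings above can be sketched roughly as follows, again in Python 3 syntax that postdates this message. It assumes Python's codec names 'utf-16', 'utf-16-le' and 'utf-16-be'; the sample characters are chosen only for illustration.

```python
import codecs

text = "A"

# Explicit byte orders: no Byte Order Mark is written.
le = text.encode("utf-16-le")    # b'A\x00'
be = text.encode("utf-16-be")    # b'\x00A'

# Generic 'utf-16': a BOM is written, followed by the platform's byte order.
generic = text.encode("utf-16")
starts_with_bom = generic.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# When decoding as generic 'utf-16', the BOM selects the byte order.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

# A character outside the Basic Multilingual Plane takes two 16-bit units.
clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF
clef_hex = clef.encode("utf-16-be").hex()  # 'd834dd1e', i.e. D834 + DD1E

print(le, be, starts_with_bom, from_le, from_be, clef_hex)
```

Which byte order the generic codec picks when saving depends on the machine, which is the "chosen arbitrarily" part.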
Yet more confusion arises because UTF-32 (which can reference any
Unicode character directly) has the same problem. And though
wide-unicode builds of Python understand the first meaning (unicode()
strings are stored natively as UTF-32), they don't support the 8-bit
encodings UTF-32_LE and UTF-32_BE. Phew!
To summarise: confusion.
> Am I right to say that UTF-8 stores the first 128 Unicode code points
> in a single byte, and then stores higher code points in however many
> bytes they may need?
That is correct.
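A quick way to see this for yourself, sketched in Python 3 syntax (which postdates this message). The code points are picked to sit on either side of each length boundary.

```python
# Code points on either side of each UTF-8 length boundary.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

# Number of bytes UTF-8 uses for each of those code points.
lengths = [len(chr(cp).encode("utf-8")) for cp in boundaries]

for cp, n in zip(boundaries, lengths):
    print("U+%04X -> %d byte(s)" % (cp, n))

# The first 128 code points are byte-for-byte identical to ASCII.
ascii_compatible = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128)
)
print(lengths)           # [1, 1, 2, 2, 3, 3, 4, 4]
print(ascii_compatible)  # True
```

So code points up to U+007F take one byte, and everything above that takes two, three or four bytes depending on how large the code point is.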
To answer the original question, we're always going to need byte
strings. They're a fundamental part of computing and the need to
process them isn't going to go away. However, as Unicode text
manipulation becomes more common than byte string processing,
it makes sense to change the default kind of string you get when you
type a literal.
Personally I would like to see byte strings available under an easy
syntax like b'...' and UTF-32 strings available as w'...', or something
like that - currently having u'...' mean either UTF-16 or UTF-32
depending on compile-time options is very very annoying to the few
kinds of programs that really do need to know the difference. But
whatever is chosen, it's all tasty Python 3000 future-soup and not
worth worrying about for the moment.
--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/
More information about the Python-list
mailing list