why isn't Unicode the default encoding?

"Martin v. Löwis" martin at v.loewis.de
Mon Mar 20 16:42:47 EST 2006


> I figured this might have something to do with it, but then again I 
> thought that Unicode was created as a subset of ASCII and Latin-1 so 
> that they would be compatible...but I guess it's never that easy. :)

The real problem is that the Python string type is used to represent
two very different concepts: bytes and characters. You can't just drop
the current Python string type and use the Unicode type instead; then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, an
MS Word file, a PDF file, and even an HTTP conversation are all
represented as byte sequences.
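
For instance (a minimal sketch, assuming Python 2.x; the file name
is hypothetical), such data lands in the byte-oriented str type:

    data = open('archive.zip', 'rb').read()  # read the raw bytes
    print type(data)      # <type 'str'> -- the byte string type
    print repr(data[:4])  # 'PK\x03\x04' -- the ZIP magic number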

So for a byte sequence, the internal representation matters: the bytes
themselves are the data. For a character string, it does not: the
internal encoding is an implementation detail. Now, for historical
reasons, Python string literals create byte strings, not character
strings. Since we cannot know whether a given string literal is meant
to denote bytes or characters, we cannot just change its
interpretation.
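
To illustrate (Python 2.x): a plain literal stays a byte string,
and only the u prefix selects the character string type:

    s = 'abc'              # a byte string: type str
    u = u'abc'             # a character string: type unicode
    print type(s), type(u) # <type 'str'> <type 'unicode'>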

Unicode is a superset of ASCII and Latin-1, but not of byte sequences.
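
A small sketch of that asymmetry (again Python 2.x): every Latin-1
byte decodes to exactly one Unicode character, but an arbitrary byte
sequence need not be valid text in any encoding:

    print repr('\xe9'.decode('latin-1'))  # u'\xe9', the character e-acute
    '\xff\xfe'.decode('utf-8')            # raises UnicodeDecodeError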

Regards,
Martin


