Unicode & mx.ODBC module
Scott David Daniels
Scott.Daniels at Acm.Org
Thu Mar 4 21:10:04 EST 2004
Chuck Bearden wrote:
> I think I'm still not entirely clear on when Unicode encoding &
> decoding happen in Python and for what reasons. In my searching on this
> problem I kept my eye open for a nice, systematic treatment of Unicode
> in Python, but I haven't found anything yet.
It is not really tough, but you need to understand some facts that you
won't want to believe.
1) When you use str objects you are normally dealing with _bytes_,
not _characters_. Just because your system can print them
doesn't mean someone else's system will print the same thing.
2) Unicode is a coding system for _characters_, not for binary values.
Especially if you wander into the stranger sections of Unicode, a
single character may take several positions in a unicode string.
3) Deciding if two unicode strings are _the_same_ is a question of
philosophy, and not just programming.
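
A small sketch of the three points above. It is written for current
Python 3, where the byte-string type is spelled bytes and the Unicode
type is spelled str (in the Python 2 of this thread they are str and
unicode). The sample values are made up for illustration only.

```python
import unicodedata

# 1) A byte string is only bytes; which character you get depends on
#    the encoding you assume when reading it.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")   # U+00E9, e with acute accent
as_cp437 = raw.decode("cp437")      # a different character entirely

# 2) One visible character may occupy several positions.
decomposed = "e\u0301"              # 'e' + COMBINING ACUTE ACCENT
positions = len(decomposed)         # two code points, one visible glyph

# 3) Whether two strings are "the same" depends on what you mean.
composed = "\u00e9"                 # precomposed e with acute accent
same_code_points = (composed == decomposed)
same_after_nfc = (unicodedata.normalize("NFC", decomposed) == composed)
```

Here same_code_points is False while same_after_nfc is True, which is
the sort of question point 3 is getting at.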
OK, with those caveats, you can pretend --
unicode(some_byte_string, encoding) produces a unicode string.
The byte string itself carries no encoding -- it is just a sequence
of bytes. The encoding is how you interpret those bytes to determine
the characters that the bytes stand for.
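
As a rough illustration of that call (a sketch, not the only way to
do it): in current Python 3 the same step is spelled
some_byte_string.decode(encoding) or str(some_byte_string, encoding).
The five sample bytes below are an assumption chosen for the example.

```python
# Five bytes; nothing in them says which encoding was used.
some_byte_string = b"caf\xc3\xa9"

# Python 2 spelling from the post: unicode(some_byte_string, "utf-8")
text = some_byte_string.decode("utf-8")

# The same bytes read under a different encoding give different
# characters, and no error tells you that you guessed wrong.
misread = some_byte_string.decode("latin-1")

# str(bytes, encoding) is an equivalent spelling of .decode(encoding).
also_text = str(some_byte_string, "utf-8")
```

Read as UTF-8 the bytes give four characters; read as Latin-1 they
give five, because each byte becomes one character.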
Unicode, on the other hand, is a _character_ encoding. In some
sense, you should expect the expression "unicode(s, enc)" to "mean"
the same thing on every computer that implements Python.
It really shouldn't matter what the bytes are in a unicode string,
just like it shouldn't matter what the characters are in a byte
string.
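
One way to see why the bytes behind a unicode string shouldn't matter,
sketched in current Python 3 (where the Unicode type is str): the same
characters can be written out as quite different byte sequences, and
both decode back to an equal string. The sample text is an arbitrary
choice for the example.

```python
text = "\u20ac5"                    # EURO SIGN followed by the digit 5

utf8_bytes = text.encode("utf-8")       # three bytes for the euro, one for "5"
utf16_bytes = text.encode("utf-16-le")  # two bytes for each character

# Different bytes on the wire...
bytes_differ = (utf8_bytes != utf16_bytes)

# ...but the same characters once each is decoded with its own encoding.
round_trip_equal = (
    utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text
)
```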
Please let me know whether this is:
A) obvious,
B) clear,
C) comprehensible with effort
D) gibberish
--
-Scott David Daniels
Scott.Daniels at Acm.Org