Unicode & mx.ODBC module

Thu Mar 4 21:10:04 EST 2004

Chuck Bearden wrote:
> I think I'm still not entirely clear on when Unicode encoding & 
> decoding happen in Python and for what reasons.  In my searching on this
> problem I kept my eye open for a nice, systematic treatment of Unicode
> in Python, but I haven't found anything yet.
It is not really tough, but you need to understand some facts that you
won't want to believe.

1) When you use a plain str, you are normally dealing with _bytes_,
    not _characters_.  Just because your system can print them
    doesn't mean someone else's system will print the same thing.

2) Unicode is a coding system for _characters_, not binary values.
    Especially if you wander into the stranger sections of Unicode
    (combining accents, for example), a single character may take
    several positions in a unicode string.

3) Deciding if two unicode strings are _the_same_ is a question of
    philosophy, and not just programming.
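
A minimal sketch of the three points above.  The sample strings and
variable names are my own illustrations, and I have tried to keep the
code to calls that should behave the same on Python 2.6+ and Python 3.3+:

```python
import unicodedata

# Point 1: a byte string is a sequence of bytes, not characters.
raw = b'caf\xc3\xa9'              # UTF-8 bytes for "cafe" with e-acute
text = raw.decode('utf-8')        # now a unicode string
byte_count = len(raw)             # counts bytes
char_count = len(text)            # counts positions in the unicode string

# Point 2: one visible character can occupy several positions.
composed = u'\u00e9'              # e-acute as a single code point
decomposed = u'e\u0301'           # 'e' followed by COMBINING ACUTE ACCENT
positions = (len(composed), len(decomposed))

# Point 3: "the same" depends on what you decide sameness means.
equal_as_stored = (composed == decomposed)
equal_after_nfc = (unicodedata.normalize('NFC', composed) ==
                   unicodedata.normalize('NFC', decomposed))
```

Both spellings of the accented "e" look identical when printed, yet
they compare unequal until you normalize them, which is the
"philosophy" part: you have to choose which notion of equality you want.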

OK, with those caveats, you can pretend the following:
     unicode(some_byte_string, encoding) produces a unicode string.
     The byte string itself has no encoding; it is just a sequence of
     bytes.  The encoding is how you interpret those bytes to determine
     the characters that they stand for.
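
A short sketch of "the byte string has no coding": the same single
byte becomes a different character depending on the encoding you claim
for it.  I use the .decode() method here, which does the same job as
unicode(some_byte_string, encoding) and also exists on Python 3, where
the unicode builtin is gone.  The choice of byte and codecs is just an
illustration:

```python
some_byte_string = b'\xe9'        # one byte, no inherent meaning

# Interpreted as Latin-1, the byte is LATIN SMALL LETTER E WITH ACUTE.
as_latin1 = some_byte_string.decode('latin-1')

# Interpreted as the old IBM PC code page 437, it is a Greek capital theta.
as_cp437 = some_byte_string.decode('cp437')

# Interpreted as UTF-8, it is not a valid character at all.
try:
    some_byte_string.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Nothing in the byte string tells you which of these readings is right;
that knowledge has to come from wherever the bytes came from.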

     Unicode, on the other hand, is a _character_ encoding.  In some
     sense, you should expect the expression "unicode(s, enc)" to
     "mean" the same thing on every computer that implements Python.

     It really shouldn't matter what bytes are used to store a unicode
     string, just like it shouldn't matter what the characters are in
     a byte string.
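
To illustrate that last point, here is a sketch (the sample word and
the three codecs are my own picks) in which one unicode string is
written out as three different byte sequences, and each one decodes
back to the same characters:

```python
text = u'na\u00efve'              # "naive" with i-diaeresis

# Three different byte-level representations of the same characters.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')
utf16_bytes = text.encode('utf-16-le')

# Decoding each with the matching codec recovers the same unicode string.
round_trips = [
    utf8_bytes.decode('utf-8'),
    latin1_bytes.decode('latin-1'),
    utf16_bytes.decode('utf-16-le'),
]
all_same_text = all(t == text for t in round_trips)
all_different_bytes = len(set([utf8_bytes, latin1_bytes, utf16_bytes])) == 3
```

The characters are the constant; the bytes are a detail that only
matters at the moment you read from or write to the outside world.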

Please let me know whether this is:
   A) obvious,
   B) clear,
   C) comprehensible with effort, or
   D) gibberish.

-- 
-Scott David Daniels
Scott.Daniels at Acm.Org