Unicode is driving me nuts!

Sat Mar 13 11:19:16 EST 2004

    Anthony> Thank you, Skip.  You know what, I guess I'll give up using
    Anthony> unicode, as you also mentioned you used to have headache with
    Anthony> it.

    Anthony> I'll probably just read by bytes and check if the byte is a
    Anthony> Chinese character.  If it is, read 2 bytes instead.  What do
    Anthony> you think?  This way, I will hopefully not have a lot of
    Anthony> unreadable characters.

I should have been clearer.  I wasn't suggesting that you not use Unicode.
That way lies madness.  Don't give up.  Unicode isn't intractable.  It is a
somewhat different way of thinking about handling text.  In my case I
already had a large set of Python programs and a MySQL database which knew
nothing about Unicode so converting them was more difficult than if I'd
started from scratch with Unicode (which wasn't available in Python when I
began Musi-Cal in 1995).  I tried a couple simple hacks which didn't work
(different than, but with the same shortcut idea in my head as your "read a
byte, sniff for a Chinese character" idea).  Once I bit the bullet and
converted to Unicode (and the utf-8 encoding when I needed raw bytes) it
wasn't as hard as I'd expected.  I think Python is still a little
schizophrenic in some regards, returning strings when the content is pure
ASCII and returning Unicode objects otherwise.  In some places you need to
test.  In the future I hope we see more of a string/Unicode convergence with
a separate byte object for character data that doesn't represent some kind
of text (like the string representation of code objects).
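
To make the byte-sniffing pitfall concrete, here is a minimal sketch.  It
assumes Python 3 syntax (str is Unicode text, bytes is raw data), and
char_byte_lengths is a made-up helper name, not a library function.  It
shows that the number of bytes behind one Chinese character depends on the
encoding, so "read 2 bytes" only holds for some encodings:

```python
def char_byte_lengths(text, encoding):
    """Return (character, encoded byte count) pairs for text."""
    return [(ch, len(ch.encode(encoding))) for ch in text]

raw = "a中".encode("utf-8")   # sample bytes, as if read from a file
text = raw.decode("utf-8")    # decode once, at the input boundary

# Work with characters from here on; count bytes only to illustrate.
print(char_byte_lengths(text, "utf-8"))
print(char_byte_lengths(text, "gbk"))
```

The same character takes three bytes in utf-8 and two in gbk.  Once the
bytes are decoded, a loop over the text sees whole characters either way.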

The two biggest problems which remain for me are:

    * I'm still using a Perl/Mason web front-end which doesn't do Unicode
      right

    * Web form submissions lack consistent encoding information

I still have to sniff those inputs to guess at the encoding.  This is more
frequent than you might think even for a website <http://www.musi-cal.com>
which has a largely US/Canadian user base.
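
The kind of sniffing I mean can be sketched roughly like this.  This is not
the code running on the site, just an illustration in Python 3 syntax; the
guess_decode name and the candidate list are assumptions you would tune for
your own users:

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Try strict decodes in order; return (text, encoding used)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail,
    # though the result may be wrong for the data.
    return raw.decode("latin-1"), "latin-1"

print(guess_decode(b"caf\xc3\xa9"))   # bytes that are valid utf-8
print(guess_decode(b"caf\xe9"))       # bytes that are not valid utf-8
```

Order matters: utf-8 is strict enough that text in other encodings rarely
decodes cleanly by accident, so it goes before the single-byte encodings,
which accept almost anything.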

In short, you need a basic understanding of Unicode issues.  The Joel on
Software web page <http://www.joelonsoftware.com/articles/Unicode.html>
someone else posted is a decent start.  Googling for "python unicode" will
yield a nice tutorial from ReportLab and several other interesting links.
Play around with some small examples.

When designing your application consider all the possible input sources and
output destinations.  When writing text to files or databases that are not
Unicode-aware make sure you store everything using one encoding (I recommend
utf-8).  That way, other programs which read that data (or your own program
later on) can assume the encoding.  When reading input data make sure you
understand the auxiliary properties of the data source (for instance, grab
the content-type header from the HTTP response of pages you download from
the net - but be prepared to catch errors and guess, as it can sometimes be
wrong).  If working with data of unknown encoding you'll have to figure out
some heuristics for guessing the data encoding.  Did I mention you need to
know how all input and output text is encoded?  The most important thing to
remember (JoS points this out) is: It ain't nuthin' unless you know the
encoding.  Did I say that enough?
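
Here is a rough sketch of the "trust the header, but be ready to guess"
idea, again in Python 3 syntax.  The function names and the fallback list
are my own assumptions; the only library call is the standard email
package's Message.get_content_charset():

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def decode_body(raw, content_type, fallbacks=("utf-8", "latin-1")):
    """Decode raw bytes using the declared charset, then the fallbacks."""
    declared = charset_from_content_type(content_type)
    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Header was wrong, or named an encoding Python doesn't know.
            continue
    return raw.decode("latin-1")   # last resort, never raises

def to_storage(text):
    """Encode text for storage, always using the same encoding (utf-8)."""
    return text.encode("utf-8")

page = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
stored = to_storage(page)
```

Whatever the input encoding turned out to be, what gets stored is utf-8, so
anything reading it back can decode without guessing.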

unicode('\xfe\xff\x00<\x00w\x00i\x00n\x00k\x00>', "utf-16").encode("ascii")

Skip