Jeremy Hylton : weblog : 2003-10-15
Wednesday, October 15, 2003
I have been struggling with Unicode for my weblog aggregator. There are several feeds that include Unicode data in the title or description. I tried to generate HTML output with a UTF-8 encoding, but that didn't seem to work.
I thought part of the problem was that I didn't know anything about Unicode. For better or worse, I learned that I already knew almost everything that a working programmer should know on the subject. The only thing I learned from Joel Spolsky's nice essay on encoding issues is what a code point is. (I had a vague notion, but I didn't really know.)
The aggregator generates an HTML page including the description elements from all of the source feeds. All of the Unicode I've run into could be encoded as Latin-1, but UTF-8 seemed more general. I generate the HTML with print statements. (I should probably use a template system like ZPT, but I'm not familiar with any of them and the HTML is pretty simple.)
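To make the Latin-1 versus UTF-8 trade-off concrete, here is a minimal sketch. It is written for Python 3 rather than the Python 2 the aggregator uses, and the helper name can_encode and the sample strings are invented for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# A title with an umlaut fits in Latin-1 (iso-8859-1).
umlaut_ok = can_encode("M\u00fcller", "iso-8859-1")

# A curly apostrophe (U+2019) does not fit in Latin-1, but UTF-8 can
# encode any Unicode code point.
curly_latin1 = can_encode("\u2019", "iso-8859-1")
curly_utf8 = can_encode("\u2019", "utf-8")
```

So Latin-1 works only as long as the feeds stay within its 256 characters.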
You can add a meta tag to the HTML page to specify a content-type and charset; iso-8859-1 is Latin-1.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
I had been using "charset=utf-8", but that was not getting interpreted as I expected. I tried Mozilla and IE. The page ends up displaying funny characters like an a with a hat, something I don't recognize, and a superscript TM. (I see the same thing in the current display for Python Programmer Weblogs.) When I changed to Latin-1, the characters started displaying properly -- umlauts and all.
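One plausible way to get exactly that trio of characters (this is a guess, not something confirmed for the aggregator) is UTF-8 bytes being read back as a single-byte encoding. The sketch below uses Python 3, and the helper name misread_utf8 is made up. It uses cp1252 because browsers commonly treat a page labelled iso-8859-1 as Windows-1252.

```python
def misread_utf8(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    # Note: a few byte values are undefined in cp1252, so some inputs
    # would raise UnicodeDecodeError here; the samples below do not.
    return text.encode("utf-8").decode(wrong_encoding)

# U+2019 (curly apostrophe) is three bytes in UTF-8: E2 80 99.
# Read one byte at a time as cp1252, they become three characters:
# a with circumflex, the euro sign, and the trademark sign.
garbled_quote = misread_utf8("\u2019")

# A u with umlaut is two bytes in UTF-8 (C3 BC) and turns into two characters.
garbled_umlaut = misread_utf8("\u00fc")
```

If the output looks like this, the bytes and the declared charset disagree somewhere between the feed, the script, and the browser.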
The big pain is that you can't just write a line of code like
print '<a href="%(guid)s">%(title)s</a>' % d
because the title might contain unprintable Unicode data. You need to specify an encoding by calling the encode() method of the Unicode string.
print ("<a href=\"%(guid)s\">%(title)s</a>" % d).encode("iso-8859-1")
It's just clumsy to have to write all that. Isn't there some way to specify that stdout should use iso-8859-1 to encode all Unicode strings?
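One possible answer is to wrap the output stream in a writer from the standard codecs module, so the encoding is named once and not on every print. This is a minimal sketch, not a tested fix for the aggregator. It is written for Python 3, it uses an in-memory buffer in place of stdout, and the helper name make_writer and the sample entry are hypothetical.

```python
import codecs
import io

def make_writer(binary_stream, encoding="iso-8859-1"):
    """Wrap a binary stream so that every string written to it is encoded."""
    # codecs.getwriter returns the StreamWriter class for the codec.
    # "xmlcharrefreplace" turns characters the codec cannot represent
    # into numeric references such as &#8217; instead of raising an error.
    return codecs.getwriter(encoding)(binary_stream, errors="xmlcharrefreplace")

# Stand-in for a binary stdout (sys.stdout.buffer on Python 3).
buf = io.BytesIO()
out = make_writer(buf)

# Hypothetical feed entry; the keys mirror the dict d used above.
d = {"guid": "http://example.com/1", "title": "M\u00fcller\u2019s weblog"}
out.write('<a href="%(guid)s">%(title)s</a>\n' % d)

encoded = buf.getvalue()
```

The umlaut comes out as a single Latin-1 byte, and the curly apostrophe, which Latin-1 lacks, becomes a character reference that still displays correctly in HTML.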