[Chicago] understanding unicode problems

Fri Nov 16 17:10:47 CET 2007

On Friday November 16 2007 10:07:40 am Carl Karsten wrote:
> A string (unicode or not) is a bunch of bytes.  unicode chars may use more
> than one byte.  What I don't understand:  Why do I need to encode / decode?
>  I get the feeling the error caused is a reminder "so that you know that
> you need to do the other operation later."

I like to think of it this way:

"Unicode is the Platonic Ideal of text; Strings are the shadows on the wall."

Feel free to quote me.

Unicode characters exist in some abstract heavenly place.  They are about as 
pure a representation of text as you could conceive - no glyphs (fonts), no 
bytes, no memory representation (that you care about), nada.  Unicode aims to 
cover every character now in existence, with room for more to be added.

Here on earth, we can't write such things to disk or email them around.  This 
is where strings come in.  Strings (more properly termed bytestrings) are 
simply that - a sequence of bytes.  They have an associated encoding, which 
is basically the alphabet of legal bytes and the characters they stand for.  
The string itself doesn't know its encoding; you either need to be told that 
by an external mechanism or guess (often both).
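
A minimal sketch of why the encoding has to come from outside, written in 
Python 3 syntax (where the bytestring type is called bytes and the unicode 
type is called str).  The sample bytes and the choice of encodings are 
illustrative assumptions:

```python
# Four bytes; nothing in the object records which encoding produced them.
raw = b"caf\xe9"

# The same bytes yield different text depending on the encoding you assume.
as_latin1 = raw.decode("latin-1")       # 0xE9 is e-acute in latin-1
as_cyrillic = raw.decode("iso-8859-5")  # 0xE9 is a Cyrillic letter here

# Under some encodings the bytes are not legal at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cyrillic), utf8_ok)
```

All three results come from identical bytes, which is why a guess or an 
external hint (an HTTP header, a file format spec) is needed.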

encode() takes a unicode and produces a str
decode() takes a str and produces a unicode

You need to supply the source/destination encoding that you're working under. 
The fact that both str & unicode objects have both methods in Python doesn't 
help things.  There's a reason, but it's not very good.
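
The round trip can be sketched as follows.  This uses Python 3 syntax, where 
the unicode type is spelled str and the bytestring type is spelled bytes, and 
where each type has only the one method that makes sense for it.  The sample 
word is just an illustration:

```python
text = "na\u00efve"            # unicode text, 5 characters

# encode(): unicode -> bytestring, in the encoding you name
data = text.encode("utf-8")

# decode(): bytestring -> unicode, using the same encoding
back = data.decode("utf-8")

print(len(text), len(data), back == text)
```

The character count and the byte count differ because the i-diaeresis takes 
two bytes in UTF-8.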

Note that most encodings have a limited alphabet and are therefore not capable 
of representing the full range of unicode characters. UTF-8 (sometimes 
referred to incorrectly & unhelpfully as 'unicode') is a particular encoding 
for bytestrings.  It can represent every unicode character and is the most 
widely used, but it's not the only one. Other commonly seen encodings are 
us-ascii, latin-1, and windows-1252.
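
To make the "limited alphabet" point concrete, here is a small sketch in 
Python 3 syntax (str is the unicode type there).  The helper can_encode is a 
hypothetical name made up for this example, not a library function:

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

price = "\u20ac5"  # euro sign followed by the digit 5

# The euro sign is outside the alphabets of us-ascii and latin-1,
# but windows-1252 and UTF-8 can both represent it.
results = {
    enc: can_encode(price, enc)
    for enc in ("ascii", "latin-1", "windows-1252", "utf-8")
}
print(results)
```

The encodings that do succeed still produce different bytes for the same 
text: one byte for the euro sign in windows-1252, three in UTF-8.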

When coding text handling apps, I find it's best to do all of your processing 
on unicode.  This means *decoding* as *soon* as possible (right after 
reading) and *encoding* as *late* as possible (just before writing).

Here's a little picture:

network => str => decode => unicode => munge => encode => disk
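
The picture above might look like this in code.  This is only a sketch in 
Python 3 syntax (bytes for the bytestring, str for unicode); the in-memory 
streams stand in for the network and the disk, and the function name, the 
upper() "munge" step and the two encodings are assumptions for illustration:

```python
import io

def process(source, sink, in_encoding, out_encoding):
    raw = source.read()                      # bytes from the outside world
    text = raw.decode(in_encoding)           # decode as soon as possible
    munged = text.upper()                    # all processing is on unicode
    sink.write(munged.encode(out_encoding))  # encode as late as possible

# Stand-ins for a network socket and a file on disk.
source = io.BytesIO("caf\u00e9".encode("latin-1"))
sink = io.BytesIO()

process(source, sink, "latin-1", "utf-8")
result = sink.getvalue()
print(result)
```

The only places an encoding is named are the two edges, so the munge step 
never has to care what the bytes looked like.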

Hope this helps.  I've got some bookmarks at http://del.icio.us/pfein/unicode 
if it's still not clear.

-- 
Peter Fein   ||   773-575-0694   ||   pfein at pobox.com
http://www.pobox.com/~pfein/   ||   PGP: 0xCCF6AE6B
irc: pfein at freenode.net   ||   jabber: peter.fein at gmail.com