accessing individual characters in unicode strings

Christian Heimes lists at cheimes.de
Sat Apr 12 08:48:02 EDT 2008


Peter Robinson wrote:
> Dear list
> I am at my wits end on what seemed a very simple task:
> I have some greek text, nicely encoded in utf8, going in and out of a  
> xml database, being passed over and beautifully displayed on the web.   
> For example: the most common greek word of all 'kai' (or και if your  
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for  
> each character (actually set a width attribute somewhere else for each  
> character)

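As John already said: UTF-8 ain't Unicode. UTF-8 is an encoding, like
ASCII or Latin-1, but it works differently inside: it is a
variable-width encoding, so a single character may take from 1 to 4
bytes (the original design allowed up to 6, but RFC 3629 limits it to
4). You have to decode the bytes into a unicode string before you can
step through the text a character at a time.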
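
A minimal sketch of what that means for your problem, written in Python
3 syntax (in Python 2 the same decode() call gives you a unicode
object). The WIDTHS table, DEFAULT_WIDTH and char_widths() are made-up
names for illustration only; the real widths would have to come from
wherever you keep them:

```python
# UTF-8 bytes for the Greek word "kai", as they might arrive from the database.
raw = b"\xce\xba\xce\xb1\xce\xb9"

# Decoding turns the byte string into a text (Unicode) string.
text = raw.decode("utf-8")

print(len(raw))   # counts bytes
print(len(text))  # counts characters (code points)

# Hypothetical per-character widths, keyed by character.
WIDTHS = {
    "\u03ba": 7,  # GREEK SMALL LETTER KAPPA
    "\u03b1": 6,  # GREEK SMALL LETTER ALPHA
    "\u03b9": 3,  # GREEK SMALL LETTER IOTA
}
DEFAULT_WIDTH = 5  # assumed fallback for characters not in the table


def char_widths(s):
    """Return a list of (character, width) pairs, one per code point."""
    return [(ch, WIDTHS.get(ch, DEFAULT_WIDTH)) for ch in s]


# Iterating over the decoded string yields whole characters, not bytes.
for ch, width in char_widths(text):
    print("U+%04X" % ord(ch), width)
```

Iterating over raw instead of text would walk the six bytes one by one,
which is most likely what went wrong in your attempts.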

I highly recommend Joel's article on Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Christian


