accessing individual characters in unicode strings

Sat Apr 12 11:17:23 EDT 2008

On Apr 12, 9:48 am, Christian Heimes <li... at cheimes.de> wrote:
> Peter Robinson schrieb:
>
> > Dear list
> > I am at my wits end on what seemed a very simple task:
> > I have some greek text, nicely encoded in utf8, going in and out of a
> > xml database, being passed over and beautifully displayed on the web.
> > For example: the most common greek word of all 'kai' (or και if your
> > mailer can see utf8)
> > So all I want to do is:
> > step through this string a character at a time, and do something for
> > each character (actually set a width attribute somewhere else for each
> > character)
>
> As John already said: UTF-8 ain't unicode. UTF-8 is an encoding similar
> to ASCII or Latin-1 but different in its inner workings. A single
> character may be encoded by up to 6 bytes.

 Up to 4 bytes under the current definition of UTF-8 (RFC 3629). The
largest code point is U+10FFFF, which is encoded as 0xF4 0x8F 0xBF 0xBF.
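
 A quick way to check those byte counts yourself. This is only a sketch:
the sample characters and the helper name utf8_length are my own picks,
and it is written for Python 3, where a plain string literal is already
unicode.

```python
# Count the UTF-8 bytes needed for a few sample code points.
# The samples and the helper name are illustrative choices.
samples = [
    ("LATIN SMALL LETTER A", "\u0061"),
    ("GREEK SMALL LETTER KAPPA", "\u03ba"),
    ("EURO SIGN", "\u20ac"),
    ("highest code point U+10FFFF", "\U0010ffff"),
]

def utf8_length(ch):
    """Return how many bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for name, ch in samples:
    print("%-30s %d byte(s)" % (name, utf8_length(ch)))
```

 So a Greek letter takes 2 bytes, which is why stepping through the raw
UTF-8 string byte by byte splits every letter in half.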

 I believe the proper way to return the number of characters in Greek
text requires normalizing it first:

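from unicodedata import normalize

def greek_text_length(utf8_string):
    # Decode the UTF-8 bytes into a unicode string first.
    u = unicode(utf8_string, 'utf-8')
    # NFC combines a base letter and its accent into one code point
    # wherever a precomposed form exists.
    u = normalize('NFC', u)
    return len(u)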
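
 To actually step through the characters, which is what Peter asked for,
the same decode-then-normalize idea can be sketched like this. The helper
name characters and the sample byte strings are mine, and this version is
written for Python 3, so it uses bytes.decode() instead of the unicode()
builtin.

```python
from unicodedata import normalize

# 'kai' in Greek letters as UTF-8 bytes: three letters, two bytes each.
kai_utf8 = b"\xce\xba\xce\xb1\xce\xb9"

# alpha followed by a combining acute accent (decomposed form).
alpha_acute_utf8 = b"\xce\xb1\xcc\x81"

def characters(utf8_bytes):
    """Decode UTF-8 bytes and return the NFC-normalized characters as a list."""
    text = utf8_bytes.decode("utf-8")
    return list(normalize("NFC", text))

# One loop iteration per character, not per byte.
for ch in characters(kai_utf8):
    print(repr(ch))
```

 Without the normalize() call the accented alpha would count as two
characters; with it, the pair collapses into the single code point U+03AC.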

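 If some pairs of characters still count as one after normalization (a
base letter plus a combining mark that has no precomposed form, say),
things get worse: len() will overcount them.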
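
 One way to cope with that, as a rough sketch only: glue every combining
mark onto the character before it. The function name clusters is made up
for this example, and it is an approximation, not the full grapheme
cluster rules from the Unicode standard. Written for Python 3.

```python
from unicodedata import combining, normalize

def clusters(text):
    """Split text into base characters, each followed by its combining marks.

    Approximation: any code point with a non-zero combining class is
    attached to the preceding group.
    """
    groups = []
    for ch in normalize("NFC", text):
        if combining(ch) and groups:
            groups[-1] += ch   # attach the mark to the previous base letter
        else:
            groups.append(ch)  # start a new group
    return groups

# kappa + combining acute has no precomposed form, so NFC leaves two
# code points there; clusters() still treats them as one unit.
sample = "\u03ba\u0301\u03b1\u03b9"
print(len(normalize("NFC", sample)), len(clusters(sample)))
```

 Each element of the returned list is then one "character" in the sense
of something you would set a width for.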

>
> I highly recommend Joel's article on unicode:
>
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Christian