Unicode characters

Diez B. Roggisch deets at nospam.web.de
Mon Sep 4 10:07:52 EDT 2006


Paul Johnston wrote:

> Hi
> I have a string which I convert into a list then read through it
> printing its glyph and numeric representation
> 
> #-*- coding: utf-8 -*-
> 
> thestring = "abcd"
> thelist = list(thestring)
> 
> for c in thelist:
>      print c,
>      print ord(c)
> 
> Works fine for latin characters but when I put in a unicode character
> a two byte character gives me two characters. For example an arabic
> alef returns
> 
> *  216
> * 167
> 
> (the first asterisk is the empty set symbol, the second a double "s")
> 
> Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
> sequential listings, i.e.
> 216  167
> 216  168
> 216  169
> So it is reading the correct details.
> 
> 
> Is there any way to get the c in the for loop to recognise it is
> reading a multiple byte character?
> I have followed the info in PEP 0263 and am using Python 2.4.3 Build
> 12 on a Windows box  within Eclipse 3.2.0 and Python plugins 1.2.2

Use unicode objects instead of byte strings. The coding: header has _no_
effect on the plain string literal above.

It applies only to

u"some text"

literals, which it decodes into unicode objects.

Normal string literals are just bytes. Because your editor saves the file
as UTF-8, each multibyte character you type is stored as its individual
bytes, which is why alef shows up as the two values 216 and 167.
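
To make the bytes-versus-characters point concrete, here is a small sketch.
The variable names are made up for illustration, and it assumes Python 2.6
or later (or Python 3.3+), since bytearray is not available in 2.4. The alef
is written as an escape so the result does not depend on how the file is
saved.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; names are hypothetical. Assumes Python 2.6+ or 3.3+.
from __future__ import print_function

alef = u"\u0627"                    # ARABIC LETTER ALEF as a unicode object
alef_bytes = alef.encode("utf-8")   # the bytes a UTF-8 source file would hold

# A byte string has one item per byte, so alef counts as two items.
byte_count = len(alef_bytes)
byte_values = list(bytearray(alef_bytes))

# Decoding turns the two bytes back into a single character.
decoded = alef_bytes.decode("utf-8")
char_count = len(decoded)

print(byte_count, byte_values)
print(char_count, ord(decoded))
```

The two byte values match the 216 and 167 reported in the question, while
the decoded unicode object has length one.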

In a nutshell: try the above using u"abcd".
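
As a rough sketch of what the loop could look like with a unicode object:
the helper name describe and the sample string are assumptions for
illustration, not part of the original code. It reports character names
rather than glyphs, because printing non-ASCII glyphs can fail on consoles
that cannot encode them. It assumes Python 2.6+ or Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; describe() and sample are hypothetical names.
from __future__ import print_function
import unicodedata

def describe(text):
    """Return (name, code point) pairs, one per character of a unicode string."""
    # The second argument to unicodedata.name is a fallback for characters
    # that have no name (for example control characters).
    return [(unicodedata.name(c, "<unnamed>"), ord(c)) for c in text]

# alef, beh, teh marbuta, written as escapes to avoid encoding issues
sample = u"\u0627\u0628\u0629"

rows = describe(sample)
for name, code_point in rows:
    print(name, code_point)
```

Iterating over the unicode object yields one item per character, so each of
the three letters produces a single line instead of two.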
Diez


