Unicode characters

Mon Sep 4 10:07:47 EDT 2006

On 9/4/06, Paul Johnston <paul.johnston at manchester.ac.uk> wrote:
> Hi
> I have a string which I convert into a list then read through it
> printing its glyph and numeric representation
>
> #-*- coding: utf-8 -*-
>
> thestring = "abcd"
> thelist = list(thestring)
>
> for c in thelist:
>      print c,
>      print ord(c)
>
> Works fine for Latin characters, but when I put in a Unicode character,
> a two-byte character gives me two characters. For example, an Arabic
> alef returns
>
> *  216
> * 167
>
> (the first asterisk is the empty set symbol, the second a double "s")
>
> Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
> sequential listings, i.e.
> 216  167
> 216  168
> 216  169
> So it is reading the correct details.
>
>
> Is there any way to get the c in the for loop to recognise that it is
> reading a multi-byte character?
> I have followed the info in PEP 0263 and am using Python 2.4.3 Build
> 12 on a Windows box  within Eclipse 3.2.0 and Python plugins 1.2.2
>
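The pairs in your listing are the UTF-8 bytes of each letter. A minimal
sketch of where they come from, written in Python 3 syntax rather than the
2.4 you are running; the helper name utf8_byte_values is made up for
illustration:

```python
# Hypothetical helper: return the UTF-8 byte values of a piece of text.
def utf8_byte_values(text):
    # bytearray yields integers when iterated, one per encoded byte
    return list(bytearray(text.encode("utf-8")))

# alef, beh, teh marbuta (U+0627, U+0628, U+0629)
for letter in u"\u0627\u0628\u0629":
    print(utf8_byte_values(letter))
```

Each letter is a single character that encodes to two bytes, which is why
a loop over the byte string visits it twice.
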
If the string is not a unicode object, it holds encoded bytes, so
iterating over it only gives you the individual bytes of each character's
encoding. You can convert it to unicode first. Then any character with a
value below 128 is ASCII; anything else is a non-ASCII character, which
takes more than one byte in an encoding such as UTF-8. For example:

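a = 'string'  # a byte string (str in Python 2)
# decode using whatever encoding the bytes are really in, e.g. 'utf-8'
b = unicode(a, encoding_according_your_situation)
for i in b:
    if ord(i) < 128:  # ASCII covers 0..127
        print ord(i), 'ascii'
    else:
        print ord(i), 'multibyte'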
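
For comparison, here is roughly the same idea as a self-contained sketch
in Python 3 syntax, where bytes is the byte string type and str is already
Unicode. It assumes the input is UTF-8, and the helper name describe is
hypothetical:

```python
# Hypothetical helper: decode the bytes, then report each character's
# code point and whether it falls in the ASCII range (0..127).
def describe(raw, encoding="utf-8"):
    text = raw.decode(encoding)
    return [(ord(ch), ord(ch) < 128) for ch in text]

raw = b"a\xd8\xa7\xd8\xa8"  # 'a', alef, beh as UTF-8 bytes
for code_point, is_ascii in describe(raw):
    print(code_point, "ascii" if is_ascii else "non-ascii")
```

After decoding, the loop sees three characters instead of five bytes, and
alef shows up once, as a single code point.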

-- 
I like python!
My Blog: http://www.donews.net/limodou
UliPad Site: http://wiki.woodpecker.org.cn/moin/UliPad
UliPad Maillist: http://groups.google.com/group/ulipad