Iterating over unicode strings

Sun Mar 10 23:57:47 EST 2002

Arun Sharma wrote:
>
> I would like to iterate over the following unicode string one character
> at a time.
>
> line = u"ಡಾ|| ಶಿವರಾಮ ಕಾರಂತ"
> for c in line:
>     print c
>
> fails miserably. What is the right way to do it? I would also like to
> be able to slice the string, i.e. line[i], to get the i'th character.

Please post tracebacks when asking questions like this, to help us
troubleshoot.  

The print statement appears to be converting your Unicode string to
ASCII before writing it out, and ASCII cannot hold the characters you
are feeding it.  You might try the following to see what you're up
against:

for c in line:
    print ord(c)

You'll see (I believe) that some of the values are greater than 127,
which ASCII cannot represent, so the conversion that print attempts
fails.  If you encode the string explicitly before printing, you
should see something.  For example, print line.encode('utf7') or
print line.encode('utf-8').  Which encoding is right depends on what
your terminal expects and on what you are trying to do.

If you are trying to iterate over every character and do something
useful with it, and only added the print to test the loop, then drop
the print and trust that the line is being iterated properly. :)
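
As a rough sketch of the encode-before-print idea above: the snippet
below assumes the string in the question is Kannada text, written here
with \u escapes so the source stays pure ASCII. The choice of 'utf-8'
is an assumption about what the terminal accepts, and the names
non_ascii, ascii_ok, utf8_bytes and round_trip are illustrative only.
print is called as a function, as in current Python.

```python
# The string from the question, assumed to be Kannada, written with \u escapes.
line = (u"\u0ca1\u0cbe|| "
        u"\u0cb6\u0cbf\u0cb5\u0cb0\u0cbe\u0cae "
        u"\u0c95\u0cbe\u0cb0\u0c82\u0ca4")

# Code points above 127 are the ones ASCII cannot represent.
non_ascii = [c for c in line if ord(c) > 127]

# Encoding to ASCII fails, which is what an ASCII-only print runs into.
try:
    line.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

# Encoding explicitly (UTF-8 is an assumed choice) yields plain bytes.
utf8_bytes = line.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

print(repr(utf8_bytes))   # ASCII-safe view of the encoded bytes
print(ascii_ok)
```

Decoding the bytes again gives back the original string, so nothing is
lost by encoding just for output.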

There's no reason you can't index into the string with line[i]
(note that this is indexing, not slicing) or slice it with line[3:5],
but you can't print the result without encoding it first.
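
A minimal sketch of indexing and slicing, under the same assumption
that the string in the question is Kannada text (written with \u
escapes). The names first, piece and labels are hypothetical, and each
code point is shown as ASCII-only text so that printing cannot hit an
encoding problem.

```python
# The string from the question, assumed to be Kannada, written with \u escapes.
line = (u"\u0ca1\u0cbe|| "
        u"\u0cb6\u0cbf\u0cb5\u0cb0\u0cbe\u0cae "
        u"\u0c95\u0cbe\u0cb0\u0c82\u0ca4")

first = line[0]      # indexing: a single code point
piece = line[3:5]    # slicing: a shorter Unicode string

# Walk the string by index and describe each code point in plain ASCII.
labels = ["%d U+%04X" % (i, ord(line[i])) for i in range(len(line))]
for label in labels:
    print(label)
```

Note that line[0] is a consonant and line[1] is the vowel sign attached
to it, so what a reader sees as one character may span more than one
index.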

-Peter