Unicode and string conversions

Sat Nov 17 06:15:18 EST 2001

zayats at blue.seas.upenn.edu (Salim Zayat) writes:

> For example, let's say I have a string 
> 
> >>>s = '\u0162'
> 
> to begin with.  

Where did you get this string from? Why does it have to use \u escapes
to denote non-ASCII characters? Couldn't the string use encodings that
other people use as well (like Latin-1, UTF-8, KOI8-R, etc.)?

> >>>us = unicode(s, 'utf-8')
> or even
> >>>us = unicode('\u0162', 'utf-8')
> 
> I get back :
> 
> >>>u'\\u0162'
> 
> Which is unfortunately not the same thing.  

It is exactly the same - in UTF-8. Every ASCII character (code below
128) stands for itself in UTF-8, so the backslash stands for a
backslash, the u stands for a u, etc. - just as it does in the Unicode
string. The doubled backslash in the output is only how repr displays
a single backslash.
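
A minimal sketch of that point, assuming an interpreter that accepts
b'...' literals (Python 2.6 or later, or Python 3). The names raw and
decoded are only illustrative and do not come from the original post:

```python
# The six bytes that s = '\u0162' holds as a byte string:
# backslash, 'u', '0', '1', '6', '2'. The backslash is doubled here
# so that the literal means the same thing in every Python version.
raw = b'\\u0162'

# Decoding as UTF-8 maps each byte below 128 to the same ASCII character.
decoded = raw.decode('utf-8')

print(len(raw))      # number of bytes in the input
print(len(decoded))  # number of characters after decoding

# The result is still six separate characters, not the single
# character U+0162.
is_single_char = (decoded == u'\u0162')
print(is_single_char)
```

So the UTF-8 codec did its job; it simply has no reason to treat the
backslash as anything special.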

> I am just a whole lot of confused.

It looks that way. If you absolutely *have* to have \u treated as an
escape in a byte string, you can use the 'unicode-escape' encoding:

>>> unicode('\u0162','unicode-escape')
u'\u0162'
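
Roughly the same thing in method form, as a sketch that assumes
b'...' literals are available (Python 2.6 or later, or Python 3). The
variable names are made up for illustration, and the final encode step
is an addition to show what the character looks like in real UTF-8:

```python
# The six-byte input: backslash, 'u', '0', '1', '6', '2'.
raw = b'\\u0162'

# The unicode-escape codec interprets \uXXXX sequences found in the bytes.
text = raw.decode('unicode_escape')

print(len(text))       # one character now
print(hex(ord(text)))  # its code point

# For comparison, the same character encoded as genuine UTF-8 takes
# two bytes, neither of which is a backslash.
utf8_bytes = text.encode('utf-8')
print(len(utf8_bytes))
```

If the data arrives as real UTF-8 bytes like these in the first place,
decoding it as 'utf-8' gives the expected character and no escapes are
needed.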

HTH,
Martin