Unicode and string conversions

Salim Zayat zayats at blue.seas.upenn.edu
Fri Nov 16 16:54:17 EST 2001


This makes a little more sense.  But what I don't get is what I can do 
when all I am given to begin with is a plain string.

For example, let's say I have a string 

>>> s = '\u0162'

to begin with.  If I run the steps illustrated here and in a few other 
places I have read:

>>> us = unicode(s, 'utf-8')
or even
>>> us = unicode('\u0162', 'utf-8')

I get back:

u'\\u0162'

Which is unfortunately not the same thing.  I read on the website that 
the \uXXXX tag in a string means a 32-bit hex value.  But I thought utf-8 
was an 8-bit character set (if I am not mistaken).

I am just a whole lot of confused.

Thanks.

Salim
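
A minimal sketch of what seems to be happening in the session above, written 
so it also runs on current Python, where the old 8-bit string corresponds to 
bytes.  It assumes the plain string really holds the six characters 
backslash, u, 0, 1, 6, 2.  A \uXXXX escape carries four hex digits (16 bits), 
and UTF-8 is a variable-width encoding built from 8-bit units, so decoding 
as UTF-8 leaves the six characters alone.  The 'unicode_escape' codec is the 
one that interprets the escape; in the Python 2 session above that would 
presumably be unicode(s, 'unicode-escape').  The variable names are 
illustrative only.

```python
# The plain string holds six bytes: a backslash, 'u', and four hex digits.
raw = b'\\u0162'
assert len(raw) == 6

# Decoding as UTF-8 treats each byte as an ordinary ASCII character,
# so the result is still the six-character text \u0162.
as_utf8 = raw.decode('utf-8')

# The 'unicode_escape' codec interprets the \uXXXX escape and yields
# the single character U+0162.
ch = raw.decode('unicode_escape')

# UTF-8 stores that one code point as two 8-bit units.
encoded = ch.encode('utf-8')

print(len(as_utf8), len(ch), hex(ord(ch)), encoded)
```

The escape notation and the byte encoding are separate layers: the first is 
a way of spelling a code point in source text, the second is how that code 
point is laid out in bytes.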



Werner Schiendl (ws-news at gmx.at) wrote:
: Hi,

: you can convert (encode) a Unicode string to an 8-bit encoding (a string)
: with the encode() method of the Unicode string object.
: The reverse is possible with the built-in function unicode().

: e.g.

: >>> us=u'\u0621\u0622'
: >>> s=us.encode('utf-8')
: >>> s
: '\xd8\xa1\xd8\xa2'
: >>>
: >>> nus=unicode(s, 'utf-8')
: >>> nus
: u'\u0621\u0622'
: >>> print unicode.__doc__
: unicode(string [, encoding[, errors]]) -> object

: Create a new Unicode object from the given encoded string.
: encoding defaults to the current default string encoding and
: errors, defining the error handling, to 'strict'.
: >>> print us.encode.__doc__
: S.encode([encoding[,errors]]) -> string

: Return an encoded string version of S. Default encoding is the current
: default string encoding. errors may be given to set a different error
: handling scheme. Default is 'strict' meaning that encoding errors raise
: a ValueError. Other possible values are 'ignore' and 'replace'.
: >>>

: You need to specify which encoding should be used.
: The available encodings reside in the package 'encodings' of the Python
: distribution; the 'codecs' module provides the functions to look them up.

: hth
: Werner
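
The round trip described in the quoted message can be sketched as follows.  
This is an illustration in current Python syntax rather than the Python 2 
session above (str.encode() and bytes.decode() stand in for the encode() 
method and the unicode() built-in), and the variable names are made up for 
the example.

```python
import codecs

text = u'\u0621\u0622'            # two Arabic letters, as in the example above

# Encode the Unicode text to an 8-bit representation and decode it back.
data = text.encode('utf-8')
back = data.decode('utf-8')

# With the default 'strict' error handling, an encoding that cannot
# represent the text raises UnicodeEncodeError (a ValueError subclass).
try:
    text.encode('ascii')
    strict_failed = False
except ValueError:
    strict_failed = True

# 'replace' substitutes a placeholder for each unencodable character.
replaced = text.encode('ascii', 'replace')

# codecs.lookup() reports whether an encoding name is known;
# unknown names raise LookupError.
try:
    codecs.lookup('no-such-encoding')
    known = True
except LookupError:
    known = False

print(data, back == text, strict_failed, replaced, known)
```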




