Unicode and string conversions
Salim Zayat
zayats at blue.seas.upenn.edu
Fri Nov 16 16:54:17 EST 2001
This makes a little more sense. But what I don't get is when I am only
given a string to begin with, what can I do with it?
For example, let's say I have a string
>>> s = '\u0162'
to begin with. If I run the steps illustrated here and in a few other
places I have read:
>>> us = unicode(s, 'utf-8')
or even
>>> us = unicode('\u0162', 'utf-8')
I get back:
u'\\u0162'
Which is unfortunately not the same thing. I read on the website that
the \uXXXX tag in a string means a 32-bit hex value. But I thought utf-8
was an 8-bit character set (if I am not mistaken).
I am just a whole lot of confused.
Thanks.
Salim
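A possible explanation of the result above, sketched in Python 3 syntax (an assumption; the thread itself uses Python 2, where unicode(s, 'unicode_escape') would play the role of bytes.decode('unicode_escape') here). In a plain 8-bit string the \u escape is not interpreted, so the string holds six characters: a backslash, 'u', and four hex digits. Decoding it as UTF-8 just copies those six characters, while the 'unicode_escape' codec interprets them. Note also that \uXXXX takes four hex digits (16 bits), and UTF-8 is a variable-width encoding made of 8-bit units rather than an 8-bit character set. The names raw, as_utf8, ch and encoded are illustrative only.

```python
# Python 3 sketch. b'\\u0162' mirrors the Python 2 str '\u0162':
# six 8-bit characters, not one Unicode character.
raw = b'\\u0162'

# Decoding as UTF-8 leaves the backslash sequence untouched,
# which corresponds to the u'\\u0162' result reported above.
as_utf8 = raw.decode('utf-8')

# The 'unicode_escape' codec interprets \uXXXX escapes instead,
# giving the single character U+0162.
ch = raw.decode('unicode_escape')

# Encoding that one character as UTF-8 yields two bytes.
encoded = ch.encode('utf-8')

print(len(raw), len(as_utf8), len(ch), encoded)
```

If the string literal is under your control, writing it as a Unicode literal in the first place (u'\u0162' in Python 2) avoids the extra decoding step.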
Werner Schiendl (ws-news at gmx.at) wrote:
: Hi,
: you can convert (encode) a Unicode string to an 8-bit encoding (a string)
: with the encode() method of the Unicode string object.
: The reverse is possible with the built-in function unicode(),
: e.g.
: >>> us=u'\u0621\u0622'
: >>> s=us.encode('utf-8')
: >>> s
: '\xd8\xa1\xd8\xa2'
: >>>
: >>> nus=unicode(s, 'utf-8')
: >>> nus
: u'\u0621\u0622'
: >>> print unicode.__doc__
: unicode(string [, encoding[, errors]]) -> object
: Create a new Unicode object from the given encoded string.
: encoding defaults to the current default string encoding and
: errors, defining the error handling, to 'strict'.
: >>> print us.encode.__doc__
: S.encode([encoding[,errors]]) -> string
: Return an encoded string version of S. Default encoding is the current
: default string encoding. errors may be given to set a different error
: handling scheme. Default is 'strict' meaning that encoding errors raise
: a ValueError. Other possible values are 'ignore' and 'replace'.
: >>>
: You need to specify which encoding should be used.
: The available encodings reside in the 'encodings' package of the Python
: distribution and are looked up through the 'codecs' module.
: hth
: Werner