decode unicode string using 'unicode_escape' codecs
Steven Bethard
steven.bethard at gmail.com
Fri Jan 13 03:01:18 EST 2006
aurora wrote:
> I have a unicode string with some characters encoded using Python
> notation, like '\n' for LF. I need to convert that to the actual LF
> character. There is a 'unicode_escape' codec that seems to suit my purpose.
>
> >>> encoded = u'A\\nA'
> >>> decoded = encoded.decode('unicode_escape')
> >>> print len(decoded)
> 3
>
> Note that both encoded and decoded are unicode strings. I'm trying to
> use the builtin codec because I assume it has better performance than
> a pure Python decoder I would write myself. But I'm not converting between
> byte strings and unicode strings.
>
> However, it runs into problems in some cases.
>
> encoded = u'€\\n€'
> decoded = encoded.decode('unicode_escape')
> Traceback (most recent call last):
> File "g:\bin\py_repos\mindretrieve\trunk\minds\x.py", line 9, in ?
> decoded = encoded.decode('unicode_escape')
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in
> position 0: ordinal not in range(128)
Does this do what you want?
>>> u'€\\n€'
u'\x80\\n\x80'
>>> len(u'€\\n€')
4
>>> u'€\\n€'.encode('utf-8').decode('string_escape').decode('utf-8')
u'\x80\n\x80'
>>> len(u'€\\n€'.encode('utf-8').decode('string_escape').decode('utf-8'))
3
Basically, I convert the unicode string to bytes, unescape the bytes using
the 'string_escape' codec, and then convert the bytes back into a
unicode string.
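For anyone reading this on Python 3, where the 'string_escape' codec no longer exists, here is a minimal sketch of a similar round trip. It is not the code from this thread: it assumes Python 3 str semantics, the helper name unescape is made up for illustration, and it swaps UTF-8 for Latin-1 with the 'backslashreplace' error handler so that 'unicode_escape' only ever sees bytes it can interpret.

```python
def unescape(text):
    # Hypothetical helper (not from the original post); assumes Python 3.
    # Characters outside Latin-1 (such as the euro sign) are rewritten as
    # \uXXXX escapes; everything else becomes a single Latin-1 byte.
    raw = text.encode('latin-1', 'backslashreplace')
    # 'unicode_escape' reads the bytes as Latin-1 and interprets backslash
    # escapes, including the \uXXXX ones produced in the previous step.
    return raw.decode('unicode_escape')


encoded = '\u20ac\\n\u20ac'   # euro sign, backslash, 'n', euro sign
decoded = unescape(encoded)
print(len(encoded), len(decoded))
```

The design choice here is to let the codec restore the non-ASCII characters itself, rather than decoding UTF-8 afterwards as in the Python 2 session above.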
HTH,
STeVe