Unicode utf-8 doesn't do back-and-forth?
Mike C. Fletcher
mcfletch at rogers.com
Mon Jul 1 23:43:58 EDT 2002
I'm trying to create exhaustive lists of unicode character categories
for use by parsers. In doing so, I'm parsing the unicode database and
generating the category uni-strings from the category sets.
What I'd like to do is save the unicode strings in a python file for
easy access.
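For concreteness, building the per-category strings might look roughly like the sketch below. It makes two assumptions that are not in the post: it runs on a current Python 3 interpreter, and it uses the standard `unicodedata` module as a stand-in for parsing the database file by hand. The name `category_strings` is hypothetical.

```python
import sys
import unicodedata
from collections import defaultdict


def category_strings(limit=sys.maxunicode + 1):
    """Map each general category (e.g. 'Lu') to a string of its characters.

    Hypothetical helper: unicodedata.category() stands in for a hand-written
    parser of the Unicode database file.
    """
    buckets = defaultdict(list)
    for codepoint in range(limit):
        ch = chr(codepoint)
        buckets[unicodedata.category(ch)].append(ch)
    return {name: "".join(chars) for name, chars in buckets.items()}


# Restrict the scan to the first 128 code points to keep the demo small.
ascii_categories = category_strings(limit=0x80)
print(ascii_categories["Nd"])  # the ASCII decimal digits
```

Calling `category_strings()` with no argument scans the whole code space, which takes noticeably longer.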
No problem thinks I (stupid I), I'll just write out a repr() of the
unicode object --> no go, when I change locale to "fr", I get errors
trying to import. The repr()'d string has lots of \x escapes, so I
imagine it's getting cute and attempting to reduce the number of
unicode-escaped chars by using > 127 chars in a locale-specific manner.
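One way to take the locale out of the picture is to make sure the generated file holds nothing but ASCII, with every other character written as an escape. The sketch below assumes a current Python 3 interpreter, where the built-in `ascii()` does exactly that. The function name `render_category_module` and the sample data are hypothetical.

```python
def render_category_module(categories):
    """Return Python source that assigns each category string to a name."""
    lines = []
    for name in sorted(categories):
        # ascii() escapes every non-ASCII character as \x, \u or \U,
        # so the generated source contains only 7-bit characters.
        lines.append("%s = %s" % (name, ascii(categories[name])))
    return "\n".join(lines) + "\n"


# Hypothetical sample data: a few characters per category.
sample = {"Lu": "A\u00c9\u0391", "Nd": "0\u0660"}
source = render_category_module(sample)

# Raises UnicodeEncodeError if any non-ASCII character slipped through.
source.encode("ascii")

# Executing the generated source recovers the original strings.
namespace = {}
exec(source, namespace)
print(namespace["Lu"] == sample["Lu"])
```

Writing `source` to a `.py` file then gives a module that can be imported without depending on any source-encoding or locale setting.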
Well thinks I (stupid I), I'll just use the natural 8-bit transfer
format for Unicode, utf-8. When I do that, I get weird "Unicode
errors". For instance (simplified example):
>>> unicode( u'\ud800\udb7f\udb80\U0010fc00\udfff'.encode('utf8'),'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte
>>> u'\ud800\udb7f\udb80\U0010fc00\udfff'.encode('utf8')
'\xa0\x80\xad\xbf\xae\x80\xf4\x8f\xb0\x80\xed\xbf\xbf'
>>>
I was under the impression that utf-8 was supposed to be able to support
any Unicode character with full back/forth translation. It doesn't seem
to here... Anyone clueful out there like to slap me upside the head with
my obvious mistake?
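A point that may be relevant: UTF-8 is defined only for Unicode scalar values, and the code points U+D800 through U+DFFF (the surrogates) are not scalar values. Written out as three-byte sequences, the lone surrogates in the test string would each start with the lead byte 0xED, and the encoded output shown above lacks those lead bytes. That could be what the decoder reports as an unexpected code byte. The sketch below assumes a current Python 3 interpreter rather than the one used in the session above. It shows the strict codec rejecting the string and the standard `surrogatepass` error handler round-tripping it.

```python
# The same code points as in the session above, as a Python 3 str.
text = "\ud800\udb7f\udb80\U0010fc00\udfff"

# The strict UTF-8 codec refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler writes each surrogate as a
# three-byte sequence and reads it back the same way.
raw = text.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")

print(strict_ok)
print(raw)
print(round_tripped == text)
```

Bytes produced with `surrogatepass` are not well-formed UTF-8 for other consumers, so this is more of a private serialization than an interchange format.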
Enjoy,
Mike
_______________________________________
Mike C. Fletcher
http://members.rogers.com/mcfletch/