Unicode utf-8 doesn't do back-and-forth?

Mike C. Fletcher mcfletch at rogers.com
Mon Jul 1 23:43:58 EDT 2002


I'm trying to create exhaustive lists of unicode character categories 
for use by parsers.  In doing so, I'm parsing the unicode database and 
generating the category uni-strings from the category sets.

What I'd like to do is save the unicode strings in a python file for 
easy access.

No problem, thinks I (stupid I), I'll just write out a repr() of the 
unicode object.  No go: when I change locale to "fr", I get errors 
trying to import.  The repr()'d string has lots of \x escapes, so I 
imagine it's getting cute and attempting to reduce the number of 
unicode-escaped chars by using > 127 chars in a locale-specific manner.
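
For reference, a minimal sketch of the "write out an escaped literal" 
idea, assuming a Python 3 interpreter rather than the Python 2 one used 
in this post.  There the built-in ascii() escapes every non-ASCII 
character, so the generated file should contain only 7-bit bytes 
whatever the locale.  The helper name to_ascii_literal and the sample 
string are made up for illustration.

```python
import ast


def to_ascii_literal(text):
    """Return a Python string literal for `text` that uses only ASCII."""
    # ascii() is like repr(), but escapes every non-ASCII character
    # with \xNN, \uNNNN or \UNNNNNNNN.
    return ascii(text)


# Hypothetical sample: Latin-1, Greek and a non-BMP character.
sample = "caf\u00e9 \u0394 \U0001d11e"
literal = to_ascii_literal(sample)

# The literal is safe to paste into a generated .py file.
assert all(ord(ch) < 128 for ch in literal)

# Evaluating the literal gives back the original string.
restored = ast.literal_eval(literal)
assert restored == sample
```

The literal can then be written into the generated module as the 
right-hand side of an assignment.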

Well, thinks I (stupid I), I'll just use the natural 8-bit transfer 
format for Unicode, utf-8.  When I do that, I get weird "Unicode 
errors".  For instance (simplified example):

>>> unicode( u'\ud800\udb7f\udb80\U0010fc00\udfff'.encode('utf8'),'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte
>>> u'\ud800\udb7f\udb80\U0010fc00\udfff'.encode('utf8')
'\xa0\x80\xad\xbf\xae\x80\xf4\x8f\xb0\x80\xed\xbf\xbf'
>>>

I was under the impression that utf-8 was supposed to be able to support 
any Unicode character with full back/forth translation.  It doesn't seem 
to here... Anyone clueful out there like to slap me upside the head with 
my obvious mistake?
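
For comparison, a minimal sketch of the same round trip, assuming a 
Python 3 interpreter rather than the Python 2 session above.  The code 
points U+D800 through U+DFFF are surrogates, not characters, so a strict 
UTF-8 codec refuses them when they appear unpaired; the "surrogatepass" 
error handler lets them through.  The helper name round_trips is made up 
for illustration.

```python
def round_trips(text, errors="strict"):
    """True if `text` survives a UTF-8 encode/decode unchanged."""
    try:
        data = text.encode("utf-8", errors)
        return data.decode("utf-8", errors) == text
    except UnicodeError:
        # Raised when the codec rejects the input under this handler.
        return False


# Ordinary characters, including one outside the BMP.
ordinary = "\u00e9\u4e2d\U0010fc00"

# The string from the session above: four lone surrogates around U+10FC00.
from_post = "\ud800\udb7f\udb80\U0010fc00\udfff"

print(round_trips(ordinary))                    # True
print(round_trips(from_post))                   # False
print(round_trips(from_post, "surrogatepass"))  # True
```

So ordinary characters go back and forth fine; it is the unpaired 
surrogates in the test string that the strict codec rejects.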

Enjoy,
Mike
_______________________________________
   Mike C. Fletcher
   http://members.rogers.com/mcfletch/






