Unicode utf-8 doesn't do back-and-forth?

Martin v. Loewis martin at v.loewis.de
Tue Jul 2 03:22:43 EDT 2002


"Mike C. Fletcher" <mcfletch at rogers.com> writes:

> Well thinks I (stupid I), I'll just use the natural 8-bit transfer
> format for Unicode, utf-8.  When I do that, I get weird "Unicode
> errors".  For instance (simplified example):
> 
>  >>> unicode( u'\ud800\udb7f\udb80\U0010fc00\udfff'.encode('utf8'),'utf8')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: unexpected code byte
>  >>> u'\ud800\udb7f\udb80\U0010fc00\udfff'.encode('utf8')
> '\xa0\x80\xad\xbf\xae\x80\xf4\x8f\xb0\x80\xed\xbf\xbf'
>  >>>
> 
> I was under the impression that utf-8 was supposed to be able to
> support any Unicode character with full back/forth translation. 

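That impression is incorrect: surrogates (the code points U+D800
through U+DFFF, reserved in Unicode for use with UTF-16) are special.
They are not characters in their own right, so UTF-8 is not required
to round-trip them when they appear unpaired.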
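
To illustrate the point about surrogates, here is a minimal sketch. It
assumes a modern Python 3 interpreter rather than the Python 2.2
session quoted above, so the exact error differs from the one shown
there; the helper name round_trips is hypothetical and exists only
for this example.

```python
def round_trips(text):
    """Return True if text survives a strict UTF-8 encode/decode cycle."""
    try:
        return text.encode("utf-8").decode("utf-8") == text
    except UnicodeError:
        # Strict UTF-8 refuses lone surrogates in either direction.
        return False


# A character outside the BMP is an ordinary code point and round-trips.
astral = "\U0010fc00"
print(round_trips(astral))            # True

# Unpaired surrogates are not valid characters, so strict UTF-8 rejects them.
lone_surrogates = "\ud800\udb7f\udb80\udfff"
print(round_trips(lone_surrogates))   # False

# The "surrogatepass" error handler opts in to encoding them anyway.
raw = lone_surrogates.encode("utf-8", "surrogatepass")
print(raw.decode("utf-8", "surrogatepass") == lone_surrogates)  # True
```

The bytes produced with "surrogatepass" are not well-formed UTF-8, so
other decoders may reject them; it is best kept for data that never
leaves your own program.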

Regards,
Martin



