Unicode utf-8 doesn't do back-and-forth?

Piet van Oostrum piet at cs.uu.nl
Mon Jul 8 09:28:17 EDT 2002


>>>>> Piet van Oostrum <piet at cs.uu.nl> (PvO) writes:

PvO> Your original input string u'\ud800\udb7f\udb80\U0010fc00\udfff' wasn't a
PvO> Unicode string, because the surrogates are not Unicode characters. Worse,
PvO> it wasn't even a valid UTF-16 encoding as it contains consecutive words
PvO> from the range 0xd800-0xdbff.

PvO> The encoding (or maybe even the Python parser) should have given an error
PvO> message. Instead, it produced an invalid UTF-8 byte sequence, which then
PvO> gives an error message at decoding. 

Well, I looked into the Unicode specs, and they say that even if single
surrogates appear in a string, the UTF-8 encoding should generate a valid
UTF-8 byte sequence, which on decoding should give back the same surrogates.
So I would say this is a bug in the UTF-8 encoding.
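
A minimal sketch of the round trip I have in mind. It assumes a Python 3
interpreter, where the strict 'utf-8' codec refuses lone surrogates and the
'surrogatepass' error handler opts in to encoding them. It only illustrates
the behaviour; it is not how any particular codec is implemented.

```python
# Assumption: Python 3. The strict 'utf-8' codec rejects lone surrogates;
# the 'surrogatepass' error handler encodes and decodes them instead.
lone = '\ud800'  # a single high surrogate, not part of a pair

# Strict encoding is expected to fail on the lone surrogate.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With 'surrogatepass' the surrogate is written as an ordinary
# three-byte sequence and read back unchanged.
encoded = lone.encode('utf-8', 'surrogatepass')
decoded = encoded.decode('utf-8', 'surrogatepass')

print(strict_ok, encoded, decoded == lone)
```

The same handler has to be given on both sides: decoding those bytes with the
strict codec raises UnicodeDecodeError.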
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl


