Unicode utf-8 doesn't do back-and-forth?

Piet van Oostrum piet at cs.uu.nl
Mon Jul 8 06:17:51 EDT 2002


>>>>> "Mike C. Fletcher" <mcfletch at rogers.com> (MCF) writes:

MCF> Well, and here I was believing utf was a clean and elegant format to make
MCF> the best of a bad situation (I'm hoping utf-8 still is, though, of course,
MCF> it will have these dang surrogates to contend with ;) ).

MCF> "Nothing's clean, nothing's elegant kid.  Get used to that.  This is the
MCF> real world, and out here, we just hack at the corpses until they give us
MCF> what we want.  There're no master criminals any more, just frustrated
MCF> people, impossible situations, and no emotional air conditioning."

UTF-8 is still clean and elegant IMHO. But in UTF-8 not every byte
sequence is a valid encoding of a character.
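
To make that point concrete, here is a minimal sketch in present-day
Python 3 (not the Python of this thread). The helper name is_valid_utf8
is made up for illustration; it just asks the built-in strict decoder
whether a byte string is well-formed.

```python
# Python 3 sketch; is_valid_utf8 is a hypothetical helper, not a library call.
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

valid = "\u00e9".encode("utf-8")   # b'\xc3\xa9', a well-formed two-byte sequence
lone_continuation = b"\xa9"        # continuation byte without a lead byte
truncated = b"\xc3"                # lead byte without its continuation byte
encoded_surrogate = b"\xed\xa0\x80"  # what a naive encoder emits for 0xd800

for sample in (valid, lone_continuation, truncated, encoded_surrogate):
    print(sample, is_valid_utf8(sample))
```

Only the first sample is accepted; the other three are byte sequences
that no valid UTF-8 encoder should ever produce.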

The problem with the surrogates is that they are not Unicode characters.
Rather, they are halves of Unicode characters in UTF-16 encoding. In UTF-16
encoding, Unicode characters >= 2**16 are encoded with two 16-bit
surrogates, the first one from the range 0xd800-0xdbff, the second one
from 0xdc00-0xdfff.
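
The split into two surrogates is plain arithmetic, sketched below in
Python 3. The function name to_surrogate_pair is made up for this
example; the constants are the range boundaries given above.

```python
# Python 3 sketch; to_surrogate_pair is a hypothetical helper name.
def to_surrogate_pair(code_point):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> 0xd800-0xdbff
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> 0xdc00-0xdfff
    return high, low

high, low = to_surrogate_pair(0x10FC00)
print(hex(high), hex(low))                 # 0xdbff 0xdc00
```

The result can be cross-checked against the built-in codec:
"\U0010fc00".encode("utf-16-be") gives the same two 16-bit words.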

Your original input string u'\ud800\udb7f\udb80\U0010fc00\udfff' wasn't a
Unicode string, because the surrogates are not Unicode characters. Worse,
it wasn't even a valid UTF-16 encoding, as it contains consecutive 16-bit
words from the range 0xd800-0xdbff.
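
A rough well-formedness check for a sequence of 16-bit words could look
like the Python 3 sketch below. The name first_utf16_error is made up,
and the list of words assumes that \U0010fc00 is stored as the pair
0xdbff 0xdc00, as it would be on a build with 16-bit Unicode storage.

```python
# Python 3 sketch; first_utf16_error is a hypothetical helper name.
def first_utf16_error(units):
    """Return the index of the first ill-formed 16-bit word, or None."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A first (high) surrogate must be followed by a second (low) one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= unit <= 0xDFFF:
            # A second (low) surrogate with no first surrogate before it.
            return i
        i += 1
    return None

# Assumed 16-bit representation of the string from the original post.
units = [0xD800, 0xDB7F, 0xDB80, 0xDBFF, 0xDC00, 0xDFFF]
print(first_utf16_error(units))            # 0
```

The check already fails at index 0, because 0xd800 is followed by
another word from the 0xd800-0xdbff range.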

The encoder (or maybe even the Python parser) should have given an error
message. Instead, it produced an invalid UTF-8 byte sequence, which then
gives an error message at decoding.
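
For comparison, a sketch of how this looks in present-day Python 3,
which is an assumption relative to the interpreter discussed in this
thread. There the strict UTF-8 encoder refuses lone surrogates, and the
old behaviour has to be requested with the "surrogatepass" error
handler. The helper name try_encode_utf8 is made up for the example.

```python
# Python 3 sketch; try_encode_utf8 is a hypothetical helper name.
def try_encode_utf8(text):
    """Return the strict UTF-8 encoding of text, or None if it is rejected."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None

lone = "\ud800"                            # a single first (high) surrogate
print(try_encode_utf8(lone))               # None: the encoder reports an error
print(try_encode_utf8("\U0010fc00"))       # b'\xf4\x8f\xb0\x80'

# Asking for the old behaviour explicitly yields bytes that a strict
# decoder rejects, which is the round-trip failure described above.
leaked = lone.encode("utf-8", "surrogatepass")
try:
    leaked.decode("utf-8")
    round_trips = True
except UnicodeDecodeError:
    round_trips = False
print(leaked, round_trips)                 # b'\xed\xa0\x80' False
```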
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl


