Unicode utf-8 doesn't do back-and-forth?
Piet van Oostrum
piet at cs.uu.nl
Mon Jul 8 06:17:51 EDT 2002
>>>>> "Mike C. Fletcher" <mcfletch at rogers.com> (MCF) writes:
MCF> Well, and here I was believing utf was a clean and elegant format to make
MCF> the best of a bad situation (I'm hoping utf-8 still is, though, of course,
MCF> it will have these dang surrogates to contend with ;) ).
MCF> "Nothing's clean, nothing's elegant kid. Get used to that. This is the
MCF> real world, and out here, we just hack at the corpses until they give us
MCF> what we want. There're no master criminals any more, just frustrated
MCF> people, impossible situations, and no emotional air conditioning."
UTF-8 is still clean and elegant IMHO. But in UTF-8 not every byte
sequence is a valid encoding.
The problem with the surrogates is that they are not Unicode characters.
Rather, they are halves of Unicode characters in the UTF-16 encoding. In
UTF-16, Unicode characters >= 2**16 are encoded with two 16-bit
surrogates, the first one from the range 0xd800-0xdbff, the second one
from 0xdc00-0xdfff.
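
To make the arithmetic concrete, here is a minimal sketch of how a
character above the 16-bit range is split into two surrogates and put back
together. The function names to_surrogates and from_surrogates are made up
for this illustration; they are not part of Python.

```python
def to_surrogates(code_point):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> 0xd800-0xdbff
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xdc00-0xdfff
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For instance, to_surrogates(0x10FC00) gives (0xDBFF, 0xDC00), and
from_surrogates(0xDBFF, 0xDC00) gives back 0x10FC00.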
Your original input string u'\ud800\udb7f\udb80\U0010fc00\udfff' wasn't a
Unicode string, because the surrogates are not Unicode characters. Worse,
it wasn't even a valid UTF-16 encoding as it contains consecutive words
from the range 0xd800-0xdbff.
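
The pairing rule can be checked mechanically. The following is only a
sketch, with a hypothetical helper name, that scans a list of 16-bit words
and reports the first surrogate that is not part of a proper
first-then-second pair.

```python
def first_unpaired_surrogate(units):
    """Return the index of the first unpaired surrogate, or None if all pair up."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A first-half surrogate must be followed by a second-half one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= unit <= 0xDFFF:
            # A second-half surrogate with no first half before it.
            return i
        i += 1
    return None


# The string above as 16-bit words; U+10FC00 becomes the pair DBFF DC00.
words = [0xD800, 0xDB7F, 0xDB80, 0xDBFF, 0xDC00, 0xDFFF]
```

Here first_unpaired_surrogate(words) returns 0, because 0xd800 is followed
by another word from the 0xd800-0xdbff range.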
The encoding (or maybe even the Python parser) should have given an error
message. Instead, it produced an invalid UTF-8 byte sequence, which then
gives an error message at decoding.
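
For comparison, a small sketch of the stricter behaviour, assuming a
Python 3 interpreter (which is much newer than the one discussed in this
thread): there the utf-8 codec refuses lone surrogates when encoding as
well as when decoding. The two helper names are invented for the example.

```python
def encodes_to_utf8(text):
    """Return True if text can be encoded as UTF-8 by the strict codec."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


def decodes_from_utf8(data):
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# The string from the original post, written as a Python 3 literal.
original = '\ud800\udb7f\udb80\U0010fc00\udfff'
```

Under that assumption, encodes_to_utf8(original) is False, while
encodes_to_utf8('\U0010fc00') is True. Likewise
decodes_from_utf8(b'\xed\xa0\x80'), the bytes a lenient encoder would
produce for 0xd800, is False.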
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl