Encoding of surrogate code points to UTF-8

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Oct 8 18:30:41 EDT 2013


On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:

> In any case, "\ud800\udc01" isn't a valid unicode string. 

I don't think this is correct. Can you show me where the standard says 
that Unicode strings[1] may not contain surrogates? I think that is a 
critical point, and the FAQ conflates *encoded strings* (i.e. bytes using 
one of the UTFs) with *Unicode strings*.

The string you give above is a Unicode string containing two code 
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a 
legal string (subject to somebody pointing me to a definitive source that 
proves it is not). However, it *may or may not* be encodable to bytes 
using UTF-8, UTF-16 or UTF-32.
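
As a rough illustration of the distinction, here is a sketch assuming a 
reasonably recent Python 3. It only shows what one implementation does, 
not what the standard requires: the str type holds the two surrogates as 
ordinary code points, the strict UTF-8 codec refuses to encode them, and 
Python's 'surrogatepass' error handler forces them through anyway.

```python
# A str holding two surrogate code points (not one astral character).
s = '\ud800\udc01'
code_points = [hex(ord(c)) for c in s]   # two separate code points

# The strict UTF-8 codec rejects lone surrogates.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes each surrogate as its own 3-byte sequence.
passed = s.encode('utf-8', 'surrogatepass')
print(code_points, strict_ok, passed)
```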

Just as there are byte sequences that cannot be generated by the UTFs, 
possibly there are code point sequences that cannot be converted to bytes 
using the UTFs.
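
The first half of that claim is easy to check. A minimal sketch, again 
assuming Python 3 (the helper name is_valid_utf8 is mine, purely for 
illustration): some byte strings can never be the output of a UTF-8 
encoder, so a strict decoder rejects them.

```python
def is_valid_utf8(data):
    """Return True if a strict UTF-8 decoder accepts these bytes."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

lone_ff = is_valid_utf8(b'\xff')        # 0xFF never appears in UTF-8
overlong = is_valid_utf8(b'\xc0\x80')   # overlong encoding of U+0000
plain = is_valid_utf8(b'abc')           # ordinary ASCII is fine
print(lone_ff, overlong, plain)
```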


> In a perfect
> world it would automatically get converted to '\u00010001' without
> intervention.

I certainly hope not, because Unicode string != UTF-16. This is 
equivalent to saying:

When encoding the sequence of code points '\ud800\udc01' to UTF-8 bytes, 
you should get the same result as if you treated the sequence of code 
points as if it were bytes, decoded it using UTF-16, and then encoded 
using UTF-8.

That would be a horrible, horrible design, since it privileges UTF-16 in 
a completely inappropriate way. I *really* hope I am wrong, but I fear 
that is what the FAQ implies.
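
To make the behaviour I am objecting to concrete, here is a sketch of 
the round trip, assuming Python 3.4 or later (where 'surrogatepass' also 
works with the UTF-16 codecs). It is only meant to show that the two 
routes give different bytes, not to say which one the standard mandates.

```python
s = '\ud800\udc01'

# Route 1: pretend the code points are UTF-16 code units, decode them
# as a surrogate pair, then encode the result as UTF-8.
utf16_bytes = s.encode('utf-16-le', 'surrogatepass')
reinterpreted = utf16_bytes.decode('utf-16-le')   # one code point
via_utf16 = reinterpreted.encode('utf-8')

# Route 2: encode each surrogate code point on its own.
direct = s.encode('utf-8', 'surrogatepass')

print(hex(ord(reinterpreted)), via_utf16, direct)
```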



[1] Sequences of Unicode code points.


-- 
Steven


