Encoding of surrogate code points to UTF-8

Tue Oct 8 09:52:03 EDT 2013

I think this is a bug in Python's UTF-8 handling, but I'm not sure.

If I've read the Unicode FAQs correctly, you cannot encode *lone* 
surrogate code points into UTF-8:

http://www.unicode.org/faq/utf_bom.html#utf8-5

Sure enough, using Python 3.3:

py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in 
position 0: surrogates not allowed

But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs 
of surrogates. So if I find a code point that encodes to a surrogate pair 
in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'

and then use those same two 16-bit values (0xD800 and 0xDC01) as the 
code points of a string, I ought to be able to encode it to UTF-8, as if 
it were the same \N{LINEAR B SYLLABLE B038 E} code point. But I can't:

py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in 
position 0: surrogates not allowed
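
For comparison, here is a rough sketch of the UTF-16 arithmetic that 
turns the pair back into a single code point, which does encode to UTF-8 
without complaint. The helper name combine_surrogates is my own, made up 
for illustration; it is not something from the standard library.

```python
import struct

def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point.

    Hypothetical helper, written only to illustrate the arithmetic.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

c = '\N{LINEAR B SYLLABLE B038 E}'

# Read the UTF-16BE bytes back as two unsigned 16-bit code units.
high, low = struct.unpack('>2H', c.encode('utf-16be'))
print(hex(high), hex(low))              # 0xd800 0xdc01

# Recombine the two code units into the original code point.
code_point = combine_surrogates(high, low)
print(chr(code_point) == c)             # True

# The single supplementary code point encodes as four UTF-8 bytes.
print(chr(code_point).encode('utf-8'))  # b'\xf0\x90\x80\x81'
```

So the pair taken as UTF-16 code units stands for U+10001, whereas the 
string '\ud800\udc01' holds two separate code points.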

I think that Python is being too strict about rejecting surrogate code 
points. It should reject only lone surrogates and invalid pairs, not 
valid pairs. Have I misunderstood the Unicode FAQs, or is this a bug in 
Python's handling of UTF-8?

-- 
Steven