Encoding of surrogate code points to UTF-8

Tue Oct 8 13:00:58 EDT 2013

On 08/10/2013 16:23, Pete Forman wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>
>> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
> [snip]
>> py> s = '\ud800\udc01'
>> py> s.encode('utf-8')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
>> position 0: surrogates not allowed
>>
>>
>> Have I misunderstood? I think that Python is being too strict about
>> rejecting surrogate code points. It should only reject lone surrogates,
>> or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
>> or is this a bug in Python's handling of UTF-8?
>
> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
>
> D75 Surrogate pair: A representation for a single abstract character
>    that consists of a sequence of two 16-bit code units, where the first
>    value of the pair is a high-surrogate code unit and the second value
>    is a low-surrogate code unit.
>
> * Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode
>    Encoding Forms.)
>
> * Isolated surrogate code units have no interpretation on their own.
>    Certain other isolated code units in other encoding forms also have no
>    interpretation on their own. For example, the isolated byte [\x80] has
>    no interpretation in UTF-8; it can be used only as part of a multibyte
>    sequence. (See Table 3-7). It could be argued that this line by itself
>    should raise an error.
>
>
> That first bullet indicates that it is indeed illegal to use surrogate
> pairs in UTF-8 or UTF-32.
>
The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.
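
A quick sketch of what that means in practice, assuming Python 3.4 or later (I am relying on the 'surrogatepass' error handler working with the UTF-16 codecs, which it did not in 3.3). The variable names are only for illustration. On a build without narrow strings, '\ud800\udc01' is two lone surrogate code points rather than one character, so the UTF-8 codec rejects it. The character the pair would stand for in UTF-16 is U+10001, and that encodes without trouble:

```python
# Two surrogate code points: NOT a single character on Python 3.3+.
pair = '\ud800\udc01'
assert len(pair) == 2

# The UTF-8 codec refuses surrogate code points, paired or not.
try:
    pair.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True

# The supplementary character that the UTF-16 pair D800 DC01 represents.
astral = '\U00010001'
assert len(astral) == 1

utf8_bytes = astral.encode('utf-8')       # four-byte UTF-8 sequence
utf16_bytes = astral.encode('utf-16-be')  # the surrogate pair, as code units

# If you are handed a string that contains such a pair (say, from data
# produced on a narrow build), a UTF-16 round trip with 'surrogatepass'
# should combine the two halves into the real code point.
repaired = pair.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')

print(rejected, utf8_bytes, utf16_bytes, repaired == astral)
```

So the surrogates only ever show up in the UTF-16 bytes, never in the UTF-8 bytes and never in the str itself.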