Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Sun Mar 8 16:55:44 EDT 2015


On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Marko Rauhamaa wrote:
>
>> Chris Angelico <rosuav at gmail.com>:
>>
>>> Once again, you appear to be surprised that invalid data is failing.
>>> Why is this so strange? U+DD00 is not a valid character.
>
> But it is a valid non-character code point.
>
>>> It is quite correct to throw this error.
>>
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.

Ah, I see the confusion. Yes, it is plausible to permit the UTF-8-like
encoding of surrogates, but it's illegal according to RFC 3629:

https://tools.ietf.org/html/rfc3629
"""
   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.
"""

They're not valid characters, and the UTF-8 spec explicitly says that
they must not be encoded. Python is fully spec-compliant in rejecting
these. Some encoders [1] will permit them, but the resulting stream is
invalid UTF-8, just as CESU-8 and Modified UTF-8 are (the latter being
"CESU-8, only U+0000 is represented as C0 80").
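
To make that concrete, here is a rough sketch of how this looks from
Python 3. It assumes CPython's built-in "utf-8" codec and its
"surrogatepass" error handler; the helper names try_encode and
try_decode are made up for this illustration and are not part of any
library.

```python
def try_encode(text, errors="strict"):
    """Return the UTF-8 bytes for text, or None if the codec refuses."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


def try_decode(data, errors="strict"):
    """Return the decoded str, or None if the bytes are rejected."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError:
        return None


lone = "\udd00"  # a lone low surrogate: a legal str, but not a character

# The strict codec follows RFC 3629 and refuses to encode it.
print(try_encode(lone))                    # None

# Opting in to "surrogatepass" yields the UTF-8-like three-byte form...
lenient = try_encode(lone, "surrogatepass")
print(lenient)                             # b'\xed\xb4\x80'

# ...which a strict decoder rejects, because the stream is not valid UTF-8.
print(try_decode(lenient))                 # None

# The same opt-in is needed to read it back.
print(try_decode(lenient, "surrogatepass") == lone)  # True

# Modified UTF-8's C0 80 spelling of U+0000 is rejected too (overlong form).
print(try_decode(b"\xc0\x80"))             # None
```

So the lenient behaviour is available if you ask for it, but what
comes out the other end is not something a conforming UTF-8 decoder
will accept.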

ChrisA

[1] e.g. Pike's string_to_utf8, optionally:
http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/string_to_utf8.html


