Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Mon Mar 9 02:44:54 EDT 2015


On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Chris Angelico wrote:
>
>> As to the notion of rejecting the construction of strings containing
>> these invalid codepoints, I'm not sure. Are there any languages out
>> there that have a Unicode string type that requires that all
>> codepoints be valid (no surrogates, no U+FFFE, etc)?
>
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
>
> http://www.unicode.org/faq/private_use.html#nonchar8
>
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.

U+FFFE would cause problems at the beginning of a UTF-16 stream, as it
is the byte-swapped form of the BOM (U+FEFF) and could be mistaken for
one in the opposite byte order - that's why it's a noncharacter. But
sure, let's leave the noncharacters out of the discussion. The question
is whether surrogates are legal or not.
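
For anyone who wants to poke at this, here is a minimal sketch of how
Python itself treats the two cases. It assumes CPython 3.3 or later, and
the variable names are purely illustrative. It shows that a noncharacter
encodes like any other code point, that a lone surrogate can be built as
a str but is refused by the strict UTF-8 codec, and how U+FFFE at the
start of a big-endian UTF-16 stream gets taken for a little-endian BOM.

```python
import codecs

# A noncharacter such as U+FFFE is accepted in a str and encodes normally.
nonchar = "\ufffe"
nonchar_utf8 = nonchar.encode("utf-8")

# A lone surrogate can be constructed as a one-code-point str...
lone = "\ud800"

# ...but the strict UTF-8 codec refuses to encode it.
try:
    lone.encode("utf-8")
    surrogate_encodes = True
except UnicodeEncodeError:
    surrogate_encodes = False

# The BOM confusion: U+FFFE in big-endian UTF-16 is the same two bytes
# as the BOM (U+FEFF) in little-endian UTF-16.
fffe_be = nonchar.encode("utf-16-be")

# A big-endian stream that starts with U+FFFE followed by "A". The generic
# "utf-16" decoder consumes the first two bytes as a little-endian BOM and
# then reads the rest with the wrong byte order.
stream = fffe_be + "A".encode("utf-16-be")
misread = stream.decode("utf-16")

print(nonchar_utf8, surrogate_encodes, fffe_be == codecs.BOM_UTF16_LE, ascii(misread))
```

So in current Python the str type permits surrogates, and it is the
codecs (in their default strict mode) that reject them.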

ChrisA


