Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Mar 9 02:34:55 EDT 2015


Chris Angelico wrote:

> As to the notion of rejecting the construction of strings containing
> these invalid codepoints, I'm not sure. Are there any languages out
> there that have a Unicode string type that requires that all
> codepoints be valid (no surrogates, no U+FFFE, etc)?

U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
noncharacters in Unicode, and they are legal in strings.

http://www.unicode.org/faq/private_use.html#nonchar8
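
As a rough check of that count, here is a minimal sketch in Python 3. The helper names (noncharacters, encodes_to_utf8) are made up for this illustration and are not part of any library. It assumes the usual definition: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes.

```python
def noncharacters():
    """Return the code points Unicode designates as noncharacters."""
    # 32 contiguous noncharacters in the Basic Multilingual Plane.
    points = list(range(0xFDD0, 0xFDF0))
    # The last two code points of each of the 17 planes (34 in total).
    for plane in range(17):
        base = plane * 0x10000
        points.extend([base + 0xFFFE, base + 0xFFFF])
    return points


def encodes_to_utf8(code_point):
    """Return True if the one-character string encodes to UTF-8."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


nonchars = noncharacters()
print(len(nonchars))                                 # 66
print(all(encodes_to_utf8(cp) for cp in nonchars))   # True
```

So at least in Python 3, a str can hold any of them and the UTF-8 codec will encode them without complaint.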

I think the only illegal code points are surrogates. Surrogates should only
appear, in pairs, as 16-bit code units in UTF-16 encoded data, never as
characters in a string.
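
To illustrate, a small sketch in Python 3. The helpers utf16_code_units and can_encode are hypothetical names for this example only. Note that CPython itself will let you build a str containing a lone surrogate; it is the codecs that refuse it.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def can_encode(text, encoding="utf-8"):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# A character outside the BMP becomes a surrogate pair in UTF-16.
units = utf16_code_units("\U0001F600")
print([hex(u) for u in units])   # ['0xd83d', '0xde00']

# A lone surrogate code point in a string cannot be encoded.
print(can_encode("\ud800"))      # False
```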



-- 
Steven



