Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Mon Mar 9 08:28:36 EDT 2015


On Monday, March 9, 2015 at 12:05:05 PM UTC+5:30, Steven D'Aprano wrote:
> Chris Angelico wrote:
> 
> > As to the notion of rejecting the construction of strings containing
> > these invalid codepoints, I'm not sure. Are there any languages out
> > there that have a Unicode string type that requires that all
> > codepoints be valid (no surrogates, no U+FFFE, etc)?
> 
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
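
To make the quoted count concrete, here is a minimal Python 3 sketch. The helper name is_noncharacter is made up for illustration; the ranges are the usual definition (U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes). It also checks that Python keeps such code points in a str and round-trips them through UTF-8.

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is a Unicode noncharacter (hypothetical helper, not a stdlib function)."""
    # 32 code points in U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Scan the whole code space U+0000..U+10FFFF.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 32 + 2 * 17 = 66

# Noncharacters are allowed in a str and survive an encode/decode round trip.
s = "a\ufffeb\uffff"
round_tripped = s.encode("utf-8").decode("utf-8")
print(round_tripped == s)
```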

Interesting -- Thanks!
I wonder whether that's one more instance of the anti-pattern (other thread)?
Number that's not a number -- NaN
Pointer that points nowhere -- NULL
SQL data that's not there but there -- null

> 
> http://www.unicode.org/faq/private_use.html#nonchar8
> 
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.
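
A rough Python 3 sketch of what the quoted text means by surrogates, under my own assumptions: the helper names to_surrogate_pair and encodes_to_utf8 are invented for illustration, and U+1F600 is just an arbitrary code point above U+FFFF. UTF-16 represents such a code point as two 16-bit units taken from the range U+D800..U+DFFF; on their own those units are not characters, which is why a strict UTF-8 encoder refuses them.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000  # 20 bits, split into two 10-bit halves
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def encodes_to_utf8(s: str) -> bool:
    """True if s can be encoded with Python's strict UTF-8 codec."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Python's UTF-16 encoder emits the same two 16-bit units (big-endian here).
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
print("\U0001F600".encode("utf-16-be") == expected)

# A lone surrogate can sit in a Python str, but strict UTF-8 rejects it;
# a noncharacter such as U+FFFE encodes without complaint.
print(encodes_to_utf8("\ud800"), encodes_to_utf8("\ufffe"))
```

So in Python 3, at least, the str type itself does not reject lone surrogates; the error only shows up at encoding time.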

Even more interesting: So there's a whole hierarchy of illegality??
Could you suggest some good reference for 'surrogate'?


