Newbie question about text encoding

Sun Mar 8 14:25:02 EDT 2015

Marko Rauhamaa wrote:

> Chris Angelico <rosuav at gmail.com>:
> 
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. 

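But it is a valid code point: a surrogate, which is not a character in its
own right. (Strictly speaking, "noncharacter" is the Unicode term for a
different set of code points, such as U+FFFE.)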

>> It is quite correct to throw this error.
> 
> '\udd00' is a valid str object:

Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.
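
As a minimal sketch of what is at stake, assuming a current CPython 3: the
literal is accepted as a one-character str, the strict UTF-8 codec refuses
it, and the "surrogatepass" error handler forces it through anyway. The
variable names are mine, for illustration only.

```python
# Python accepts a lone surrogate in a string literal.
lone = "\udd00"

# The strict UTF-8 codec refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" handler emits the three-byte form regardless.
# The result is not valid UTF-8 as far as other software is concerned.
forced = lone.encode("utf-8", errors="surrogatepass")

print(len(lone), hex(ord(lone)), strict_ok, forced)
```

So the str object exists, but it only leaves the process as UTF-8 if you ask
for that explicitly.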

>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'

If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.
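
A quick sketch of the difference, using plain ASCII text rather than a lone
surrogate, since whether the UTF-16 codecs accept surrogates at all depends
on the Python version (as far as I know, newer releases reject them). The
variable names are illustrative.

```python
import codecs

text = "abc"

# The generic codec prepends a byte order mark in the platform's byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Naming the byte order explicitly leaves the BOM out.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(has_bom, little, big)
```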

> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.
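
For anyone following along, Python exposes that scheme as the
"surrogateescape" error handler. Here is a minimal sketch, assuming
CPython 3; the sample bytes and variable names are made up for illustration.
Bytes that are not valid UTF-8 come back as code points in U+DC80..U+DCFF
and round-trip to the original bytes. One side effect is that the resulting
str can no longer be encoded with the strict codec.

```python
# Latin-1 style bytes: 0xE9 and 0xFF are not valid UTF-8 here.
raw = b"caf\xe9 \xff"

# Undecodable bytes are mapped to lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
back = text.encode("utf-8", errors="surrogateescape")

# Side effect: the strict codec rejects the escaped string.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), back == raw, strict_ok)
```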

-- 
Steven