Newbie question about text encoding

Sun Mar 8 04:23:37 EDT 2015

On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character. It is quite
>> correct to throw this error.
>
> '\udd00' is a valid str object:
>
>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'
>
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.

But '\udd00' is not a valid Unicode string (a lone surrogate is not a
scalar value), so a Unicode encoding can't be expected to cope with
it. Mathematically, 0xC0 0x80 would represent U+0000, and some codecs
(Java's "modified UTF-8", for one) generate and accept this in order
to allow U+0000 without ever yielding 0x00, but that doesn't mean that
UTF-8 itself should allow that byte sequence.
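
For what it's worth, here is a quick sketch of how CPython 3's strict
UTF-8 codec treats both cases. The variable names are just mine for
illustration; the behaviour shown is that of the default 'strict'
error handler.

```python
# A lone surrogate is not a scalar value, so strict UTF-8 won't encode it.
try:
    '\udd00'.encode('utf-8')
    surrogate_encoded = True
except UnicodeEncodeError:
    surrogate_encoded = False

# The overlong two-byte form C0 80 is rejected on decode as well.
try:
    b'\xc0\x80'.decode('utf-8')
    overlong_decoded = True
except UnicodeDecodeError:
    overlong_decoded = False

# U+0000 has exactly one valid UTF-8 encoding: a single zero byte.
nul_bytes = '\x00'.encode('utf-8')

print(surrogate_encoded, overlong_decoded, nul_bytes)
```

Both failures are the codec doing its job: neither input is valid, so
neither gets a mapping.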

The only reason to craft some kind of Unicode string for any arbitrary
sequence of bytes is the "smuggling" effect used for file name
handling (PEP 383's surrogateescape error handler). Beyond that, there
is no reason to support invalid Unicode codepoints.
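
A minimal sketch of that smuggling, assuming a hypothetical file name
that arrived as Latin-1 bytes on a system expecting UTF-8. The name
and variables are made up; 'surrogateescape' is the standard error
handler.

```python
# Hypothetical file name: Latin-1 bytes, which are not valid UTF-8.
raw = b'caf\xe9.txt'

# Each undecodable byte 0xNN is smuggled in as the lone surrogate U+DCNN.
name = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler turns the surrogate back into the byte.
restored = name.encode('utf-8', errors='surrogateescape')

# The smuggled string is still invalid as far as strict UTF-8 goes.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(name), restored == raw, strict_ok)
```

The round trip only works because both directions use the same error
handler; the string in between is a carrier for bytes, not valid
Unicode text.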

ChrisA