Newbie question about text encoding

Sun Mar 8 04:09:50 EDT 2015

Chris Angelico <rosuav at gmail.com>:

> Once again, you appear to be surprised that invalid data is failing.
> Why is this so strange? U+DD00 is not a valid character. It is quite
> correct to throw this error.

'\udd00' is a valid str object:

   >>> '\udd00'
   '\udd00'
   >>> '\udd00'.encode('utf-32')
   b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
   >>> '\udd00'.encode('utf-16')
   b'\xff\xfe\x00\xdd'

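I was simply stating that UTF-8 is not a bijection between Unicode
strings and octet strings (even setting Python aside): a lone
surrogate such as U+DD00 has no UTF-8 encoding, and not every octet
string is valid UTF-8. Enriching Unicode with 128 surrogates
(U+DC80..U+DCFF), one per undecodable byte 0x80..0xFF, establishes a
bijection, but not without side effects.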
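
For illustration, here is a minimal sketch of that enrichment as Python
exposes it through the errors='surrogateescape' handler (PEP 383). The
sample bytes and the variable names are made up for the example; it
only shows the round trip and one limitation, not a full account of
the side effects.

```python
# Sketch only: `raw` is an invented sample that is NOT valid UTF-8.
raw = b'caf\xe9 \xff'

# Each undecodable byte 0xNN is smuggled in as the lone surrogate U+DCNN.
text = raw.decode('utf-8', errors='surrogateescape')
escaped = [hex(ord(c)) for c in text if '\udc80' <= c <= '\udcff']

# Encoding with the same handler restores the original octets.
back = text.encode('utf-8', errors='surrogateescape')

# Strict UTF-8 still refuses a lone surrogate such as U+DD00.
try:
    '\udd00'.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# U+DD00 lies outside U+DC80..U+DCFF, so the handler does not help here.
try:
    '\udd00'.encode('utf-8', errors='surrogateescape')
    escape_ok = True
except UnicodeEncodeError:
    escape_ok = False

print(escaped, back == raw, strict_ok, escape_ok)
```

So every octet string gets a str counterpart and survives the round
trip, while a str holding a surrogate outside the escape range still
cannot be written out as UTF-8.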

Marko