Newbie question about text encoding
Marko Rauhamaa
marko at pacujo.net
Sun Mar 8 04:09:50 EDT 2015
Chris Angelico <rosuav at gmail.com>:
> Once again, you appear to be surprised that invalid data is failing.
> Why is this so strange? U+DD00 is not a valid character. It is quite
> correct to throw this error.
'\udd00' is a valid str object:
>>> '\udd00'
'\udd00'
>>> '\udd00'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>>> '\udd00'.encode('utf-16')
b'\xff\xfe\x00\xdd'
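A note for anyone replaying the session above: it presumably comes from Python 3.3 or earlier. From Python 3.4 on, the utf-16 and utf-32 encoders reject lone surrogates in strict mode, just as utf-8 does, and the 'surrogatepass' error handler is needed to get bytes out. The sketch below assumes Python 3.4 or later; the helper name encode_lone is made up for illustration.

```python
# A lone surrogate is still a perfectly valid str object.
LONE = "\udd00"


def encode_lone(codec, errors="strict"):
    """Encode the lone surrogate; return None if the codec refuses it."""
    try:
        return LONE.encode(codec, errors)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        # Strict mode versus the 'surrogatepass' error handler.
        print(codec, encode_lone(codec), encode_lone(codec, "surrogatepass"))
```

The -le codec variants are used only to keep the byte order mark out of the output.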
I was simply stating that UTF-8 is not a bijection between Unicode
strings and octet strings (even leaving Python aside). Enriching Unicode
with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
without side effects.
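To make the point concrete, here is a minimal sketch using Python's 'surrogateescape' error handler (PEP 383), which maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF. The helper names are invented for this example, and the "side effect" shown at the end is only one possible reading of the remark above, not necessarily the intended one.

```python
def strict_decode_fails(data):
    """True if the bytes are not valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def strict_encode_fails(text):
    """True if the str cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def roundtrips_bytes(data):
    """bytes -> str -> bytes through surrogateescape returns the input."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape") == data


def roundtrips_str(text):
    """str -> bytes -> str through surrogateescape returns the input."""
    data = text.encode("utf-8", errors="surrogateescape")
    return data.decode("utf-8", errors="surrogateescape") == text


if __name__ == "__main__":
    # Plain UTF-8 is partial in both directions.
    print(strict_decode_fails(b"\x80"))      # not every octet string decodes
    print(strict_encode_fails("\udd00"))     # not every str encodes

    # With the 128 escape surrogates, every octet string survives a round trip.
    print(roundtrips_bytes(bytes(range(256))))

    # Possible side effect: two escaped surrogates that happen to spell a
    # valid UTF-8 sequence (0xC3 0xA9) come back as a different string.
    print(roundtrips_str("\udcc3\udca9"))
```

So every octet string round-trips, but the reverse direction does not hold for every str: the two escaped surrogates in the last case come back as the single character U+00E9, and surrogates outside U+DC80..U+DCFF, such as '\udd00', are still refused by the encoder.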
Marko
More information about the Python-list
mailing list