Newbie question about text encoding
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Mar 8 14:25:02 EDT 2015
Marko Rauhamaa wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character.
But it is a valid code point, even if it is a surrogate rather than a character.
>> It is quite correct to throw this error.
>
> '\udd00' is a valid str object:
Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.
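
To make the question concrete, here is a minimal sketch, assuming Python 3. It shows that a str holding a lone surrogate can be built without complaint, and that the strict UTF-8 codec then refuses it. The helper name can_encode_utf8 is my own, purely for illustration.

```python
import unicodedata

lone = '\udd00'  # a lone low surrogate; Python accepts it in a str literal

# The code point has general category "Cs" (surrogate); it is not a character.
category = unicodedata.category(lone)

def can_encode_utf8(s):
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True

print(len(lone), category, can_encode_utf8(lone))  # 1 Cs False
```

So the string exists and has length 1, but it cannot leave the process as strict UTF-8.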
> >>> '\udd00'
> '\udd00'
> >>> '\udd00'.encode('utf-32')
> b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
> >>> '\udd00'.encode('utf-16')
> b'\xff\xfe\x00\xdd'
If you explicitly specify the endianness (say, utf-16-be or utf-16-le), then
you don't get the BOM.
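
A short sketch of the difference, assuming Python 3. I use a plain ASCII letter rather than '\udd00' so the result does not depend on how a particular Python version treats lone surrogates in the UTF-16 codec.

```python
import codecs

text = 'a'

with_bom = text.encode('utf-16')    # BOM first, then native byte order
little = text.encode('utf-16-le')   # explicit little-endian, no BOM
big = text.encode('utf-16-be')      # explicit big-endian, no BOM

# The generic codec starts with one of the two UTF-16 byte order marks.
starts_with_bom = with_bom.startswith(
    (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

print(starts_with_bom, little, big)  # True b'a\x00' b'\x00a'
```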
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.
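
For what it's worth, the scheme Marko describes looks like what Python exposes as the surrogateescape error handler. Here is a rough sketch, assuming Python 3; the names roundtrip, raw and sneaky are mine, and the "side effect" shown is only one that I can think of, not necessarily the one he had in mind.

```python
def roundtrip(data):
    """Decode bytes to str and encode back, escaping undecodable bytes."""
    text = data.decode('utf-8', errors='surrogateescape')
    return text.encode('utf-8', errors='surrogateescape')

raw = b'caf\xe9 \xff\xfe'   # Latin-1 debris, not valid UTF-8
escaped = raw.decode('utf-8', errors='surrogateescape')

# Every byte string survives bytes -> str -> bytes, valid UTF-8 or not.
all_bytes = bytes(range(256))
bytes_ok = roundtrip(raw) == raw and roundtrip(all_bytes) == all_bytes

# One side effect: two escaped surrogates that happen to spell valid UTF-8
# come back as an ordinary character, so str -> bytes -> str is not identity.
sneaky = '\udcc3\udca9'
came_back = sneaky.encode('utf-8', errors='surrogateescape').decode(
    'utf-8', errors='surrogateescape')

print(ascii(escaped), bytes_ok, ascii(came_back), came_back == sneaky)
```

Note that this only covers U+DC80..U+DCFF. A surrogate outside that range, such as '\udd00', still raises UnicodeEncodeError even with surrogateescape.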
--
Steven