UTF16 codec doesn't round-trip?

John Perks and Sarah Mount johnandsarah at estragon.freeserve.co.uk
Sat May 28 06:48:07 EDT 2005


(My Python uses UTF16 natively; can someone with UTF32 Python let me
know if that behaves differently?)

>>> import codecs
>>> u'\ud800' # part of surrogate pair
u'\ud800'
>>> codecs.utf_16_be_encode(_)[0]
'\xd8\x00'
>>> codecs.utf_16_be_decode(_)[0]
Traceback (most recent call last):
  File "<input>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1:
unexpected end of data

If the byte string can't be recognized as UTF16, then surely the codec
shouldn't have allowed it to be encoded in the first place? I could
understand it if the codec were trying to decode the bytes into
(native) UTF32.
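
To make the expectation concrete, here is a rough sketch of the check I
have in mind. The name round_trips is made up for illustration, and it
assumes the 'utf-16-be' codec name reaches the same codec as the
codecs.utf_16_be_* functions above. It only reports whether encoding and
then decoding gives back the original string; an error at either step
counts as a failure.

```python
def round_trips(text, encoding='utf-16-be'):
    """Return True if text survives encode followed by decode unchanged.

    round_trips is a hypothetical helper, not part of the codecs module.
    """
    try:
        # Either step may reject a lone surrogate, depending on the
        # interpreter; both failures derive from UnicodeError.
        encoded = text.encode(encoding)
        decoded = encoded.decode(encoding)
    except UnicodeError:
        return False
    return decoded == text


for sample in (u'abc', u'\ud800'):
    print(repr(sample), round_trips(sample))
```

A codec that refuses the lone surrogate when encoding and one that
refuses it when decoding both make this return False; the complaint
above is only about which of the two steps does the refusing.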

On a similar note, if you are using UTF32 natively, are you allowed to
have raw surrogate escape sequences (paired or otherwise) in unicode
literals?
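
In case it helps anyone answer, this is a small sketch for reporting
which kind of build you have and how a paired surrogate literal is
stored. The function names are made up for illustration. It relies only
on sys.maxunicode and len(), and I don't know what the second function
gives on a UTF32 build, which is the question.

```python
import sys


def native_width():
    # sys.maxunicode is 0xFFFF when the interpreter stores UTF16 code
    # units, and 0x10FFFF when it can hold any code point directly.
    if sys.maxunicode > 0xFFFF:
        return 'UTF32'
    return 'UTF16'


def paired_literal_length():
    # Number of items stored for a literal written as two surrogate
    # escapes: 2 if they are kept separate, 1 if they are combined.
    return len(u'\ud800\udc00')


print(native_width(), paired_literal_length())
```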

Thanks

John




