Encoding of surrogate code points to UTF-8

wxjmfauth at gmail.com wxjmfauth at gmail.com
Tue Oct 8 14:24:58 EDT 2013


--------

>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: 
surrogates not allowed
>>> '\ud800'.encode('utf-32-be')
b'\x00\x00\xd8\x00'
>>> '\ud800'.encode('utf-32-le')
b'\x00\xd8\x00\x00'
>>> '\ud800'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xd8\x00\x00'


jmf



More information about the Python-list mailing list