[Python-Dev] PEP 383 update: utf8b is now the error handler

"Martin v. Löwis" martin at v.loewis.de
Thu May 7 01:16:18 CEST 2009


> I qualify with a). I believe I understand c) but, as explained in my
> other post, I do not think your reason applies.  In fact, I think
> concern for naming rights might suggest that you *not* reuse the name
> for something different.  I would have to learn more about the existing
> 'surrogates' handler to judge Antione's suggestion 'surrogates-pass'.
> 'Surrogates-escape' is pretty good for the new handler since, to my
> understanding, it 'escapes' 'bad bytes' by prefixing them with bits that
> push them to the surrogates plane.

See issue 3672. In essence, in python 2.5:

py> u"\ud800".encode("utf-8")
'\xed\xa0\x80'
py> '\xed\xa0\x80'.decode("utf-8")
u'\ud800'

In 3.1,

py> "\ud800".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
py> "\ud800".encode("utf-8","surrogates")
b'\xed\xa0\x80'
py> b'\xed\xa0\x80'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
illegal encoding
py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
'\ud800'

Regards,
Martin


More information about the Python-Dev mailing list