Why are some unicode error handlers "encode only"?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Mar 11 10:37:54 EDT 2012


At least two standard error handlers are documented as working for 
encoding only:

xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this? I don't see why they shouldn't work for decoding as well. 
Consider this example using Python 3.2:

>>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10: illegal multibyte sequence

The two bytes b'\xe9!' form an illegal multibyte sequence in CP-932 (also 
known as MS-KANJI, Microsoft's variant of SHIFT-JIS). Is there some reason 
why something like the following shouldn't or can't be supported?

# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=> r'aaa--騷--\xe9\x21--bbb'

and similarly for xmlcharrefreplace.
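
For what it's worth, something close to this can be approximated today 
with codecs.register_error(). The following is only a rough sketch: the 
handler name "backslashreplace_decode" is made up for illustration and 
is not a standard handler, and exactly which bytes get escaped depends 
on the range the codec reports in the exception.

```python
import codecs

def backslashreplace_decode(exc):
    # Hypothetical handler name; not part of the standard library.
    # Only deal with decoding errors; re-raise anything else.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # The codec reports the offending byte range as exc.start:exc.end.
    bad = exc.object[exc.start:exc.end]
    # Escape each undecodable byte as \xNN.
    replacement = "".join("\\x%02x" % b for b in bad)
    # Return the replacement text and the position to resume decoding at.
    return replacement, exc.end

codecs.register_error("backslashreplace_decode", backslashreplace_decode)

data = b"aaa--\xe9z--\xe9!--bbb"
text = data.decode("cp932", "backslashreplace_decode")
print(text)
```

The handler returns a (replacement, resume_position) tuple, which is 
what the codec machinery expects from an error callback. A decoding 
analogue of xmlcharrefreplace is less obvious, since there is no code 
point to refer to, only raw bytes.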



-- 
Steven


