Why are some unicode error handlers "encode only"?

Walter Dörwald walter at livinglogic.de
Sun Mar 11 12:10:12 EDT 2012


On 11.03.12 15:37, Steven D'Aprano wrote:

> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers. 
However, the byte sequence b'&#12345;' does *not* contain any bytes that 
the ASCII codec (for example) cannot decode, so there are no errors to 
handle.
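
To illustrate the asymmetry, here is a minimal sketch. The character
U+3039 (decimal 12345) is an arbitrary choice for the example; any
character outside ASCII would do.

```python
# One character that the ASCII codec cannot encode (U+3039 == 12345).
text = "\u3039"

# Encoding hits an error, so the error handler gets to substitute something.
as_xml = text.encode("ascii", "xmlcharrefreplace")
as_backslash = text.encode("ascii", "backslashreplace")
print(as_xml)        # b'&#12345;'
print(as_backslash)  # b'\\u3039'

# Decoding the result is pure ASCII: no error occurs, so no handler runs
# and the character reference comes back as literal text.
decoded = as_xml.decode("ascii")
print(repr(decoded))  # '&#12345;'
```

The decode step returns the eight characters of the reference, not the
original character, because nothing in those bytes is an error.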

> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' form an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?

However, the byte sequence b'\xe9!' is not something that the 
backslashreplace error handler would have produced. It would have 
produced b'\\xe9!' (a sequence of 5 bytes), which would probably decode 
without any problems with the cp932 codec.

> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> =>  r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.

This would require a postprocess step *after* the bytes have been 
decoded. This is IMHO out of scope for Python's codec machinery.
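
For anyone who wants escaped output while decoding anyway, one option 
outside the standard handlers is to register your own handler with 
codecs.register_error. The following is only a sketch: the handler name 
"escape_undecodable" and the \xNN output format are choices made for 
this example, not something the codec machinery provides.

```python
import codecs


def escape_undecodable(exc):
    """Replace each undecodable byte with a literal \\xNN escape."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only handles decoding errors; re-raise anything else.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("\\x%02x" % byte for byte in bad)
    # Return the replacement text and the position where decoding resumes.
    return replacement, exc.end


# Hypothetical handler name, chosen for this example.
codecs.register_error("escape_undecodable", escape_undecodable)

result = b"abc\xffdef".decode("ascii", "escape_undecodable")
print(result)  # abc\xffdef
```

How many bytes a codec reports as a single error differs between codecs 
and Python versions, so the exact output for a multibyte codec such as 
cp932 may vary.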

Servus,
    Walter



