codecs.register_error for "strict", unicode.encode() and str.decode()

Fri Jul 27 03:16:48 EDT 2012

Alan Franzoni wrote:

> Hello,
> I think I'm missing some piece here.
> 
> I'm trying to register a default error handler for handling exceptions
> for preventing encoding/decoding errors (I know how this works and that
> making this global is probably not a good practice, but I found this
> strange behaviour while writing a proof of concept of how to let Python
> work in a more forgiving way).
> 
> What I discovered is that register_error() for "strict" seems to work in
> the way I expect for string decoding, not for unicode encoding.
> 
> That's what happens on Mac, Python 2.7.1 from Apple:
> 
> melquiades:tmp alan$ cat minimal_test_encode.py
> # -*- coding: utf-8 -*-
> 
> import codecs
> 
> def handle_encode(e):
>     return ("ASD", e.end)
> 
> codecs.register_error("strict", handle_encode)
> 
> print u"à".encode("ascii")
> 
> melquiades:tmp alan$ python minimal_test_encode.py
> Traceback (most recent call last):
>   File "minimal_test_encode.py", line 10, in <module>
>     u"à".encode("ascii")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in
> position 0: ordinal not in range(128)
> 
> 
> OTOH this works properly:
> 
> melquiades:tmp alan$ cat minimal_test_decode.py
> # -*- coding: utf-8 -*-
> 
> import codecs
> 
> def handle_decode(e):
>     return (u"ASD", e.end)
> 
> codecs.register_error("strict", handle_decode)
> 
> print "à".decode("ascii")
> 
> melquiades:tmp alan$ python minimal_test_decode.py
> ASDASD
> 
> 
> What piece am I missing? The doc at
> http://docs.python.org/library/codecs.html says "For
> encoding /error_handler/ will be called with a UnicodeEncodeError
> <http://docs.python.org/library/exceptions.html#exceptions.UnicodeEncodeError>
> instance, which contains information about the location of the error.", is
> there any reason why the standard "strict" handler cannot be replaced?

The error handling for the standard error handlers "strict", "replace",
"ignore", and "xmlcharrefreplace" is hardwired, see function
unicode_encode_ucs1 in Objects/unicodeobject.c:

            if (known_errorHandler==-1) {
                if ((errors==NULL) || (!strcmp(errors, "strict")))
                    known_errorHandler = 1;
...
            switch (known_errorHandler) {
            case 1: /* strict */
                raise_encode_exception(&exc, encoding, unicode,
                                       collstart, collend, reason);
                goto onError;

You need another gun to shoot yourself in the foot ;)
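
Since the check above only compares the handler name against the built-in
ones, a handler registered under a different name should not hit the hardwired
branch. A minimal sketch, assuming you are willing to pass the name explicitly
on every call; "forgiving" is a made-up name, not something the codecs module
defines:

```python
import codecs

def handle_encode(e):
    # Replace the unencodable span with a marker and resume right after it.
    # The replacement is a unicode string, which the codec then encodes.
    return (u"ASD", e.end)

# "forgiving" is a hypothetical name chosen for this example; the point is
# only that it is not one of the names hardwired in the C code.
codecs.register_error("forgiving", handle_encode)

result = u"\xe0".encode("ascii", "forgiving")
print(result)
```

This does not make the behaviour global, of course: every encode() call has to
ask for the handler by name.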