Customizing character set conversions with an error handler

Sun Mar 12 16:33:49 EST 2006

Jukka Aho wrote:
> When converting Unicode strings to legacy character encodings, it is
> possible to register a custom error handler that will catch and process
> all code points that do not have a direct equivalent in the target
> encoding (as described in PEP 293).
>
> The thing to note here is that the error handler itself is required to
> return the substitutions as Unicode strings - not as the target encoding
> bytestrings. Some lower-level gadgetry will silently convert these
> strings to the target encoding.
>
> That is, if the substitution _itself_ doesn't contain illegal code
> points for the target encoding.
>
> Which brings us to the point: if my error handler for some reason
> returns illegal substitutions (from the viewpoint of the target
> encoding), how can I catch _these_ errors and make things good again?
>
> I thought it would work automatically, by calling the error handler as
> many times as necessary, and letting it work out the situation, but it
> apparently doesn't. Sample code follows:
>
>
>     # So the question becomes: how can I make this work
>     # in a graceful manner?
>

Replace the return statement with the following, which runs the
substitution through the same error handler before handing it back:

    return (substitution.encode(error.encoding, "practical").decode(
            error.encoding), error.start + 1)
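
For context, here is a minimal sketch of how that return statement might sit
inside a complete handler. It is written for Python 3 and is not the original
sample code: the SUBSTITUTIONS table and its entries are hypothetical, and the
handler is assumed to be registered under the name "practical". The example
targets ASCII, where error.encoding is "ascii"; other codecs may report a
different name there, so treat the round-trip as something to verify for your
own target encoding.

```python
import codecs

# Hypothetical substitution table (not from the original post). The second
# entry is the problem case: its substitution is itself not valid ASCII.
SUBSTITUTIONS = {
    "\u2026": "...",     # HORIZONTAL ELLIPSIS -> three dots
    "\u201c": "\u00ab",  # LEFT DOUBLE QUOTATION MARK -> « (not ASCII)
    "\u00ab": "<<",      # « -> plain ASCII fallback
}


def practical(error):
    """Error handler whose substitutions are themselves re-checked."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    char = error.object[error.start]
    substitution = SUBSTITUTIONS.get(char, "?")
    # Encode the substitution with this same handler, then decode it back.
    # Any code point in the substitution that the target encoding cannot
    # represent triggers a nested call to practical(), so the string
    # returned here contains only encodable characters.
    # Caveat: a table entry that maps a character to itself (or a cycle
    # of entries) would recurse without end.
    cleaned = substitution.encode(error.encoding, "practical").decode(
        error.encoding)
    return (cleaned, error.start + 1)


codecs.register_error("practical", practical)

result = "\u201cHi\u2026".encode("ascii", "practical")
print(result)
```

Here the left quotation mark is first mapped to «, which ASCII cannot
represent either; the nested encode call maps it on to "<<" before the outer
handler returns.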

  -- Serge