Customizing character set conversions with an error handler

Jukka Aho jukka.aho at iki.fi
Tue Mar 14 09:36:49 EST 2006


Serge Orlov wrote:

>>     # So the question becomes: how can I make this work
>>     # in a graceful manner?

> change the return statement with this code:
>
> return (substitution.encode(error.encoding,"practical").decode(
>        error.encoding), error.start+1)

Thanks, that was quite a neat recursive solution. :) I wouldn't have 
thought of that.
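
For reference, here is a minimal sketch of how the recursive variant might 
look as a complete handler in present-day Python 3. The table contents, the 
name TABLE, the function name, and the fallback format are assumptions made 
for illustration; only the return expression comes from the suggestion 
quoted above.

```python
import codecs

# Hypothetical substitution chain (an assumption, not the real table):
# bullet -> middle dot (exists in Latin-1) -> asterisk (plain ASCII).
TABLE = {
    "\u2022": "\u00b7",
    "\u00b7": "*",
}


def practical(error):
    """Substitute one unencodable code point, recursing via the codec."""
    if not isinstance(error, UnicodeEncodeError):
        raise error

    original = error.object[error.start]
    substitution = TABLE.get(original)

    if substitution is None:
        # No mapping at all: fall back to a hexadecimal representation.
        return ("[U+%04x]" % ord(original), error.start + 1)

    # Encoding the substitute with the same handler re-enters this
    # function whenever the substitute itself is not encodable, so the
    # chain is followed until something fits the target encoding.
    encoded = substitution.encode(error.encoding, "practical")
    return (encoded.decode(error.encoding), error.start + 1)


codecs.register_error("practical", practical)
```

With this table, encoding a bullet to ascii follows the chain down to the 
asterisk, while encoding it to latin-1 stops at the middle dot.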

I ended up doing it without the recursion, by testing the individual 
problematic code points with .encode() within the handler, and catching 
the possible exceptions:

--- 8< ---

    # This is our original problematic code point:
    c = error.object[error.start]

    while True:

        # Search for a substitute code point in
        # our table:

        c = table.get(c)

        # If a substitute wasn't found, convert the original code
        # point into a hexadecimal string representation of itself
        # and exit the loop.

        if c is None:
            c = u"[U+%04x]" % ord(error.object[error.start])
            break

        # A substitute was found, but we're not sure if it is OK
        # for our target encoding. Let's check:

        try:
            c.encode(error.encoding,'strict')
            # No exception; everything was OK, we
            # can break off from the loop now
            break

        except UnicodeEncodeError:
            # The mapping that was found in the table was not
            # OK for the target encoding. Let's loop and try
            # again; there might be a better (more generic)
            # substitution in the chain waiting for us.
            pass

--- 8< ---
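
To show the loop in context, here is a minimal, self-contained sketch of a 
complete handler built around it, written for present-day Python 3. The 
table contents, the name TABLE, the function name, and the registered name 
"practical_loop" are assumptions for illustration and do not come from the 
original code.

```python
import codecs

# Hypothetical substitution chain (an assumption, not the real table):
# bullet -> middle dot (exists in Latin-1) -> asterisk (plain ASCII).
TABLE = {
    "\u2022": "\u00b7",
    "\u00b7": "*",
}


def practical_loop(error):
    """Walk the substitution chain until a candidate is encodable."""
    if not isinstance(error, UnicodeEncodeError):
        raise error

    # The original problematic code point.
    original = error.object[error.start]
    c = original

    while True:
        # Look up the next substitute in the chain.
        c = TABLE.get(c)

        if c is None:
            # Chain exhausted: use a hexadecimal representation of
            # the original code point instead.
            c = "[U+%04x]" % ord(original)
            break

        try:
            # Check the candidate against the target encoding.
            c.encode(error.encoding, "strict")
            break
        except UnicodeEncodeError:
            # Not encodable either; try the next entry in the chain.
            pass

    # Resume encoding right after the code point that was replaced.
    return (c, error.start + 1)


codecs.register_error("practical_loop", practical_loop)
```

Only the ascii and latin-1 codecs are exercised with this sketch. Note that 
a table containing a cycle of unencodable entries would make the loop run 
forever, so the chains have to terminate.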

-- 
znark



