Capturing the bad codes that raise UnicodeError exceptions during decoding

Chris Angelico rosuav at gmail.com
Thu Aug 4 15:00:31 EDT 2016


On Fri, Aug 5, 2016 at 4:47 AM, Malcolm Greene <python at bdurham.com> wrote:
> I'm processing a lot of dirty CSV files and would like to track the bad
> codes that are raising UnicodeErrors. I'm struggling to figure out
> what the exact codes are so I can track them, then remove them, and then
> repeat the decoding process for the current line until the line has been
> fully decoded so I can pass this line on to the CSV reader. At a high
> level it seems that I need to wrap the decoding of a line until it
> passes without any errors. Any suggestions appreciated.
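
The capture/remove/retry loop described above maps directly onto the
object, start, and end attributes of UnicodeDecodeError. A rough
sketch; the function name and signature are illustrative, not from
this thread:

def decode_and_track(raw, bad_bytes, encoding="utf-8"):
    # Decode a bytes line, collecting each undecodable run of bytes
    # into bad_bytes before excising it and retrying.
    while True:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError as exc:
            bad_bytes.append(exc.object[exc.start:exc.end])
            raw = exc.object[:exc.start] + exc.object[exc.end:]

Each failed decode removes at least one byte, so the loop always
terminates.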

Remove them? Not sure what you mean, exactly; but would an
errors="backslashreplace" decode do the job? Something like (assuming
you use Python 3):

import csv

def read_dirty_file(fn):
    # backslashreplace keeps undecodable bytes visible instead of raising
    with open(fn, encoding="utf-8", errors="backslashreplace") as f:
        for row in csv.DictReader(f):
            process(row)   # process() stands in for your per-row handling

You'll get Unicode text, but any bytes that don't make sense in UTF-8
will be represented as escape sequences, e.g. \x80 with an actual
backslash. Or use errors="replace" to hide them all behind U+FFFD, or
pick any other error handler. Either way, the error handling gets done
at a higher level than the CSV reader, as you suggest.
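
And if the goal is specifically to track which codes were bad, a
custom error handler can record them while still replacing them in
the decoded text. A sketch; the handler name "log_and_replace" is
made up, not part of the stdlib:

import codecs

bad_codes = []   # raw byte runs that failed to decode

def log_and_replace(exc):
    # Record the offending bytes, substitute U+FFFD, and resume
    # decoding just past them.
    bad_codes.append(exc.object[exc.start:exc.end])
    return ("\ufffd", exc.end)

codecs.register_error("log_and_replace", log_and_replace)

Pass errors="log_and_replace" to open() and the file decodes cleanly
while bad_codes fills up with the exact bytes that were rejected.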

ChrisA
