Capturing the bad codes that raise UnicodeError exceptions during decoding

Chris Angelico rosuav at gmail.com
Thu Aug 4 16:00:33 EDT 2016


On Fri, Aug 5, 2016 at 5:22 AM, Malcolm Greene <python at bdurham.com> wrote:
> Thanks for your suggestions. I would like to capture the specific bad
> codes *before* they get replaced. So if a line of text has 10 bad codes
> (each one raising UnicodeError), I would like to track each exception's
> bad code but still return a valid decoded line when finished.
>

Interesting. Sounds to me like the simplest option is to open the file
in binary mode, split it on b"\n", and decode line by line before
handing it to the csv module. The csv.reader "csvfile" argument doesn't
actually have to be a file object: it can be any iterable that yields
lines. So you can put a generator in between, like this:

import csv

def decode(binary):
    for line in binary:
        try:
            yield line.decode("utf-8")
        except UnicodeDecodeError as e:
            log_stats(e)  # e.object[e.start:e.end] is the bad byte run
            # Still yield a decoded line so the CSV rows stay aligned
            yield line.decode("utf-8", errors="replace")

def read_dirty_file(fn):
    with open(fn, "rb") as f:
        for row in csv.DictReader(decode(f)):
            process(row)
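
One caveat: bytes.decode raises on the first invalid sequence it meets,
so the try/except above records only one bad run per line. To capture
all ten bad codes in your example, you could register a custom error
handler with codecs.register_error. A rough sketch (the handler name
"log_replace" and the bad_bytes list are illustrative, not from any
library):

import codecs

bad_bytes = []  # collects every offending byte run

def log_and_replace(exc):
    # exc.object is the bytes being decoded; exc.start:exc.end is the
    # invalid sequence that triggered the error
    bad_bytes.append(exc.object[exc.start:exc.end])
    return ("\ufffd", exc.end)  # substitute U+FFFD, resume after the run

codecs.register_error("log_replace", log_and_replace)

With that in place, decode() shrinks to a single
yield line.decode("utf-8", errors="log_replace"), and every bad code on
the line gets logged while the line still comes through decoded.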

Or what Random said, which is also viable.

ChrisA


