Capturing the bad codes that raise UnicodeError exceptions during decoding

Michael Selik michael.selik at gmail.com
Thu Aug 4 16:11:05 EDT 2016


On Thu, Aug 4, 2016 at 3:24 PM Malcolm Greene <python at bdurham.com> wrote:

> Hi Chris,
>
> Thanks for your suggestions. I would like to capture the specific bad
> codes *before* they get replaced. So if a line of text has 10 bad codes
> (each one raising UnicodeError), I would like to track each exception's
> bad code but still return a valid decoded line when finished.
>
> My goal is to count the total number of UnicodeError exceptions within a
> file (as a data quality metric) and track the frequency of specific bad
> codes (via a collections.Counter dict) to see if there's a pattern that
> can be traced to a bad upstream process.
>

Give this a shot (below). It seems to do what you want: the helper decodes
up to each bad byte, counts it, and recurses on the rest of the line, so
you get back clean text plus a Counter of the offending bytes.


import csv
from collections import Counter
from io import BytesIO

def _cleanline(line, counts=Counter()):
    # decode up to each undecodable byte sequence, tally those bytes
    # in ``counts``, then recurse on the remainder of the line
    try:
        return line.decode()
    except UnicodeDecodeError as e:
        counts[line[e.start:e.end]] += 1
        return line[:e.start].decode() + _cleanline(line[e.end:], counts)

def cleanlines(fp):
    '''
    Convert bytes to text; track decoding errors in ``cleanlines.errors``.
    ``fp`` is an open file-like iterable of byte lines.
    '''
    cleanlines.errors = Counter()
    for line in fp:
        yield _cleanline(line, cleanlines.errors)

f = BytesIO(b'''\
this,is line,one
line two,has junk,\xffin it
so does,\xfa\xffline,three
''')

for row in csv.reader(cleanlines(f)):
    print(row)

print(cleanlines.errors.most_common())
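
That should print the three rows with the junk stripped out, e.g.
['line two', 'has junk', 'in it'], and the final counter comes to
[(b'\xff', 2), (b'\xfa', 1)].

If you'd rather lean on the codecs machinery than splice lines by hand,
another option is to register a custom error handler that tallies the
offending bytes and then substitutes U+FFFD the way the built-in
'replace' handler does. A minimal sketch; the handler name
'count_and_replace' and the ``bad_bytes`` counter are names I made up,
not anything in the stdlib:

import codecs
from collections import Counter

bad_bytes = Counter()

def count_and_replace(exc):
    # tally the undecodable byte sequence, then resume decoding just
    # past it, emitting the usual U+FFFD replacement character
    if isinstance(exc, UnicodeDecodeError):
        bad_bytes[exc.object[exc.start:exc.end]] += 1
        return ('\ufffd', exc.end)
    raise exc

codecs.register_error('count_and_replace', count_and_replace)

print(b'so does,\xfa\xffline,three'.decode('utf-8',
                                           errors='count_and_replace'))
print(bad_bytes.most_common())

The nice part is that open() accepts registered handler names too, so
passing errors='count_and_replace' lets the counting happen
transparently while you iterate the file as text.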


