using DictReader() with .decode('utf-8', 'ignore')

Tue Apr 14 09:48:26 EDT 2015

On Tue, 14 Apr 2015 11:37 pm, Vincent Davis wrote:

>> Which DictReader? Do you mean the one in the csv module? I will assume
>> so.
>>
> yes.
> 
> 
>>
>> # untested
>> with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='') as f:
>>     reader = csv.DictReader(f)
>>     for row in reader:
>>         print(row['fieldname'])
>>
> 
> What you have seems to work, now I need to go find my strange symbols that
> are not 'utf-8' and see what happens
> I thought that I had to open with 'rb' to use encoding?

No, in Python 3 the rules are:

'rb' reads in binary mode, returns raw bytes without doing any decoding;

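'r' reads in text mode, returns Unicode text, using the codec/encoding
specified. If no encoding is specified, the default depends on the
platform: Python uses locale.getpreferredencoding(False), which is usually
UTF-8 on Linux and Mac OS X but often a legacy code page such as cp1252 on
Windows. So it is safest to pass encoding explicitly.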
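
A small sketch of the difference between the two modes, in case it helps.
The file name and its contents are made up for the demonstration; the only
calls used are from the standard library.

```python
import locale
import os
import tempfile

# Made-up sample data: one non-ASCII character, written out as UTF-8 bytes.
sample = "café\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Binary mode: no decoding, you get the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: bytes are decoded with the codec you name.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()

print(type(raw), raw)
print(type(text), repr(text))

# What open() falls back on when no encoding is given (platform-dependent).
print(locale.getpreferredencoding(False))
```

Note that the é takes two bytes in the raw data but is a single character
once decoded.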

If you are getting decoding errors when reading the file, it is possible
that the file isn't actually UTF-8. One test you can do:

with open(dfile, 'rb') as f:
    for lineno, line in enumerate(f, 1):
        try:
            line.decode('utf-8', 'strict')
        except UnicodeDecodeError as err:
            # Report which line failed and why.
            print(lineno, err)

If you need help deciphering the errors, please copy and paste them here and
we'll see what we can do.
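
Before you settle on errors='ignore', it is worth seeing what it silently
throws away. Here is a rough sketch using made-up bytes (the word "café"
encoded as Latin-1 rather than UTF-8, which is only my guess at the kind of
thing that might be in your file):

```python
# Made-up sample: in Latin-1 the é is the single byte 0xE9, which is not
# valid UTF-8 when followed by an ASCII character.
data = "café,1\n".encode("latin-1")

ignored = data.decode("utf-8", "ignore")    # the bad byte is silently dropped
replaced = data.decode("utf-8", "replace")  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))

try:
    data.decode("utf-8", "strict")
except UnicodeDecodeError as err:
    # err.start and err.end are the offsets of the offending bytes.
    bad = data[err.start:err.end]
    print(err.start, bad, repr(bad.decode("latin-1")))
```

With 'ignore' the character simply disappears from the field, so 'replace'
is often the safer choice while you are still investigating: the U+FFFD
marker at least shows where data was lost.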

-- 
Steven