csv.DictReader line skipping should be considered a bug?

Mon Dec 11 08:48:32 EST 2017

On 2017-12-05, Steve D'Aprano <steve+python at pearwood.info> wrote:
> On Wed, 6 Dec 2017 04:20 am, Jason wrote:
>> while iterating over two files, which are line-by-line
>> corresponding. The DictReader skipped ahead many lines
>> breaking the line-by-line correspondence.
>
> Um... this doesn't follow. If they are line-by-line
> corresponding, then they should skip the same number of blank
> lines and read the same number of non-blank lines.
>
> Even if one file has blanks and the other does not, if you
> iterate over the records themselves, they should keep their
> correspondence.
>
> I'm afraid that if you want to convince me this is a buggy
> design, you need to demonstrate a simple pair of CSV files
> where the non-blank lines are corresponding (possibly with
> differing numbers of blanks in between) but the CSV readers get
> out of alignment somehow.

Preface: I'm not arguing for this to be changed--it obviously
cannot be at this point, and we know how to work around it when
it matters--although the current design does make finding the
erroneous records harder than it needs to be.

Examine the records that DictReader returns for the following csv
file and see if you think it still feels obvious and usable.

A,B,C
a,b,c
a,b
a

a,b
a,b,c
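
For anyone who wants to try it without creating a file, here is a
minimal sketch that feeds the sample above to DictReader through
io.StringIO. The names SAMPLE and read_records are made up for
illustration; the expected results in the comments assume the
default restval of None.

```python
import csv
import io

# The sample file from above, including the blank line after "a".
SAMPLE = "A,B,C\na,b,c\na,b\na\n\na,b\na,b,c\n"


def read_records(text):
    """Return every record DictReader yields for the given CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    # dict() so the result prints the same on versions that return OrderedDict.
    return [dict(row) for row in reader]


records = read_records(SAMPLE)
for record in records:
    print(record)

# Six data lines go in, but only five records come out: short rows are
# padded with None, while the blank line is dropped without a trace.
# {'A': 'a', 'B': 'b', 'C': 'c'}
# {'A': 'a', 'B': 'b', 'C': None}
# {'A': 'a', 'B': None, 'C': None}
# {'A': 'a', 'B': 'b', 'C': None}
# {'A': 'a', 'B': 'b', 'C': 'c'}
```

Anything that pairs these records with the lines of a second file
is off by one from the fourth record onward.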

Furthermore, see what DictWriter produces from this program:

import csv

# newline='' lets the csv module control the line endings itself.
with open("wcsv.csv", 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=('A', 'B', 'C'))
    writer.writeheader()
    # Records with progressively fewer keys, including an empty dict.
    for rec in (
            {'A': 'a', 'B': 'b', 'C': 'c'},
            {'A': 'a', 'B': 'b'},
            {'A': 'a'},
            {},
            {'A': 'a'},
            {'A': 'a', 'B': 'b'},
            {'A': 'a', 'B': 'b', 'C': 'c'},):
        writer.writerow(rec)

DictReader doesn't handle the output of DictWriter in a usable
and recoverable way.
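
To inspect the round trip directly, the following sketch writes the
same records to an in-memory buffer and reads them back. RECORDS
and round_trip are illustrative names, not part of the csv module.
It relies on DictWriter filling missing keys with its restval,
which defaults to an empty string.

```python
import csv
import io

# The same seven records as in the program above.
RECORDS = [
    {'A': 'a', 'B': 'b', 'C': 'c'},
    {'A': 'a', 'B': 'b'},
    {'A': 'a'},
    {},
    {'A': 'a'},
    {'A': 'a', 'B': 'b'},
    {'A': 'a', 'B': 'b', 'C': 'c'},
]


def round_trip(records, fieldnames=('A', 'B', 'C')):
    """Write records with DictWriter, then read them back with DictReader."""
    buf = io.StringIO(newline='')
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    text = buf.getvalue()
    rows = [dict(row) for row in csv.DictReader(io.StringIO(text, newline=''))]
    return text, rows


text, rows = round_trip(RECORDS)
print(text)
for original, returned in zip(RECORDS, rows):
    print(original, '->', returned)
```

With the default settings the empty dict is written as ",," and
comes back as {'A': '', 'B': '', 'C': ''}, so keys that were absent
when writing return as empty strings and the original dicts are not
reproduced exactly.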

>> And I want to argue that the difference of behavior should be
>> considered a bug. It should be considered as such because: 1.
>> I need to know what's in the file to know what class to use.
>
> Sure. But blank lines don't tell you what class to use.
>
>> The file content should not break at-least-1-record-per-line.
>
> Blank lines DO break that requirement. A blank line is not a
> record.

Except it inconsistently is one for DictWriter, which if you were
right should produce no output for an empty dict.

> is a blank record with five empty fields. \n alone is just a
> blank. The DictReader correctly returns records with blank
> fields.

The question to me is what should DictReader do when the hopeful
constraint that the header defines the number of fields is broken
in the data. In my opinion, it should do whatever makes the
situation simplest for the programmer to handle. This is in fact
usually what happens. When a row has more fields than are defined
in the header, you can choose where the extras end up by setting
restkey. When some fields are missing--it sets them to restval
(None by default). Except, when all the fields are missing, it
silently hides the error with no ability provided to recover it.
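
One possible workaround, sketched here under the assumption that a
blank line should surface as a record whose fields are all missing:
build the dicts from csv.reader, which does return blank lines as
empty rows. The helper name dict_rows_keeping_blanks is
hypothetical; its restkey and restval arguments only imitate the
DictReader parameters of the same names.

```python
import csv
import io


def dict_rows_keeping_blanks(f, restkey=None, restval=None):
    """Yield (line_num, dict) for every data row, blank lines included."""
    reader = csv.reader(f)
    fieldnames = next(reader, None)
    if fieldnames is None:
        return  # empty input: no header, no rows
    for row in reader:
        record = dict(zip(fieldnames, row))
        if len(row) > len(fieldnames):
            # Surplus fields are collected under restkey.
            record[restkey] = row[len(fieldnames):]
        else:
            # Missing fields (all of them for a blank line) get restval.
            for key in fieldnames[len(row):]:
                record[key] = restval
        yield reader.line_num, record


sample = "A,B,C\na,b,c\na,b\na\n\na,b\na,b,c\n"
rows = list(dict_rows_keeping_blanks(io.StringIO(sample)))
for line_num, record in rows:
    print(line_num, record)
```

Every physical line now yields exactly one record, so two files can
be walked in lockstep, and the line number points straight at the
offending line.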

-- 
Neil Cerutti