[Python-ideas] csv.DictReader could handle headers more intelligently.

Thu Jan 24 16:11:34 CET 2013

On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
> > 1. Do any data conditioning by ignoring empty lines and lines of
> > just field delimiters before the header row (consensus seems to be
> > "no")

Well, I wouldn't necessarily say we have a consensus on this one.  This
idea received a +1 from Bruce Leban and an "I don't see any reason not
to" from Steven D'Aprano.

Objections are:

1. It's a backwards-incompatible change.  (This could be mitigated in a
couple of ways, as with the duplicate header problem below.)  I don't
think anyone has argued that programmers would ever actually want to use
the blank line as the headers, only that they may be doing it now as a
workaround, and breaking the workarounds is undesirable.

2. You should pre-process the CSV instead of adapting the reader to
malformations.  (In which case, I think the DictReader.reader attribute
should be better documented, so programmers have some guidance on how to
do the pre-processing.  The current DictReader can cause data loss, which
makes it difficult to recover the real headers without using the
underlying reader.)
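
As a rough illustration of the pre-processing approach, the sketch below
uses a plain csv.reader (the same kind of object DictReader keeps in its
reader attribute) to skip blank and delimiter-only lines, then hands the
first real row to DictReader as explicit fieldnames.  The helper name
dict_reader_skipping_blank_headers is hypothetical, not part of the csv
module, and the sketch assumes the file object can be shared between the
two readers, as is the case for io.StringIO and ordinary text files.

```python
import csv
import io


def dict_reader_skipping_blank_headers(f, **kwargs):
    """Hypothetical helper: use the first non-blank row as the header.

    kwargs are passed to both csv.reader and csv.DictReader, so they
    should be formatting options such as delimiter or dialect.
    """
    reader = csv.reader(f, **kwargs)
    for row in reader:
        # csv.reader yields [] for an empty line and ['', '', ...]
        # for a line made up only of delimiters.
        if any(field.strip() for field in row):
            # Continue reading from the same file object, but with the
            # header supplied explicitly.
            return csv.DictReader(f, fieldnames=row, **kwargs)
    # No usable header found: fall back to an ordinary (empty) DictReader.
    return csv.DictReader(f, **kwargs)


data = "\n,,\nname,qty\nspam,1\neggs,2\n"
rows = list(dict_reader_skipping_blank_headers(io.StringIO(data)))
for row in rows:
    print(row["name"], row["qty"])
```

With a stock DictReader the same input would take its fieldnames from the
first (empty) line, so the real header row would be read as data.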

> > 2. Give an error when encountering a duplicate field name (which 
> > will lead to data loss when reading from the file) (consensus seems
> > to be "yes") 

Mostly, but with a strong objection from Mark Hackett, and hesitation
about altering current behavior from Amaury Forgeot d'Arc.
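
For reference, the data loss in question is easy to reproduce with the
current csv.DictReader.  The column names here are made up for the
example.

```python
import csv
import io

# Two columns share the header "value".
data = "id,value,value\n1,first,second\n"
row = next(csv.DictReader(io.StringIO(data)))

print(row["value"])  # prints "second"; "first" has been silently overwritten
print(len(row))      # prints 2, although the line had three fields
```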

Proposals to solve this problem:

1. Raise an exception (after setting the fieldnames, I think, so that if
you wanted to catch and continue, or catch and edit the conflicting
fieldnames, you could do so).

2. Combine multiple fields with the same header into a list under the
same key.

2a. Make lists when there are multiple fields, but otherwise map keys to
strings, as is currently done.

2b. For consistency, make all values lists, regardless of the number of
columns.
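
To make the difference between 2a and 2b concrete, here is a minimal
sketch of the list-combining behavior.  The function read_multi and its
always_lists flag are hypothetical names chosen for this example, and the
sketch ignores DictReader's restkey and restval handling for brevity.

```python
import csv
import io
from collections import Counter


def read_multi(f, always_lists=False, **kwargs):
    """Yield one dict per row, collecting duplicate headers into lists.

    always_lists=False corresponds to proposal 2a,
    always_lists=True to proposal 2b.
    """
    reader = csv.reader(f, **kwargs)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: nothing to yield
    counts = Counter(header)
    for row in reader:
        if not row:
            continue  # skip blank lines, as DictReader does
        grouped = {}
        for name, value in zip(header, row):
            grouped.setdefault(name, []).append(value)
        if not always_lists:
            # 2a: only headers that occur more than once keep their list.
            grouped = {
                name: (values if counts[name] > 1 else values[0])
                for name, values in grouped.items()
            }
        yield grouped


data = "id,value,value\n1,first,second\n"
rows_2a = list(read_multi(io.StringIO(data)))
rows_2b = list(read_multi(io.StringIO(data), always_lists=True))
print(rows_2a[0]["value"])  # prints ['first', 'second']
print(rows_2b[0]["id"])     # prints ['1']
```

Under 2a the type of a value depends on the header, so callers have to
check for str versus list; under 2b every value is a list, at the cost of
extra indexing in the common case.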

Proposals for implementation:

1. Create a new Reader class.  Suggestions include
"CarefulDictReader" (for the version that raises an exception) and
"MultiDictReader" (for the versions that make lists of values).

2. Add an option to DictReader.  The idea to add an option for a
MultiDictReader-like behavior was objected to, but there were multiple
suggestions to add an option for raising an exception, in one case with
the idea that in the future ("Python 4") the option would be standard
behavior.

Note: If we were to implement a CarefulDictReader, it could, without
backward incompatibility, implement both the skipping of blank header
lines and the raising of an exception on duplicate headers.
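
One possible shape for such a class is sketched below.  Nothing here
exists in the csv module: CarefulDictReader and DuplicateFieldnamesError
are hypothetical, and the sketch makes two assumptions of its own.  It
reads the header eagerly in __init__ (the stock DictReader reads it
lazily), and it raises on the first read rather than at construction, so
that the fieldnames are already set and can be edited after catching the
exception.

```python
import csv
import io
from collections import Counter


class DuplicateFieldnamesError(ValueError):
    """Hypothetical exception for headers that occur more than once."""

    def __init__(self, duplicates):
        super().__init__("duplicate field names: %r" % (duplicates,))
        self.duplicates = duplicates


class CarefulDictReader(csv.DictReader):
    """Sketch: skip blank header lines and refuse duplicate headers."""

    def __init__(self, f, fieldnames=None, *args, **kwds):
        super().__init__(f, fieldnames, *args, **kwds)
        if fieldnames is None:
            # Use the underlying reader to find the first row that is
            # neither empty nor made up only of delimiters.
            for row in self.reader:
                if any(field.strip() for field in row):
                    self.fieldnames = row
                    break
        self._validated = False

    def __next__(self):
        if not self._validated:
            self._validated = True  # raise at most once
            names = self.fieldnames or []
            duplicates = sorted(n for n, c in Counter(names).items() if c > 1)
            if duplicates:
                raise DuplicateFieldnamesError(duplicates)
        return super().__next__()


data = "\n,,\nid,value,value\n1,first,second\n"
reader = CarefulDictReader(io.StringIO(data))
try:
    first = next(reader)
except DuplicateFieldnamesError as exc:
    # Catch, rename the conflicting columns, and carry on.
    reader.fieldnames = ["id", "value_1", "value_2"]
    first = next(reader)
print(first["value_1"], first["value_2"])  # prints "first second"
```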

Cheers,
Cliff