[Python-ideas] csv.DictReader could handle headers more intelligently.

Shane Green shane at umbrellacode.com
Thu Jan 24 16:28:49 CET 2013


Since every form of CSV file counts EOL as a line terminator, I think discarding empty lines that precede the headers is arguably acceptable, but I do not think discarding lines consisting only of delimiters would be.  What about extending the DictReader API so that these actions are easy to perform explicitly, for example with a discard() method that drops the current field names so they are re-evaluated from the next line?
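
A rough sketch of the discard() idea, to make it concrete. The class name and the discard() method are hypothetical; nothing like them exists in the csv module. The sketch also assumes the CPython behaviour where assigning None to fieldnames causes the next access to read a fresh row from the underlying reader, which is an implementation detail rather than documented API:

```python
import csv
import io


class DiscardingDictReader(csv.DictReader):
    """Hypothetical DictReader with an explicit discard() method."""

    def discard(self):
        # Assumption: in CPython, field names reset to None are re-read
        # from the next row of the underlying reader on the next access.
        self.fieldnames = None
        return self.fieldnames


data = io.StringIO(",,\nname,age\nalice,30\n")
reader = DiscardingDictReader(data)

# The caller, not the reader, decides that a row of bare delimiters is
# not a real header and asks for the next row instead.
while reader.fieldnames is not None and not any(reader.fieldnames):
    reader.discard()

rows = list(reader)
```

The point of the design is that nothing is discarded silently: the policy for what counts as a junk header stays in the calling code.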





Shane Green 
www.umbrellacode.com
805-452-9666 | shane at umbrellacode.com

On Jan 24, 2013, at 7:11 AM, "J. Cliff Dyer" <jcd at sdf.lonestar.org> wrote:

> On Thu, 2013-01-24 at 13:38 +0100, Antoine Pitrou wrote:
>>> 1. Do any data conditioning by ignoring empty lines and lines of
>>> just field delimiters before the header row (consensus seems to be
>>> "no")
> 
> Well, I wouldn't necessarily say we have a consensus on this one.  This
> idea received a +1 from Bruce Leban and an "I don't see any reason not
> to" from Steven D'Aprano.
> 
> Objections are:
> 
> 1. It's a backwards-incompatible change.  (This could be mitigated in a
> couple of ways, as with the duplicate header problem, below.) I don't think
> anyone has argued that programmers would ever actually want to use the
> blank line as the headers, only that they may be doing it now as a
> workaround, and breaking the workarounds is undesirable.
> 
> 2. You should pre-process the CSV instead of adapting the reader to
> malformations. (In which case, I think the DictReader.reader attribute
> should be better documented, so programmers have some guidance on how to do
> the pre-processing, as the current DictReader can cause data loss which
> would make it difficult to recover the real headers without using the
> underlying reader).
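
For reference, pre-processing through the underlying reader might look roughly like this. The helper name is made up for illustration, and it relies on the DictReader.reader attribute and on assigning to fieldnames, both of which work in CPython but are only lightly documented:

```python
import csv
import io


def dict_reader_after_blank_rows(f, **fmtparams):
    """Hypothetical helper: skip blank rows, then hand back a DictReader."""
    dict_reader = csv.DictReader(f, **fmtparams)
    # Pull rows from the underlying csv.reader directly, so DictReader
    # never gets the chance to adopt a blank row as its header.
    for row in dict_reader.reader:
        if any(field.strip() for field in row):
            dict_reader.fieldnames = row
            break
    return dict_reader


data = io.StringIO("\n,,\nname,age\nalice,30\nbob,25\n")
rows = list(dict_reader_after_blank_rows(data))
```

Here both the empty line and the delimiter-only line are skipped before the header is chosen, and no row is consumed as a bogus header along the way.
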
> 
> 
>>> 2. Give an error when encountering a duplicate field name (which 
>>> will lead to data loss when reading from the file) (consensus seems
>>> to be "yes") 
> 
> Mostly, but with a strong objection from Mark Hackett, and hesitation
> about altering current behavior from Amaury Forgeot d'Arc.
> 
> Proposals to solve this problem:
> 
> 1. Raise an exception (After setting the fieldnames, I think, so if you
> wanted to catch and continue or catch and edit the conflicting
> fieldnames, you could do so).
> 
> 2. Combine multiple fields with the same header into a list under the
> same key.
> 
> 2a. Make lists when there are multiple fields, but otherwise map keys to
> strings, as is currently done.
> 
> 2b. For consistency, make all values lists, regardless of the number of
> columns.
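
To compare 2a and 2b side by side, here is a minimal sketch built on the plain csv.reader. The function name and the always_lists flag are invented for this example and are not part of any proposal in the thread:

```python
import csv
import io


def iter_multidict_rows(f, always_lists=False, **fmtparams):
    """Hypothetical generator: collect duplicate headers into lists."""
    reader = csv.reader(f, **fmtparams)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input, nothing to yield
    duplicated = {name for name in header if header.count(name) > 1}
    for row in reader:
        if not row:
            continue  # skip blank data rows, as DictReader does
        record = {}
        for name, value in zip(header, row):
            if always_lists or name in duplicated:
                # 2b (always_lists=True) or a repeated header under 2a
                record.setdefault(name, []).append(value)
            else:
                # 2a: a unique header keeps its plain string value
                record[name] = value
        yield record


text = "id,tag,tag\n1,a,b\n"
rows_2a = list(iter_multidict_rows(io.StringIO(text)))
rows_2b = list(iter_multidict_rows(io.StringIO(text), always_lists=True))
```

With 2a the type of a value depends on the header row of the particular file, while 2b is uniform at the cost of wrapping every value in a list.
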
> 
> Proposals for implementation:
> 
> 1. Create a new Reader class.  Suggestions include
> "CarefulDictReader" (for the version that raises an exception) and
> "MultiDictReader" (for the versions that make lists of values).
> 
> 2. Add an option to DictReader.  The idea to add an option for a
> MultiDictReader-like behavior was objected to, but there were multiple
> suggestions to add an option for raising an exception, in one case with
> the idea that in the future ("Python 4") the option would be standard
> behavior.
> 
> 
> Note: If we were to implement a CarefulDictReader, it could, without
> backward incompatibility, implement both skipping of blank header lines,
> and exception raising on duplicate headers.
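
A sketch of what such a CarefulDictReader could look like. Only the class name comes from the thread; the exception class is invented here. As a simplification, it reads the header eagerly in the constructor, whereas DictReader reads it lazily, so the exception is raised before a reader object is returned:

```python
import csv
import io


class DuplicateFieldnamesError(ValueError):
    """Hypothetical exception for repeated header names."""


class CarefulDictReader(csv.DictReader):
    """Sketch: skip blank header rows and reject duplicate field names."""

    def __init__(self, f, fieldnames=None, *args, **kwargs):
        super().__init__(f, fieldnames, *args, **kwargs)
        if fieldnames is None:
            # Skip empty rows and rows of bare delimiters before the header.
            for row in self.reader:
                if any(field.strip() for field in row):
                    self.fieldnames = row
                    break
        names = self.fieldnames or []
        duplicates = sorted({n for n in names if names.count(n) > 1})
        if duplicates:
            raise DuplicateFieldnamesError(
                "duplicate field names: %s" % ", ".join(duplicates)
            )


rows = list(CarefulDictReader(io.StringIO("\n,,\nname,age\nalice,30\n")))

try:
    CarefulDictReader(io.StringIO("id,tag,tag\n1,a,b\n"))
except DuplicateFieldnamesError as exc:
    message = str(exc)
```

Because this is a separate class, existing code that uses csv.DictReader keeps its current behaviour.
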
> 
> Cheers,
> Cliff
