[Python-ideas] csv.DictReader could handle headers more intelligently.

Thu Jan 24 11:37:57 CET 2013

On Wednesday 23 Jan 2013, Jerry Hill wrote:
> On Wed, Jan 23, 2013 at 1:32 PM, Mark Hackett
> <mark.hackett at metoffice.gov.uk> wrote:
> > I can't see why there would be duplicate column headers for a valid
> > reason.
> >
> > Someone may have written their CSV export incorrectly, but that's not
> > actually valid.
> 
> Sure it is.  Since there is no formal spec for .csv files, having
> multiple columns with the same text in the header is a perfectly valid
> .csv file.  For what it's worth, the informal spec for csv files seems

Then you don't want it put in a dictionary, since a dictionary doesn't allow 
duplicate keys.

> to be "whatever Excel does" and Excel (and every other
> spreadsheet-oriented program) is happy to let you have duplicated
> headers too.

You don't, in Excel, use the name of the column in your calculation; you use 
the unique column ID (A, B, C, ..., AA, AB, ...).
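
The nearest Python equivalent to addressing columns by position is plain 
csv.reader, which hands back each row as a list. A minimal sketch, assuming 
made-up sample data with a repeated "name" header:

```python
import csv
import io

# Hypothetical sample data: the "name" header appears twice.
data = "name,age,name\nAlice,30,Smith\n"

rows = list(csv.reader(io.StringIO(data)))
header, first = rows[0], rows[1]

# Columns are addressed by index, so duplicate header text is harmless.
print(header)               # all three header cells survive
print(first[0], first[2])   # both "name" columns are still reachable
```

Nothing is lost here because the header text is never used as a key.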

> 
> > It would therefore be arguable for the program to give at least a WARNING
> > that it's throwing data away.
> 
> I think the library should give the programmer some sort of indication
> that they are losing data.  Personally, I'd prefer an exception which
> can either be caught or not, depending on whether the program is
> designed to handle the situation or not.
> 
> > However, Python is mechanising this as a dictionary, and in Python
> > setting A to 1 and then setting A to 3 throws away the earlier value
> > for A, so the import function is working AS EXPECTED in Python.
> 
> I'm not sure this behavior merits the all-caps "AS EXPECTED" label.
> It's not terribly surprising once you sit down and think about it, but
> it's certainly at least a little unexpected to me that data is being
> thrown away with no notice.  It's unusual for errors to pass silently
> in python.
> 

Python doesn't warn when a dictionary key is assigned a second time, so as 
expected, it isn't warning about it now.
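
To make that concrete, here is a small sketch of both behaviours side by 
side. The sample data is invented for illustration; the calls are the standard 
csv.DictReader and io.StringIO.

```python
import csv
import io

# Plain dictionary: the second assignment replaces the first, silently.
d = {}
d["A"] = 1
d["A"] = 3
print(d)  # only the later value for "A" remains

# Hypothetical sample data: the "name" header appears twice.
data = "name,age,name\nAlice,30,Smith\n"

row = next(csv.DictReader(io.StringIO(data)))

# The row has two keys, not three; the last "name" column wins.
print(dict(row))
```

The DictReader row ends up with "Smith" under "name", and "Alice" is gone 
without any warning or exception.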

Programming languages are hard enough to understand (why does everyone use a 
different way of stopping a loop???), so it's not a good idea to have little 
codas to the way things are done "oh, unless you're putting it into a 
dictionary via this call...".

I can understand the library call doing so, mind, but I can also see the 
writer of the library going "You're putting it into a dictionary. Well, you 
know what happens when you put duplicate entries in them, right, else you 
wouldn't be using this routine that puts csv entries into a dictionary".
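
For anyone who does want to be told, the check is easy enough to do in the 
calling code rather than in the library. A rough sketch follows; the name 
strict_dict_reader is hypothetical and not part of the csv module, and 
raising ValueError is just one possible choice.

```python
import csv
import io
from collections import Counter


def strict_dict_reader(f, **kwargs):
    """Hypothetical wrapper: refuse files whose header repeats a name."""
    reader = csv.DictReader(f, **kwargs)
    # Accessing fieldnames reads the header row if it has not been read yet.
    names = reader.fieldnames or []
    dupes = [name for name, count in Counter(names).items() if count > 1]
    if dupes:
        raise ValueError("duplicate column headers: %r" % dupes)
    return reader


# Usage with made-up data: the duplicated "a" header is rejected.
try:
    strict_dict_reader(io.StringIO("a,b,a\n1,2,3\n"))
except ValueError as exc:
    print("rejected:", exc)
```

A caller that is prepared for duplicates can catch the exception; one that 
isn't gets a traceback instead of quietly missing a column.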