[Python-ideas] csv.DictReader could handle headers more intelligently.

Tue Jan 29 13:26:13 CET 2013

On 29/01/13 04:45, Mark Hackett wrote:
> On Monday 28 Jan 2013, MRAB wrote:
>> It shouldn't silently drop the columns
>>
>
> Why not?
>
> It's adding to a dictionary and adding a duplicate key replaces the earlier
> one.

Then adding to a dictionary was a mistake.

The choice of a dict is *implementation*, not *interface*. The interface needed
is to return a mapping of column names to values. The nature of that mapping is
an implementation detail, and dict is only the simplest solution, not necessarily
the correct solution.

There is nothing about CSV files that implies that the right behaviour is to drop
columns. The nature of CSV files is to allow duplicate column names, and so CSV
readers should too. That implies that using a dict, which silently drops duplicate
keys, was the wrong choice.
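
To make the silent drop concrete, here is a minimal sketch. The two-line
file contents are made up for illustration; only csv.DictReader itself is
real.

```python
import csv
import io

# Made-up sample data: the header repeats the column name "score".
data = io.StringIO("name,score,score\nalice,1,2\n")

# DictReader builds a mapping from the header, so the second "score"
# overwrites the first. No warning or error is raised.
row = next(csv.DictReader(data))
print(dict(row))  # {'name': 'alice', 'score': '2'}
```

The value '1' is gone, and nothing tells the caller that it was ever there.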

We might argue that using duplicate column names is stupid, but CSV supports it,
and so should CSV readers.

> If it dropped the columns and shouldn't have, then the results will be seen to
> be wrong anyway, so there's not a huge amount of need for this.

You cannot assume that the caller knows that there are duplicated column names.
That's why dropping columns is problematic: it *silently* drops them, giving the
caller no idea that it has happened.

Given that DictReader already exists, and that there probably is someone out
there who is relying on it silently eating columns, I think that the only
reasonable way forward is to add a new reader that supports multiple columns
with the same name. The caller can then use whichever reader suits their
use-case:

* I don't care about duplicate-name columns, just give me some arbitrary one;
   - use DictReader

* I want all of the duplicate-name columns;
   - use MultiDictReader

* I want some of the duplicate-name columns;
   - use MultiDictReader, and then filter the results as you get them

(When I put it like that, DictReader sounds even less useful. But as I said,
I daresay *somebody* is relying on it right now, so we can't change it.)
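
A rough sketch of what such a reader might look like. MultiDictReader does
not exist in the csv module; the class below, and the choice to map each
column name to a list of values, are assumptions made purely for
illustration. It is not a proposed final design.

```python
import csv
import io


class MultiDictReader:
    """Hypothetical reader mapping each column name to a list of values."""

    def __init__(self, f, **kwds):
        self.reader = csv.reader(f, **kwds)
        # Treat the first row as the header; empty input gives no fields.
        self.fieldnames = next(self.reader, [])

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self.reader)
        while not row:  # skip blank lines, as DictReader does
            row = next(self.reader)
        result = {}
        for name, value in zip(self.fieldnames, row):
            # Duplicate names accumulate instead of overwriting.
            result.setdefault(name, []).append(value)
        return result


# Made-up sample data with a duplicated "score" column.
data = io.StringIO("name,score,score\nalice,1,2\n")
rows = list(MultiDictReader(data))
print(rows)  # [{'name': ['alice'], 'score': ['1', '2']}]

# "Some of the columns" is then ordinary filtering by the caller:
first_scores = [r["score"][0] for r in rows]
```

Every value is a list here, even for unique names, so that callers never
have to check whether they got a string or a list.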

> And why, really, are there duplicate column names in there anyway? You can
> come up with the assertion that this might be wanted, but they're not normally
> what you see in a csv file.
>
> I've never seen nor used a csv file that duplicated column names other than
> being blank.

Well, there you go. That is exactly one example of duplicate column names.

-- 
Steven