[Python-ideas] csv.DictReader could handle headers more intelligently.

Wed Jan 23 18:37:01 CET 2013

On Wed, 2013-01-23 at 18:08 +0100, Amaury Forgeot d'Arc wrote:
> Hi,
> 
> 2013/1/23 J. Cliff Dyer <jcd at sdf.lonestar.org>
>         On Tue, 2013-01-22 at 17:51 -0800, alex23 wrote:
>         > I don't think we should start adding support for every malformed type
>         > of csv file that exists. It's easy enough to remove the unnecessary
>         > lines yourself before passing them to DictReader:
>         >
>         >     from csv import DictReader
>         >
>         >     with open('malformed.csv','rb') as csvfile:
>         >         csvlines = list(l for l in csvfile if l.strip())
>         >         csvreader = DictReader(csvlines)
>         >
>         > Personally, if I was dealing with this as often as you are, I'd
>         > probably make a custom context manager instead. The problem lies in
>         > the files themselves, not in csv's response to them.
>         
>         
>         With all due respect, while you make a good point that we don't
>         want to start special casing every malformed type of CSV, there is
>         absolutely something wrong with DictReader's response to files that
>         have duplicate headers. It throws away data silently.
> 
> 
> That's how Python dictionaries work, by design:
>     d = {'a': 1, 'a': 2}
> "silently" discards the first value.
> 
> 
>         If you (and others on this list) aren't in favor of trying to find
>         the right header row (which I can understand: "In the face of
>         ambiguity, refuse the temptation to guess."), maybe a better
>         solution would be to raise a (suppressible) exception if the
>         headers aren't uniquely named. ("Errors should never pass
>         silently.  Unless explicitly silenced.")
> 
> 
> What about a subclass then:
> 
> 
> class CarefulDictReader(csv.DictReader):
>     def __init__(self, *args, **kwargs):
>         super().__init__(*args, **kwargs)
>         fieldnames = self.fieldnames
>         if len(fieldnames) != len(set(fieldnames)):
>             raise ValueError("Duplicate field names", fieldnames)
> 
> 
> 
> 
> -- 
> Amaury Forgeot d'Arc

Whether this belongs in a subclass or in the existing class is worth
discussing.  Obviously, the change could be made in a subclass.
Currently, that's what I do.  The question at issue is whether it
should be made in the original.  My position is that something should
change in the standard library, whether that is modifying the code to
handle edge cases more robustly, or updating the documentation to advise
programmers on how to handle files that aren't perfectly formed.

This might include documenting that self.reader is an available
attribute (the programmer could iterate over it to find the header row
they're looking for, if needed, and then assign that row to
self.fieldnames).
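
A minimal sketch of that approach, assuming the undocumented `reader`
attribute keeps behaving as it does in CPython's csv module today.  The
helper name `dict_reader_from_header`, the `is_header` predicate, and the
sample data are made up for illustration:

```python
import csv
import io


def dict_reader_from_header(f, is_header, **kwargs):
    """Return a DictReader positioned just after the first header-like row.

    `is_header` is a caller-supplied predicate (hypothetical) that takes a
    parsed row (a list of strings) and returns True for the real header.
    """
    dict_reader = csv.DictReader(f, **kwargs)
    # dict_reader.reader is the underlying csv.reader; consuming rows from
    # it skips the preamble without DictReader treating them as headers.
    for row in dict_reader.reader:
        if is_header(row):
            # Assigning fieldnames stops DictReader from reading its own
            # header row on the first call to next().
            dict_reader.fieldnames = row
            break
    return dict_reader


# Made-up sample: a title line and a blank line precede the header.
sample = io.StringIO(
    "Report generated 2013-01-23\n"
    "\n"
    "name,qty\n"
    "spam,2\n"
    "eggs,12\n"
)
rows = list(dict_reader_from_header(sample, lambda row: "name" in row))
```

If no row satisfies the predicate, the file is exhausted, fieldnames stays
unset, and the returned reader simply yields nothing.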

I do like the idea of assigning the fieldnames variable and then raising
the ValueError, so if the user silences the exception, they still have
access to the field names found.  However, I think the check should
live in an override of the fieldnames property rather than in __init__,
so as not to change the semantics of the DictReader.
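
Roughly what I have in mind, as a sketch rather than a proposed patch.
It assumes DictReader.fieldnames stays a property with a getter and a
setter, as it is in CPython; the `_duplicates_checked` flag is my own
invention:

```python
import csv
import io


class CarefulDictReader(csv.DictReader):
    """Raise ValueError the first time duplicate field names are read."""

    _duplicates_checked = False  # hypothetical bookkeeping flag

    @property
    def fieldnames(self):
        # Delegate to the stock getter, which reads the header row lazily.
        names = csv.DictReader.fieldnames.fget(self)
        if names is not None and not self._duplicates_checked:
            # Mark as checked *before* raising, so a caller who silences
            # the exception can still read the names afterwards.
            self._duplicates_checked = True
            if len(names) != len(set(names)):
                raise ValueError("Duplicate field names", names)
        return names

    @fieldnames.setter
    def fieldnames(self, value):
        csv.DictReader.fieldnames.fset(self, value)
        # Newly assigned names get checked on their next read.
        self._duplicates_checked = False


reader = CarefulDictReader(io.StringIO("a,b,a\n1,2,3\n"))
try:
    reader.fieldnames
except ValueError:
    pass  # explicitly silenced
names_after_silencing = reader.fieldnames
```

Because the check happens in the property, constructing the reader still
does no I/O; the error surfaces on the first access to fieldnames or the
first call to next(), whichever comes first.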