[Python-ideas] csv.DictReader could handle headers more intelligently.

Shane Green shane at umbrellacode.com
Tue Jan 29 12:33:05 CET 2013


Okay, sure. The starting point of my argument is this: DictReader is nice, so why not make a variant that supports duplicate columns and makes the other behaviors easy to implement on top of it, whether that's discarding values from duplicate columns so there's a one-to-one mapping, or simply raising an exception when a duplicate column is encountered? Such a class would handle the full superset of legal CSV formats that do in fact specify exactly which header name each of their values should be mapped to.
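
To make the idea concrete, here is a minimal sketch of the kind of class I
mean (the name DupDictReader, the strict flag, and its exact behavior are
hypothetical, just for illustration):

    import csv
    from collections import defaultdict


    class DupDictReader:
        """Sketch: a DictReader-like reader that keeps duplicate columns.

        Each row comes back as a dict mapping header -> list of values, in
        the order the duplicated columns appear in the file.  With
        strict=True it raises on duplicate headers instead of accepting them.
        """

        def __init__(self, f, strict=False, dialect='excel', **kwds):
            self.reader = csv.reader(f, dialect, **kwds)
            try:
                self.fieldnames = next(self.reader)
            except StopIteration:
                raise ValueError('no header row')
            if strict and len(set(self.fieldnames)) != len(self.fieldnames):
                raise ValueError('duplicate column headers: %r'
                                 % (self.fieldnames,))

        def __iter__(self):
            return self

        def __next__(self):
            row = next(self.reader)
            d = defaultdict(list)
            for name, value in zip(self.fieldnames, row):
                d[name].append(value)
            return dict(d)

Collapsing each list down to a single value (say, the last one) would then be
a trivial post-processing step, which is the kind of conversion I mentioned
earlier.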





Shane Green 
www.umbrellacode.com
408-692-4666 | shane at umbrellacode.com

On Jan 29, 2013, at 3:16 AM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:

> On 29 January 2013 10:18, Shane Green <shane at umbrellacode.com> wrote:
>> So I wasn't really questioning the usefulness of the dictionary
>> representation, but couldn't the returned object also let you access the
>> header and value sequences, etc?  I was also thinking the conversion to a
>> simple dict with single (non-list) values per column could be part of the
>> API.
>> 
>> Appending duplicate field values as they're read reflects the order in
>> which the duplicate entries appear in the source (when I've encountered CSV
>> that purposely used duplicate column headers, the sequence in which they
>> appeared was critical).  The output from the current implementation should
>> reflect the last duplicate value, as that always replaces previous ones in
>> the dict, so my conversions returned the last value (index -1), which
>> should do the same…I think.  It was a straw man ;-).
>> 
>> I see your point about the point.  I think it would be good to have an
>> implementation that keeps all the information but still puts the most
>> usable API possible on it, rather than saying you can't have dictionary
>> access unless you're willing to lose duplicate values, for example.  I've
>> had to consume CSV a lot, and that's what would have made the module useful
>> to me; an implementation that keeps all the information and lets it be
>> easily trimmed when it isn't needed seems better than one that just
>> discards it from the start.
> 
> This is exactly what the csv.reader objects do.
> 
> While it is a problem that csv.DictReader silently discards data in a
> situation that is very likely an error, there's no need to guess how
> people want to deal with duplicate column headers and invent a new
> class for it. It's easy enough to write your own wrapper that performs
> exactly whatever processing you happen to want:
> 
> from collections import defaultdict
> 
> def multireader(csvreader):
>     try:
>         headers = next(csvreader)
>     except StopIteration:
>         raise ValueError('No header')
>     for row in csvreader:
>         d = defaultdict(list)
>         for h, v in zip(headers, row):
>             d[h].append(v)
>         yield d
> 
> 
> Oscar

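For reference, Oscar's multireader could be used along these lines (the file
name data.csv and the duplicated "id" column are hypothetical, just for
illustration):

    import csv

    with open('data.csv', newline='') as f:
        for row in multireader(csv.reader(f)):
            print(row['id'])  # every value that appeared under "id", in order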

