[Python-ideas] csv.DictReader could handle headers more intelligently.

Thu Jan 24 13:33:07 CET 2013

On Thu, Jan 24, 2013 at 9:55 PM, Shane Green <shane at umbrellacode.com> wrote:
> Not sure if I'm reading the discussion correctly, but it sounds like there's
> discussion about whether swallowing CSV values when confronted with multiple
> columns by the same name, which seems very incorrect if so.  CSV doesn't
> even mandate column headers exist at all, as far as I know.  If anything I
> would think mapping column positions to header values would make sense, such
> that header.items() -> [(0, header1), (1, header2), (2, header3), etc.], and
> header1 and header2 could be equal.  To work with rows as dictionaries they
> can follow the FieldStorage model and have lists of values–either when
> there's a collision, or always–so all column values are contained.

That's not quite the discussion. The discussion is specifically about
*DictReader*, and whether it should:

1. Do any data conditioning by ignoring empty lines and lines of just
field delimiters before the header row (consensus seems to be "no")
2. Give an error when encountering a duplicate field name, since
duplicates currently lead to silent data loss when reading from the
file (consensus seems to be "yes")

The problem with the latter suggestion is that it's a
backwards-incompatible change: code where "use the last column with
that name" is the correct behaviour currently works, but would break
if that situation were declared an error.
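
For reference, here is a minimal sketch of the current "last column
wins" behaviour. The sample data and variable names are made up for
illustration; only csv.DictReader itself comes from the stdlib.

```python
import csv
import io

# Hypothetical sample data: two columns share the header "value".
sample = io.StringIO("id,value,value\n1,first,second\n")

reader = csv.DictReader(sample)
rows = list(reader)

# fieldnames still lists both header entries...
print(reader.fieldnames)
# ...but each row dict keeps only the last column with that name,
# so "first" is silently dropped.
print(rows[0]["value"])
```

Any code that relies on getting "second" here would start failing if
duplicate names became an error.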

Rather than messing with DictReader, it seems more fruitful to further
investigate the idea of a namedtuple based reader
(http://bugs.python.org/issue1818). The "multiple columns with the
same name" use case seems specialised enough that the standard readers
can continue to ignore it (although, as noted earlier in this thread,
a namedtuple based reader will correctly reject duplicate column
names).
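
To make that last point concrete, here is a rough sketch of what such
a reader might look like. This is not the patch attached to the issue;
namedtuple_reader and the "Row" type name are hypothetical, and the
sketch assumes the header entries are valid Python identifiers
(namedtuple raises ValueError otherwise, unless rename=True is used).

```python
import csv
import io
from collections import namedtuple


def namedtuple_reader(f, typename="Row", **kwargs):
    """Hypothetical helper: yield each data row as a namedtuple
    whose fields are taken from the header row."""
    reader = csv.reader(f, **kwargs)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: nothing to yield
    # namedtuple() raises ValueError on duplicate field names.
    row_type = namedtuple(typename, header)
    for row in reader:
        # _make raises TypeError if the row length differs from the header.
        yield row_type._make(row)


good = io.StringIO("id,value\n1,first\n")
rows = list(namedtuple_reader(good))
print(rows[0].value)

bad = io.StringIO("id,value,value\n1,first,second\n")
try:
    list(namedtuple_reader(bad))
except ValueError as exc:
    print("rejected:", exc)
```

Because this would be a new API, rejecting duplicates there breaks no
existing code, which is the main attraction over changing DictReader.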

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia