[Python-ideas] csv.DictReader could handle headers more intelligently.

Wed Jan 30 15:04:47 CET 2013

I think this may have been lost somewhere in the last 90 messages, but
adding a warning to DictReader in the docs seems like it solves almost the
entire problem.  New csv.DictReader users are informed, no one's old code
breaks, and a separate discussion can be had about whether it's worth
adding a csv.MultiDictReader which uses lists.
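
For anyone skimming the thread, the behaviour such a warning would describe
fits in a few lines.  This is only a minimal sketch against Python 3's csv
module; the sample data is made up for illustration.

```python
import csv
import io

# Made-up sample with a duplicated "phone" header.
data = io.StringIO("name,phone,phone\nAlice,555-0100,555-0199\n")

row = next(csv.DictReader(data))

# Both "phone" columns collapse into one key; the later column wins.
print(row["phone"])
print(len(row))
```

The first "phone" value is silently dropped, which is the case the docs
warning would call out.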

On Wed, Jan 30, 2013 at 7:59 AM, Shane Green <shane at umbrellacode.com> wrote:

> So I've done some thinking on it, a bit of research, etc., and have worked
> with a lot of different CSV content.  There are a lot of parallels between
> the name/value pairs of an HTML form submission, and our use case.
>
> Namely:
> - There's typically only one value per name, but it's perfectly legal to
> have multiple values assigned to a name.
> - When there are multiple values assigned to a name, order can be very
> important.
> - They made the mistake of mapping field names to singular values when
> there was only one value, and to multiple values when there were multiple
> values.
> - Each of these has been deprecated, and their FieldStorage now always maps
> field names to lists of values.
>
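
The "always map names to lists" convention from the form-handling side can
be sketched as follows.  Note that this uses urllib.parse.parse_qs from
Python 3 as a stand-in rather than FieldStorage itself, so that it runs
without a CGI environment; the query string is hypothetical.

```python
from urllib.parse import parse_qs

# Hypothetical form submission in which "tag" is submitted twice.
query = "user=shane&tag=csv&tag=python"

form = parse_qs(query)

# Every name maps to a list, even when only one value was submitted,
# and repeated values keep their submission order.
print(form["user"])
print(form["tag"])
```
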
> I've implemented a Record class I'm going to pitch for feedback.  Although
> I followed the FieldStorage API for a couple of methods, it didn't
> translate very well because their values are complex objects.  This Record
> class is a dictionary type that maps header names to lists of the values
> from columns labeled by that same header.  Most lists have a single value
> because usually headers aren't duplicated.  When multiple values are in a
> list, they are listed in the order they were read from the CSV file.  The
> API provides convenience methods for getting the first or last value listed
> for a given column name, making it very easy to work with singular values
> when desired.  The dictionary API will likely be the primary mechanism for
> interacting with it; however, it also knows the header and row sequences it
> was built from, and provides sequential access to them as well.  In
> addition to working with non-standard CSV, performing transformations,
> etc., this information makes it possible to reproduce correctly ordered
> CSV.
>
> I don't really know yet whether it would make sense to support any kind of
> manipulation of values on the record instances themselves, versus using
> more of a copy()/update() approach to defining and modifying records, but I
> did decide to wrap the row values in a tuple, making them read-only.  This
> was for two reasons.  One was to address a potential inconsistency that
> might arise should we decide to support editing, and the other is that the
> record is the representation of the row read from the source file, and so
> it should always accurately reflect that content.
>
> About the code: I wrote it tonight, tested it for an hour, so it's not
> meant to be perfect or final, but it should stir up a very concrete
> discussion about the API, if nothing else ;-)  I included a generator that
> seemed to work on some test files.  It most definitely is not meant to be
> critiqued or to be a distraction, but I've included it in case anyone ends
> up wanting to investigate things further.  Although the iterator function
> provides a slightly different signature than DictReader, that's not because
> I'm trying to change anything; please keep in mind the generator was just
> a test.  Also, I'd like to mention one last time that I don't think we
> should change what exists to reflect any of these changes: I was thinking
> it would be a new set of classes and functions that would become the
> preferred implementation in the future.
>
>
>
>
> class Record(dict):
>     def __init__(self, headers, fields):
>         if len(headers) != len(fields):
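>             # I don't make decisions about how gaps should be filled.
>             raise ValueError("header/field size mismatch")
>         super(Record, self).__init__()
>         # Copy both sequences so later changes by the caller can't alter
>         # what this record reports.
>         self._headers = tuple(headers)
>         self._fields = tuple(fields)
>         # Group values under their header, in the order they were read.
>         for h, v in self.fielditems():
>             self.setdefault(h, []).append(v)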
>     def fielditems(self):
>         """
>             Get header,value sequence that reflects CSV source.
>         """
>         return zip(self.headers(),self.fields())
>     def headers(self):
>         """
>             Get ordered sequence of headers reflecting CSV source.
>         """
>         return self._headers
>     def fields(self):
>         """
>             Get ordered sequence of values reflecting CSV row source.
>         """
>         return self._fields
>     def getfirst(self, name, default=None):
>         """
>             Get value of first field associated with header named
>             'name'; return 'default' if no such value exists.
>         """
>         return self[name][0] if name in self else default
>     def getlast(self, name, default=None):
>         """
>             Get value of last field associated with header named
>             'name'; return 'default' if no such value exists.
>         """
>         return self[name][-1] if name in self else default
>     def getlist(self, name):
>         """
>             Get values of all fields associated with header named 'name'.
>         """
>         return self.get(name, [])
>     def pretty(self, header=True):
>         lines = []
>         if header:
>             lines.append(
>                 ["%s".ljust(10).rjust(20) % h for h in self.headers()])
>         lines.append(
>             ["%s".ljust(10).rjust(20) % v for v in self.fields()])
>         return "\n\n".join(["|".join(line).strip() for line in lines])
>     def __getslice__(self, start=0, stop=None):
>         return self.fields()[start: stop]
>
>
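
To make the API discussion concrete, here is how the accessors behave on a
row with a duplicated header.  This is a hedged Python 3 sketch, not the
quoted code itself: MultiRecord is a hypothetical, cut-down stand-in for the
Record class above, and the sample row is invented.

```python
class MultiRecord(dict):
    """Hypothetical Python 3 stand-in for the Record class quoted above."""

    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        super().__init__()
        # Tuples keep the source order and make the row read-only.
        self._headers = tuple(headers)
        self._fields = tuple(fields)
        for header, value in zip(self._headers, self._fields):
            self.setdefault(header, []).append(value)

    def getfirst(self, name, default=None):
        return self[name][0] if name in self else default

    def getlast(self, name, default=None):
        return self[name][-1] if name in self else default

    def getlist(self, name):
        return self.get(name, [])


rec = MultiRecord(["name", "phone", "phone"],
                  ["Alice", "555-0100", "555-0199"])

print(rec["phone"])                  # all values, in column order
print(rec.getfirst("phone"))         # value from the first "phone" column
print(rec.getlast("phone"))          # value from the last "phone" column
print(rec.getfirst("email", "n/a"))  # missing header falls back to default
```

Plain lookup always returns a list, so code that expects a single value has
to go through getfirst() or getlast() explicitly.
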
> from csv import reader
>
> Undefined = object()
> def iterrecords(f, headers=None, bucketheader=Undefined,
>     missingfieldsok=False, dialect="excel", *args, **kw):
>     rows = reader(f, dialect, *args, **kw)
>     for row in rows:
>         if not row:
>             # Skip blank lines, which the reader yields as empty lists.
>             continue
>         if not headers:
>             # The first non-empty row supplies the headers.
>             headers = row
>             continue
>         headcount = len(headers)
>         rowcount = len(row)
>         rowheaders = headers
>         if rowcount < headcount:
>             if not missingfieldsok:
>                 raise KeyError("row has fewer values than headers")
>             # Record needs equal lengths; one option is to keep only the
>             # headers that have a value in this row.
>             rowheaders = headers[:rowcount]
>         elif rowcount > headcount:
>             if bucketheader is Undefined:
>                 raise KeyError("row has more values than headers")
>             # Build a new list so the shared headers are not modified.
>             extra = [bucketheader] * (rowcount - headcount)
>             rowheaders = list(headers) + extra
>         record = Record(rowheaders, row)
>         yield record
>
>
>
>
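
The row-length policy in the generator above (reject short rows, and either
reject long rows or file the surplus under a "bucket" header) can be tried
out in isolation.  The following is a rough Python 3 sketch under my own
assumptions: iter_multidicts is a hypothetical helper that yields plain
dicts of lists instead of Record instances, and it raises ValueError where
the quoted code raises KeyError.

```python
import csv
import io

_UNDEFINED = object()  # sentinel meaning "no bucket header was given"


def iter_multidicts(f, bucketheader=_UNDEFINED, **kw):
    """Yield one {header: [values]} dict per non-empty row."""
    headers = None
    for row in csv.reader(f, **kw):
        if not row:
            continue  # csv.reader yields [] for blank lines
        if headers is None:
            headers = row  # first non-empty row supplies the headers
            continue
        rowheaders = list(headers)  # copy, so headers is never mutated
        extra = len(row) - len(headers)
        if extra < 0:
            raise ValueError("row has fewer values than headers")
        if extra > 0:
            if bucketheader is _UNDEFINED:
                raise ValueError("row has more values than headers")
            rowheaders += [bucketheader] * extra
        record = {}
        for header, value in zip(rowheaders, row):
            record.setdefault(header, []).append(value)
        yield record


# Invented sample: duplicated "a" header, a blank line, and an over-long row.
sample = io.StringIO("a,b,a\n1,2,3\n\n4,5,6,7,8\n")
records = list(iter_multidicts(sample, bucketheader="extra"))
print(records[0])
print(records[1])
```

Copying the header list per row matters here: extending the shared list in
place would make every later row look too short.
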
> I should probably also have noted the dictionary API behaviour since it's
> not explicit:
> keys() -> list of unique header names.
> values() -> list of field value lists.
> items() -> [(header, field-list),] pairs.
>
> And then of course dictionary lookup.  One thing that comes to mind is
> that there's really no value to the unordered sequence of value lists;
> there could be some value in extending an OrderedDict, making all the
> iteration methods consistent and therefore something that could be used to
> do something like write values, etc.
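
The OrderedDict idea could look roughly like this.  It is a minimal Python 3
sketch, with OrderedRecord as a hypothetical name; since Python 3.7 a plain
dict also preserves insertion order, so the base class mostly documents the
intent.

```python
from collections import OrderedDict


class OrderedRecord(OrderedDict):
    """Hypothetical variant: keys iterate in first-seen header order."""

    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        super().__init__()
        for header, value in zip(headers, fields):
            self.setdefault(header, []).append(value)


rec = OrderedRecord(["b", "a", "b"], ["1", "2", "3"])
print(list(rec.keys()))
print(list(rec.items()))
```

Duplicate headers still collapse into a single key, so reproducing the
original column layout exactly would still need the stored header and field
sequences.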