[Python-ideas] csv.DictReader could handle headers more intelligently.

Wed Jan 30 13:24:53 CET 2013

So I've done some thinking on it, a bit of research, etc., and have worked with a lot of different CSV content.  There are a lot of parallels between the name/value pairs of an HTML form submission and our use case.

Namely:
	- There's typically only one value per name, but it's perfectly legal to have multiple values assigned to a name.
	- When multiple values are assigned to a name, their order can be very important.
	- They originally made the mistake of mapping field names to a single value when there was only one value, and to a list of values when there were several.
	- Those approaches have since been deprecated, and their FieldStorage now always maps field names to lists of values.
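
To make the form parallel concrete, here is a minimal sketch using the standard library's query-string parser, which always maps each name to a list of values in the order they appeared.  It is written for Python 3 (urllib.parse.parse_qs), and the query string is made up for illustration:

```python
from urllib.parse import parse_qs

# Hypothetical form submission: "tag" appears twice, "user" once.
query = "user=shane&tag=csv&tag=headers"

form = parse_qs(query)

# Every name maps to a list, even when there is only one value,
# and repeated names keep their values in submission order.
print(form["user"])  # ['shane']
print(form["tag"])   # ['csv', 'headers']
```

The same shape (name to ordered list of values) is what the duplicated-header case in CSV calls for.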

I've implemented a Record class I'm going to pitch for feedback.  Although I followed the FieldStorage API for a couple of methods, it didn't translate very well because their values are complex objects.  This Record class is a dictionary type that maps header names to lists of the values from columns labeled by that same header.  Most lists have a single value because headers usually aren't duplicated.  When a list holds multiple values, they are listed in the order they were read from the CSV file.  The API provides convenience methods for getting the first or last value listed for a given column name, making it very easy to work with singular values when desired.  The dictionary API will likely be the primary mechanism for interacting with it; however, a record also knows the header and row sequences it was built from, and provides sequential access to them as well.  In addition to helping with non-standard CSV, performing transformations, etc., this information makes it possible to reproduce correctly ordered CSV.
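
As a rough illustration of the mapping just described (plain dictionaries and helper functions here, not the Record class itself; the header names and row values are invented), duplicated headers simply collect their values in file order:

```python
headers = ["name", "phone", "phone", "city"]   # "phone" is duplicated
row = ("Ann", "555-0100", "555-0199", "Oslo")  # a tuple, so read-only

# Map each header to the list of values found under it, in source order.
record = {}
for header, value in zip(headers, row):
    record.setdefault(header, []).append(value)


def getfirst(rec, name, default=None):
    """Return the first value listed under 'name', or 'default'."""
    return rec[name][0] if name in rec else default


def getlast(rec, name, default=None):
    """Return the last value listed under 'name', or 'default'."""
    return rec[name][-1] if name in rec else default


print(record["phone"])             # ['555-0100', '555-0199']
print(getfirst(record, "phone"))   # 555-0100
print(getlast(record, "phone"))    # 555-0199
print(getfirst(record, "name"))    # Ann

# The original header/value sequence is still available for ordered output.
print(list(zip(headers, row)))
```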

I don't really know yet whether it would make sense to support any kind of manipulation of values on the record instances themselves, versus using more of a copy()/update() approach to defining modified records or something, but I did decide to wrap the row values in a tuple, making them read-only.  This was for a couple of reasons.  One was to address a potential inconsistency that might arise should we decide to support editing; the other is that the record is the representation of the row read from the source file, so it should always accurately reflect that content.

About the code: I wrote it tonight and tested it for an hour, so it's not meant to be perfect or final, but it should stir up a very concrete discussion about the API, if nothing else ;-)  I included a generator that seemed to work on some test files.  It most definitely is not meant to be critiqued or to be a distraction, but I've included it in case anyone ends up wanting to investigate things further.  Although the iterator function has a slightly different signature than DictReader, that's not because I'm trying to change anything; please keep in mind the generator was just a test.  Also, I'd like to mention one last time that I don't think we should change what exists to reflect any of these changes: I was thinking it would be a new set of classes and functions that would become the preferred implementation in the future.

> class Record(dict):
>     def __init__(self, headers, fields):
>         if len(headers) != len(fields):
>             # I don't make decisions about how gaps should be filled.
>             raise ValueError("header/field size mismatch")
>         super(Record, self).__init__()
>         self._headers = headers
>         self._fields = tuple(fields)
>         # Group values under their header, preserving source order.
>         for h, v in self.fielditems():
>             self.setdefault(h, []).append(v)
>     def fielditems(self):
>         """
>             Get header,value sequence that reflects CSV source.  
>         """
>         return zip(self.headers(),self.fields())
>     def headers(self):
>         """
>             Get ordered sequence of headers reflecting CSV source. 
>         """
>         return self._headers
>     def fields(self):
>         """
>             Get ordered sequence of values reflecting CSV row source. 
>         """
>         return self._fields
>     def getfirst(self, name, default=None):
>         """
>             Get value of first field associated with header named
>             'name'; return 'default' if no such value exists. 
>         """
>         return self[name][0] if name in self else default
>     def getlast(self, name, default=None):
>         """
>             Get value of last field associated with header named  
>             'name'; return 'default' if no such value exists. 
>         """
>         return self[name][-1] if name in self else default
>     def getlist(self, name): 
>         """
>             Get values of all fields associated with header named 'name'.
>         """
>         return self.get(name, [])
>     def pretty(self, header=True):
>         lines = []
>         if header:
>             lines.append(
>                 ["%s".ljust(10).rjust(20) % h for h in self.headers()])
>         lines.append(
>             ["%s".ljust(10).rjust(20) % v for v in self.fields()])
>         return "\n\n".join(["|".join(line).strip() for line in lines])
>     def __getslice__(self, start=0, stop=None):
>         return self.fields()[start: stop]
> 
> 
> import itertools
> 
> Undefined = object()
> def iterrecords(f, headers=None, bucketheader=Undefined, 
>     missingfieldsok=False, dialect="excel", *args, **kw):
>     rows = reader(f, dialect, *args, **kw)
>     for row in itertools.ifilter(None, rows):
>         if not headers:
>             headers = row
>             print headers
>             continue
>         # Computed per row so it is defined even when headers are passed in.
>         headcount = len(headers)
>         rowcount = len(row)
>         rowheaders = headers
>         if rowcount < headcount:
>             if not missingfieldsok:
>                 raise KeyError("row has fewer values than headers")
>         elif rowcount > headcount: 
>             if bucketheader is Undefined:
>                 raise KeyError("row has more values than headers")
>             # Build a new list so the shared headers are not extended in place.
>             rowheaders = list(headers) + [bucketheader] * (rowcount - headcount)
>         record = Record(rowheaders, row)
>         yield record

# That's run within the context of the "csv" module to work… maybe.  
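
For anyone who wants to poke at the idea outside the csv module, here is a hedged sketch of a trimmed Python 3 adaptation.  The names Record and iterrecords follow the code above, but this is a cut-down version rather than the code itself: csv.reader is imported explicitly, the built-in filter replaces itertools.ifilter, only the methods used below are kept, and the headers/bucketheader/missingfieldsok options are left out.  The sample data is invented.

```python
import csv
import io


class Record(dict):
    """Trimmed Python 3 adaptation of the Record class above."""

    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        super().__init__()
        self._headers = tuple(headers)
        self._fields = tuple(fields)
        # Group values under their header, preserving source order.
        for h, v in zip(self._headers, self._fields):
            self.setdefault(h, []).append(v)

    def headers(self):
        return self._headers

    def fields(self):
        return self._fields

    def getfirst(self, name, default=None):
        return self[name][0] if name in self else default

    def getlast(self, name, default=None):
        return self[name][-1] if name in self else default


def iterrecords(f, dialect="excel", **kw):
    """Yield a Record per non-empty row; the first row supplies the headers."""
    headers = None
    for row in filter(None, csv.reader(f, dialect, **kw)):
        if headers is None:
            headers = row
            continue
        yield Record(headers, row)


# Invented sample with a duplicated "tag" header.
source = "id,tag,tag\r\n1,red,blue\r\n2,green,\r\n"
records = list(iterrecords(io.StringIO(source, newline="")))

print(records[0]["tag"])           # ['red', 'blue']
print(records[0].getfirst("id"))   # 1

# Headers and fields keep their source order, so the CSV can be written back.
out = io.StringIO(newline="")
writer = csv.writer(out)
writer.writerow(records[0].headers())
for rec in records:
    writer.writerow(rec.fields())
print(out.getvalue() == source)    # True
```

Writing from headers() and fields() rather than from the dictionary is what keeps the column order and the duplicated columns intact.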

Shane Green 
www.umbrellacode.com
408-692-4666 | shane at umbrellacode.com

On Jan 30, 2013, at 2:32 AM, Mark Hackett <mark.hackett at metoffice.gov.uk> wrote:

> On Tuesday 29 Jan 2013, Eric V. Smith wrote:
>> On 1/29/2013 3:37 PM, Stephen J. Turnbull wrote:
>>> Eric V. Smith writes:
>>>> True. But my point stands: it's possible to read the data (even with a
>>>> DictReader), do something with the data, and not know the column names
>>>> in advance. It's not an impossible use case.
>>> 
>>> But it is.  Dicts don't guarantee iteration order, so you will most
>>> likely get an output file that not only has a different delimiter, but
>>> a different order of fields.
>> 
>> We're going to have to agree to disagree. Order is not always important.
>> 
> 
> It's not impossible that we're living in a simulated world.
> 
> If you don't know what's in the csv file at all, then how do you know what 
> you're supposed to do with it.
> 
> Reading into a list will ensure order, so that is usable if order is 
> important. If the names aren't important at all, then you should drop the first 
> line and read it into a list again. If the names are important, you'd better 
> know what names the headers are using.
