[Python-ideas] csv.DictReader could handle headers more intelligently.

Wed Jan 30 13:59:17 CET 2013

> So I've done some thinking on it, a bit of research, etc., and have worked with a lot of different CSV content.  There are a lot of parallels between the name/value pairs of an HTML form submission, and our use case.  
> 
> Namely:
> 	- There's typically only one value per name, but it's perfectly legal to have multiple values assigned to a name.
> 	- When there are multiple values assigned to a name, order can be very important.
> 	- They initially made the mistake of mapping field names to a single value when there was only one value, and to multiple values when there were several.
> 	- Those approaches have since been deprecated, and their FieldStorage now always maps field names to lists of values.
> 
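
To make the form parallel concrete, here is a small Python 3 sketch using the standard library's urllib.parse.parse_qs, which maps every name to a list of values in submission order. The query string is made up for illustration and is not taken from the post.

```python
from urllib.parse import parse_qs

# Hypothetical form submission: "tag" appears twice, "name" once.
query = "tag=red&tag=blue&name=widget"

parsed = parse_qs(query)

# Every name maps to a list, even when only one value was submitted,
# and repeated names keep their values in the order they appeared.
tags = parsed["tag"]
names = parsed["name"]
```

The single-valued field still comes back as a one-element list, which is the same shape proposed here for CSV records.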
> I've implemented a Record class that I'm going to pitch for feedback.  Although I followed the FieldStorage API for a couple of methods, it didn't translate very well because their values are complex objects.  This Record class is a dictionary type that maps header names to lists of the values from the columns labeled by that header.  Most lists hold a single value because headers usually aren't duplicated.  When a list holds multiple values, they appear in the order they were read from the CSV file.  The API provides convenience methods for getting the first or last value listed for a given column name, making it very easy to work with singular values when desired.  The dictionary API will likely be the primary mechanism for interacting with a record; however, the record also knows the header and row sequences it was built from, and provides sequential access to them as well.  In addition to helping with non-standard CSV, transformations, etc., this information makes it possible to reproduce correctly ordered CSV.
> 
> I don't really know yet whether it would make sense to support any kind of manipulation of values on the record instances themselves, versus using more of a copy()/update() approach to defining modified records, but I did decide to wrap the row values in a tuple, making them read-only.  This was for two reasons.  One was to address a potential inconsistency that might arise should we decide to support editing; the other is that the record is the representation of the row read from the source file, so it should always accurately reflect that content.
> 
> About the code: I wrote it tonight and tested it for an hour, so it's not meant to be perfect or final, but it should stir up a very concrete discussion about the API, if nothing else ;-)  I included a generator that seemed to work on some test files.  It most definitely is not meant to be critiqued or to be a distraction, but I've included it in case anyone ends up wanting to investigate things further.  Although the iterator function has a slightly different signature than DictReader, that's not because I'm trying to change anything; please keep in mind the generator was just a test.  Also, I'd like to mention one last time that I don't think we should change what exists to reflect any of these changes: I was thinking it would be a new set of classes and functions that would become the preferred implementation in the future.
> 
> 
> 
> 
>> class Record(dict):
>>     def __init__(self, headers, fields):
>>         if len(headers) != len(fields):
>>             # I don't make decisions about how gaps should be filled. 
>>             raise ValueError("header/field size mismatch")
>>         self._headers = headers
>>         self._fields = tuple(fields)
>>         super(Record, self).__init__()
>>         # Group values by header, preserving the order they were read in.
>>         for h, v in self.fielditems():
>>             self.setdefault(h, []).append(v)
>>     def fielditems(self):
>>         """
>>             Get header,value sequence that reflects CSV source.  
>>         """
>>         return zip(self.headers(),self.fields())
>>     def headers(self):
>>         """
>>             Get ordered sequence of headers reflecting CSV source. 
>>         """
>>         return self._headers
>>     def fields(self):
>>         """
>>             Get ordered sequence of values reflecting CSV row source. 
>>         """
>>         return self._fields
>>     def getfirst(self, name, default=None):
>>         """
>>             Get value of first field associated with header named  
>>             'name'; return 'default' if no such value exists. 
>>         """
>>         return self[name][0] if name in self else default
>>     def getlast(self, name, default=None):
>>         """
>>             Get value of last field associated with header named  
>>             'name'; return 'default' if no such value exists. 
>>         """
>>         return self[name][-1] if name in self else default
>>     def getlist(self, name): 
>>         """
>>             Get values of all fields associated with header named 'name'.
>>         """
>>         return self.get(name, [])
>>     def pretty(self, header=True):
>>         lines = []
>>         if header:
>>             lines.append(
>>                 ["%s".ljust(10).rjust(20) % h for h in self.headers()])
>>         lines.append(
>>             ["%s".ljust(10).rjust(20) % v for v in self.fields()])
>>         return "\n\n".join(["|".join(line).strip() for line in lines])
>>     def __getslice__(self, start=0, stop=None):
>>         return self.fields()[start: stop]
>> 
>> 
>> import itertools
>> from csv import reader
>> 
>> Undefined = object()
>> def iterrecords(f, headers=None, bucketheader=Undefined, 
>>     missingfieldsok=False, dialect="excel", *args, **kw):
>>     rows = reader(f, dialect, *args, **kw)
>>     for row in itertools.ifilter(None, rows):
>>         if not headers:
>>             headers = row
>>             headcount = len(headers)
>>             print headers
>>             continue
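>>         rowcount = len(row)
>>         # Recompute so caller-supplied headers are handled as well.
>>         headcount = len(headers)
>>         rowheaders = headers
>>         if rowcount < headcount:
>>             if not missingfieldsok:
>>                 raise KeyError("row has fewer values than headers")
>>         elif rowcount > headcount: 
>>             if bucketheader is Undefined:
>>                 raise KeyError("row has more values than headers")
>>             # Build a new list so the shared headers are not mutated.
>>             rowheaders = list(headers) + [bucketheader] * (rowcount - headcount)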
>>         record = Record(rowheaders, row)
>>         yield record
> 

I should probably also have noted the dictionary API behaviour, since it's not explicit: 
	keys() -> list of unique header names.
	values() -> list of field-value lists.
	items() -> [(header, field-list),] pairs.
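
As a rough illustration of that behaviour, here is a Python 3 sketch. The Record below is a trimmed stand-in for the quoted class (same constructor idea, none of the helper methods), and the sample headers and values are invented.

```python
class Record(dict):
    """Trimmed Python 3 stand-in for the quoted Record class (sketch only)."""

    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        super().__init__()
        self._headers = list(headers)
        self._fields = tuple(fields)
        # Group values by header, in the order the columns were read.
        for h, v in zip(self._headers, self._fields):
            self.setdefault(h, []).append(v)


# Hypothetical row with a duplicated "phone" header.
rec = Record(["name", "phone", "phone"], ["Ann", "555-0100", "555-0199"])

keys = sorted(rec.keys())   # unique header names
phones = rec["phone"]       # both values, in column order
items = dict(rec.items())   # header -> list-of-values pairs
```

The duplicated header collapses to one key whose list holds both values; the original column order is only recoverable from the stored header and field sequences.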

And then of course dictionary lookup.  One thing that comes to mind is that there's really no value in an unordered sequence of value lists; there could be some value in extending an OrderedDict, making all the iteration methods consistent and therefore usable for something like writing values, etc.
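
A minimal Python 3 sketch of what the OrderedDict idea might look like follows. OrderedRecord is a hypothetical name, not something from the code above, and the sample data is invented; it only shows that keys(), values() and items() would then share one predictable order that can drive a csv.writer.

```python
import csv
import io
from collections import OrderedDict


class OrderedRecord(OrderedDict):
    """Hypothetical ordered variant of Record (an assumption, not the posted code)."""

    def __init__(self, headers, fields):
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        super().__init__()
        for h, v in zip(headers, fields):
            self.setdefault(h, []).append(v)


rec = OrderedRecord(["id", "tag", "id"], ["1", "x", "2"])

# Keys come back in first-appearance order, so these two rows line up.
header_row = [h for h, vals in rec.items() for _ in vals]
value_row = [v for vals in rec.values() for v in vals]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(header_row)
writer.writerow(value_row)
output = buf.getvalue()
```

Note that duplicated headers end up grouped together ("id,id,tag"), so reproducing the exact source column order would still need the stored header and field sequences.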
