Best/better way? (histogram)

Wed Jan 28 09:08:46 EST 2009

> 
> The simplest. That would be #3, cleaned up a bit:
> 
> from collections import defaultdict
> from csv import DictReader
> from pprint import pprint
> from operator import itemgetter
> 
> def rows(filename):
>     infile = open(filename, "rb")
>     for row in DictReader(infile):
>         yield row["CATEGORIES"]
> 
> def stats(values):
>     histo = defaultdict(int)
>     for v in values:
>         histo[v] += 1
>     return sorted(histo.iteritems(), key=itemgetter(1), reverse=True)
> 
> Should you need the inner dict (which doesn't seem to offer any additional
> information) you can always add another step:
> 
> def format(items):
>     result = []
>     for raw, count in items:
>         leaf = raw.rpartition("|")[2]
>         result.append((raw, dict(count=count, leaf=leaf)))
>     return result
> 
> pprint(format(stats(rows("sampledata.csv"))), indent=4, width=60)
> 
> By the way, if you had broken the problem into steps like above you could have
> offered four different stats() functions, which would have been a bit
> easier to read...
> 

Thank you.  The code reorganization does make it easier to read.

I'll have to look up the docs on itemgetter().

:)
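
For reference, a minimal sketch of what itemgetter(1) appears to be doing in
the quoted stats() function. The sample values are made up and stand in for
the CATEGORIES column of the CSV file, and items() is used in place of
iteritems() so that the snippet is not tied to Python 2:

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical sample values standing in for the CATEGORIES column.
values = ["a|b", "a|c", "a|b", "a|d", "a|b", "a|c"]

# Count how often each value occurs, as stats() does.
histo = defaultdict(int)
for v in values:
    histo[v] += 1

# itemgetter(1) builds a callable that returns element 1 of its argument,
# so as a sort key it behaves like: lambda pair: pair[1]
# Here each pair is (value, count), so the sort is by count, highest first.
by_count = sorted(histo.items(), key=itemgetter(1), reverse=True)

print(by_count)
```

With these sample values the most frequent category, "a|b", comes first.
itemgetter(0) would sort by the category string instead.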