Best/better way? (histogram)

Wed Jan 28 03:52:49 EST 2009

Bernard Rankin wrote:

> I've got several versions of code here to generate a histogram-esque
> structure from rows in a CSV file.
> 
> The basic approach is to use a Dict as a bucket collection to count
> instances of data items.
> 
> Other than the try/except(KeyError) idiom for dealing with new bucket
> names, which I don't like as it describes the initial state of a KeyValue
> _after_ you've just described what to do with the existing value, I've
> come up with a few other methods.
> 
> What seems like the most reasonable approach?

The simplest. That would be #3, cleaned up a bit:

from collections import defaultdict
from csv import DictReader
from pprint import pprint
from operator import itemgetter

def rows(filename):
    infile = open(filename, "rb")
    for row in DictReader(infile):
        yield row["CATEGORIES"]

def stats(values):
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return sorted(histo.iteritems(), key=itemgetter(1), reverse=True)
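
The code above is written for Python 2 (iteritems(), and "rb" mode for the csv module). For anyone trying the counting step on Python 3, here is a minimal, self-contained sketch. It assumes Python 3, so items() replaces iteritems(), and the sample values are made up for illustration; they are not from the original CSV file.

```python
from collections import defaultdict
from operator import itemgetter

def stats(values):
    # Missing keys start at int() == 0, so no try/except KeyError is needed.
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    # Most frequent bucket first.
    return sorted(histo.items(), key=itemgetter(1), reverse=True)

# Hypothetical sample data, not taken from the original file.
sample = ["a|b", "a|c", "a|b", "a|b", "a|c", "d"]
result = stats(sample)
print(result)
```

The defaultdict(int) is what removes the "describe the initial state afterwards" problem: the initial value is declared once, before the loop, and the loop body only says what to do with an existing count.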

Should you need the inner dict (which doesn't seem to offer any additional
information) you can always add another step:

def format(items):
    result = []
    for raw, count in items:
        leaf = raw.rpartition("|")[2]
        result.append((raw, dict(count=count, leaf=leaf)))
    return result

pprint(format(stats(rows("sampledata.csv"))), indent=4, width=60)
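
To see the three steps chained together without needing sampledata.csv, here is a rough Python 3 sketch that reads from an in-memory string instead. The CSV contents are invented for illustration; only the CATEGORIES column name comes from the code above. Two things are assumptions of this sketch rather than part of the original: rows() takes an open file object instead of a filename, and format() is renamed add_leaf() so it does not shadow the builtin.

```python
import io
from collections import defaultdict
from csv import DictReader
from operator import itemgetter

# Invented sample data; only the CATEGORIES column name comes from the post.
SAMPLE_CSV = """ID,CATEGORIES
1,root|fruit|apple
2,root|fruit|pear
3,root|fruit|apple
4,root
"""

def rows(infile):
    # Takes an open text-mode file object (an assumption for this sketch).
    for row in DictReader(infile):
        yield row["CATEGORIES"]

def stats(values):
    # Count occurrences, then sort by count, most frequent first.
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return sorted(histo.items(), key=itemgetter(1), reverse=True)

def add_leaf(items):
    # rpartition("|")[2] is the text after the last "|",
    # or the whole string when there is no "|".
    return [(raw, dict(count=count, leaf=raw.rpartition("|")[2]))
            for raw, count in items]

result = add_leaf(stats(rows(io.StringIO(SAMPLE_CSV))))
print(result)
```

Because each step only consumes the output of the previous one, any of them can be swapped out (a different stats(), say) without touching the others.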

By the way, if you had broken the problem into steps like above, you could
have offered four different stats() functions, which would have been a bit
easier to read...

Peter