Best/better way? (histogram)

Wed Jan 28 02:02:58 EST 2009

Hello,

I've got several versions of code here that generate a histogram-esque structure from rows in a CSV file.

The basic approach is to use a Dict as a bucket collection to count instances of data items.

Other than the try/except(KeyError) idiom for dealing with new bucket names, which I don't like because it describes the initial state of a key's value _after_ you've just described what to do with the existing value, I've come up with a few other methods.
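
For reference, the try/except(KeyError) idiom I mean looks roughly like the sketch below. The function name count_items and the sample list are made up for illustration and are not part of my script; note how the "first time seen" case only shows up after the increment.

```python
def count_items(items):
    """Count occurrences using the try/except KeyError idiom."""
    counts = {}
    for item in items:
        try:
            # Common case first: the bucket already exists.
            counts[item] += 1
        except KeyError:
            # Initial state is only described here, after the update above.
            counts[item] = 1
    return counts


# Hypothetical sample data, just to exercise the function.
sample = ["a", "b", "a", "c", "a"]
result = count_items(sample)
print(result)
```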

Which seems like the most reasonable approach?
Do you have any other ideas?
Is the try/except(KeyError) idiom really the best?

In the code below you will see several 4-line groups of code.  The n-th line of each group represents one solution to the problem.  (Cases 1 & 2 do differ from cases 3 & 4 in the final outcome.)
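
To make the difference in outcome concrete, here is a small stand-alone sketch (not the script itself; the three-item raw_categories list is invented for illustration). Cases 1 & 2 end up with a dict per bucket holding the leaf name and a count, while cases 3 & 4 end up with a bare integer per bucket.

```python
from collections import defaultdict

# Hypothetical input, standing in for the CATEGORIES column of the CSV.
raw_categories = [
    "computers|laptops|accessories",
    "computers|servers|accessories",
    "computers|servers|accessories",
]

# Cases 1 & 2: each bucket is a dict with the leaf name and a count.
dict_buckets = defaultdict(lambda: {'leaf': '', 'count': 0})
# Cases 3 & 4: each bucket is a plain integer count.
int_buckets = defaultdict(int)

for raw in raw_categories:
    bucket = dict_buckets[raw]
    bucket['count'] += 1
    bucket['leaf'] = raw.split('|')[-1]  # last path component
    int_buckets[raw] += 1

print(dict(dict_buckets))
print(dict(int_buckets))
```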

Thank you
:)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

from collections import defaultdict
from csv import DictReader
from pprint import pprint

dataFile = open("sampledata.csv")
dataRows = DictReader(dataFile)

catagoryStats = defaultdict(lambda : {'leaf' : '', 'count' : 0})
#catagoryStats = {}
#catagoryStats = defaultdict(int)
#catagoryStats = {}

for row in dataRows:
    catagoryRaw    = row['CATEGORIES']
    catagoryLeaf    = catagoryRaw.split('|').pop()

    ## csb => Catagory Stats Bucket
    ## multi-statement lines are used for ease of method switching.

    csb = catagoryStats[catagoryRaw]; csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #csb = catagoryStats.setdefault(catagoryRaw, {'leaf' : '', 'count' : 0}); csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #catagoryStats[catagoryRaw] += 1
    #catagoryStats[catagoryRaw] = catagoryStats.get(catagoryRaw, 0) + 1

catagoryStatsSorted = list(catagoryStats.items())  # list() so that .sort() also works on Python 3

catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)

pprint(catagoryStatsSorted, indent=4, width=60)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sampledata.csv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CATEGORIES,SKU
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|not_at_all_fun",12345

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
output: (in case #1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [1]: %run catstat.py
[   (   'computers|servers|accessories',
        {'count': 5, 'leaf': 'accessories'}),
    (   'toys|really|super_fun',
        {'count': 3, 'leaf': 'super_fun'}),
    (   'computers|laptops|accessories',
        {'count': 3, 'leaf': 'accessories'}),
    (   'toys|really|not_at_all_fun',
        {'count': 1, 'leaf': 'not_at_all_fun'})]