sampling items from a nested list

Steven Bethard steven.bethard at gmail.com
Wed Feb 16 17:33:38 EST 2005


So, I have a list of lists where every item in every sublist has the 
same form: an (item, label) tuple.  It looks something like:

py> data = [[('a', 0),
...          ('b', 1),
...          ('c', 2)],
...
...         [('d', 2),
...          ('e', 0)],
...
...         [('f', 0),
...          ('g', 2),
...          ('h', 1),
...          ('i', 0),
...          ('j', 0)]]

Now, I'd like to sample down the number of items in each sublist in the 
following manner.  I need to count the occurrences of each 'label' (the 
second item in each tuple) across all the items of all the sublists, and 
then randomly remove items until every 'label' occurs equally often, 
that is, as often as the rarest label.  So, given the data above, one 
possible resampling would be:

     [[('b', 1),
       ('c', 2)],

      [('e', 0)],

      [('g', 2),
       ('h', 1),
       ('i', 0)]]
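
The balancing condition can be checked mechanically. The following is a minimal sketch, assuming Python 3; the helper name `label_counts` is hypothetical and is not part of the original post. It tallies the labels across all sublists for both the original data and the resampling shown above.

```python
from collections import Counter

def label_counts(data):
    """Count how often each label (second tuple item) occurs in all sublists."""
    # label_counts is an illustrative helper, not something from the post.
    return Counter(label for group in data for _item, label in group)

data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]

resampled = [[('b', 1), ('c', 2)],
             [('e', 0)],
             [('g', 2), ('h', 1), ('i', 0)]]

# Original data: label 0 occurs 5 times, label 1 twice, label 2 three times.
print(label_counts(data))
# Resampled data: every label occurs twice, matching the rarest label (1).
print(label_counts(resampled))
```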

Note that there are now only 2 examples of each label.  I have code that 
does this, but it's a little complicated:

py> import random
py> def resample(data):
...     # determine which indices are associated with each label
...     label_indices = {}
...     for i, group in enumerate(data):
...         for j, (item, label) in enumerate(group):
...             label_indices.setdefault(label, []).append((i, j))
...     # sample each set of indices down
...     min_count = min(len(indices)
...                     for indices in label_indices.itervalues())
...     for label, indices in label_indices.iteritems():
...         label_indices[label] = random.sample(indices, min_count)
...     # return the resampled data
...     return [[(item, label)
...              for j, (item, label) in enumerate(group)
...              if (i, j) in label_indices[label]]
...             for i, group in enumerate(data)]
...
py>
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]
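
For readers on Python 3, where `itervalues()` and `iteritems()` no longer exist, the same approach might be sketched as below. This is an illustration under stated assumptions, not the code from the post: the name `resample_py3` and the optional `rng` parameter are hypothetical, and the kept positions are collected into a single set so that the membership test in the final comprehension does not scan a list.

```python
import random

def resample_py3(data, rng=random):
    """Randomly keep the same number of items for every label.

    Hypothetical Python 3 variant; `rng` may be the `random` module or a
    `random.Random` instance (an assumption added here for repeatability).
    """
    # Group the (sublist index, item index) positions by label.
    positions = {}
    for i, group in enumerate(data):
        for j, (_item, label) in enumerate(group):
            positions.setdefault(label, []).append((i, j))
    # Every label is sampled down to the count of the rarest label.
    min_count = min(len(p) for p in positions.values())
    keep = set()
    for p in positions.values():
        keep.update(rng.sample(p, min_count))
    # Rebuild the nested list, preserving sublist structure and item order.
    return [[pair for j, pair in enumerate(group) if (i, j) in keep]
            for i, group in enumerate(data)]

data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]

# The result varies from run to run; pass random.Random(seed) to repeat it.
print(resample_py3(data))
```

As in the original function, a sublist can end up empty if none of its items are selected, and an empty `data` raises `ValueError` from `min()`.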

Can anyone see a simpler way of doing this?

Steve


