sampling items from a nested list

Thu Feb 17 15:50:27 EST 2005

Steven Bethard wrote:

> py> data = [[('a', 0),
> ...          ('b', 1),
> ...          ('c', 2)],
> ...
> ...         [('d', 2),
> ...          ('e', 0)],
> ...
> ...         [('f', 0),
> ...          ('g', 2),
> ...          ('h', 1),
> ...          ('i', 0),
> ...          ('j', 0)]]
>
> I need to count the occurrences of each 'label' (the second item in
> each tuple) in all the items of all the sublists, and randomly remove
> some items until the number of occurrences of each 'label' is equal.

If the tuples are "heavier" than this, you can avoid comparing them
using the following algorithm (which probably still leaves some room for
optimization, e.g. simpler return_list building [or returning a
generator instead of a list], or directly building the sample set
instead of converting a random.sample to a set):

def resample(data):
    counts = {}
    for i in data:
        for j in i:
            counts[j[1]] = counts.setdefault(j[1], 0) + 1

    min_count = min(counts.itervalues())

    # Same keys, so we can reuse the counts dictionary.
    indices = counts
    for label, count in counts.iteritems():
        indices[label] = set(random.sample(xrange(count), min_count))

    # Same thing with a generator expression, building a new dict (dunno
    # what's faster).
    #indices = dict(((label, set(random.sample(xrange(count), min_count)))
    #                for label, count in counts.iteritems()))

    # "done" maps labels to the number of tuples (with that label) which
    # have been added to return_list.
    done = {}
    return_list = []
    for i in data:
        return_list.append([])
        for j in i:
            if done.setdefault(j[1], 0) in indices[j[1]]:
                return_list[-1].append(j)
            done[j[1]] += 1
    return return_list

-- 
Felix Wiemann -- http://www.ososo.de/