[SciPy-User] Filtering record arrays by contents of columns using `ismember`-like syntax

Chris Rodgers chris.rodgers at berkeley.edu
Mon May 23 03:37:28 EDT 2011


A common task in my work is slicing a record array to find records
matching some set of criteria. For any criteria beyond the most
simple, the syntax grows complex and implementation efficiency
starts to matter a lot. I wrote a class to encapsulate this filtering,
and I wanted to share it with the list and get feedback on two
questions: 1) am I approaching this problem correctly, and 2) is the
implementation efficient for very large arrays?

Here's a short script showing the desired functionality. Generally
'col3' contains data that I want to send to some other process, and
'col1' and 'col2' are data parameters that I want to filter by.

import numpy as np

x = np.recarray(shape=(100000,),
    dtype=[('col1', int), ('col2', int), ('col3', float)])

# Fill x with actual data here

# Find all records where 'col2' is 1, 2, or 4
print x[(x['col2'] == 1) | (x['col2'] == 2) | (x['col2'] == 4)]

# Find all records where 'col1' is 1, and 'col2' is 1, 2, or 4
print x[(x['col1'] == 1) &
    ((x['col2'] == 1) | (x['col2'] == 2) | (x['col2'] == 4))]

This is an "idiomatic" usage of record arrays
(http://mail.scipy.org/pipermail/numpy-discussion/2009-February/040684.html).
I certainly write this kind of code a lot. Problem #1 is that the
syntax is hard to read for long chains of conditionals. Problem #2 is
that it's hard to generalize the code when the list of acceptable
values ([1, 2, 4] in this example) has arbitrary length. For that, you
need an equivalent of Matlab's `ismember`.

# Here's one way to do it but it's very slow for large datasets
print x[np.array([t in [1,2,4] for t in x['col2']])]

`in1d` adds exactly this functionality, but it's not available in my
version of numpy (the one installed via Synaptic on Ubuntu 10.04).
`intersect1d` and `setmember1d` don't work if the lists contain
non-unique values. (See
http://stackoverflow.com/questions/1273041/how-can-i-implement-matlabs-ismember-command-in-python)
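
In the meantime, here's a minimal sketch of an `ismember`-style test
built on `np.unique` and `np.searchsorted`; the name `ismember` is
just my own label, and it tolerates non-unique values in both inputs:

def ismember(a, b):
    # Boolean mask: True where each element of `a` occurs in `b`.
    # A sort/searchsorted sketch for numpy versions lacking `in1d`.
    b = np.unique(b)                # sorted, duplicates removed
    idx = np.searchsorted(b, a)     # insertion point of each a into b
    idx[idx == len(b)] = 0          # clamp indices that fell off the end
    return b[idx] == a              # keep only exact matches

# Same query as the list comprehension above, but vectorized
print x[ismember(x['col2'], [1, 2, 4])]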

Anyway, I wrote a simple class, `Picker`, to encapsulate the desired
functionality. You can specify an arbitrary set of columns to filter
by, and the acceptable values for each column. So the above code would
be rewritten as:

p = Picker(data=x)
# Mask of x that matches the desired values
print p.pick_mask(col1=[1], col2=[1,2,4])
# Or if you just want 'col3' from the filtered records
print p.pick_data('col3', col1=[1], col2=[1,2,4])

I think the syntax is much cleaner. Another benefit is that, if there
were hundreds of acceptable values for 'col2' instead of three, the
code would not be any longer.
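
For instance, picking records whose 'col2' matches any of a hundred
values (a made-up case) is still a one-liner:

print p.pick_mask(col1=[1], col2=range(100))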

Here's the class definition:

import numpy as np
class Picker:
    def __init__(self, data):
        self._data = data
        self._calculate_pick_mask = self._calculate_pick_mask_meth1

    def pick_data(self, colname, **kwargs):
        return self._data[colname][self._calculate_pick_mask(kwargs)]

    def pick_mask(self, **kwargs):
        return self._calculate_pick_mask(kwargs)

    def _calculate_pick_mask_meth1(self, kwargs):
        # Begin with all true
        mask = np.ones(self._data.shape, dtype=bool)

        for colname, ok_value_list in kwargs.items():
            # OR together all records where _data[colname] is in ok_value_list
            one_col_mask = np.zeros_like(mask)
            for ok_value in ok_value_list:
                one_col_mask = one_col_mask | (self._data[colname] == ok_value)

            # AND together the full mask with the results from this column
            mask = mask & one_col_mask

        return mask

    def _calculate_pick_mask_meth2(self, kwargs):
        # Same logic as meth1 written as nested reduces: OR within each
        # column's list of acceptable values, then AND across columns.
        mask = reduce(np.logical_and,
                      [reduce(np.logical_or,
                              [self._data[colname] == ok_value
                               for ok_value in ok_value_list])
                       for colname, ok_value_list in kwargs.items()])
        return mask


I tried several different implementations of _calculate_pick_mask.
Method 1 is the easiest to read. Method 2 is more clever, but it didn't
actually run any faster for me. Both approaches are much faster than
the pure-Python `[t in [1, 2, 4] for t in x['col2']]` approach.
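
For what it's worth, here's roughly how I compare them (a hypothetical
benchmark sketch: it assumes the class has been saved to picker.py, and
it fills the parameter columns with random small integers):

import timeit

setup = """
import numpy as np
from picker import Picker   # hypothetical module holding the class
x = np.recarray(shape=(100000,),
    dtype=[('col1', int), ('col2', int), ('col3', float)])
x['col1'] = np.random.randint(0, 10, x.shape)
x['col2'] = np.random.randint(0, 10, x.shape)
p = Picker(data=x)
"""

# Vectorized Picker vs. the pure-python list comprehension
print timeit.timeit("p.pick_mask(col1=[1], col2=[1, 2, 4])",
                    setup=setup, number=100)
print timeit.timeit("x[np.array([t in [1, 2, 4] for t in x['col2']])]",
                    setup=setup, number=100)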

Is this the right data model for this kind of problem? Is this the
best way to implement the filtering? For my datasets, this kind of
filtering operation actually ends up taking most of the calculation
time, so I'd like to do it quickly while keeping the code readable.

Thanks for any comments!
Chris

--
Chris Rodgers
Helen Wills Neuroscience Institute
University of California - Berkeley


