[SciPy-User] Filtering record arrays by contents of columns using `ismember`-like syntax

josef.pktd at gmail.com
Mon May 23 12:19:17 EDT 2011


On Mon, May 23, 2011 at 11:07 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Mon, May 23, 2011 at 3:37 AM, Chris Rodgers
> <chris.rodgers at berkeley.edu> wrote:
>> A common task for my work is slicing a record array to find records
>> matching some set of criteria. For any criteria beyond the most
>> simple, I find the syntax grows complex and implementation efficiency
>> starts to matter a lot. I wrote a class to encapsulate this filtering,
>> and I wanted to share it with the list and also get feedback on: 1) am
>> I approaching this problem correctly?, and 2) is the implementation
>> efficient for very large arrays?
>>
>
> I also have this problem quite frequently. I would be interested to
> see something like this as a method of arrays and sub-classes, or at
> least as a convenience function.
>
>> Here's a short script showing the desired functionality. Generally
>> 'col3' contains data that I want to send to some other process, and
>> 'col1' and 'col2' are data parameters that I want to filter by.
>>
>> x = np.recarray(shape=(100000,),
>>    dtype=[('col1', int), ('col2', int), ('col3', float)])
>>
>> # Fill x with actual data here
>>
>> # Find all records where 'col2' is 1, 2, or 4
>> print x[(x['col2'] == 1) | (x['col2'] == 2) | (x['col2'] == 4)]
>>
>> # Find all records where 'col1' is 1; and 'col2' is 1, 2, or 4
>> print x[(x['col1'] == 1) & \
>>    ((x['col2'] == 1) | (x['col2'] == 2) | (x['col2'] == 4))]
>>
>> This is an "idiomatic" usage of record arrays
>> (http://mail.scipy.org/pipermail/numpy-discussion/2009-February/040684.html).
>> I certainly write this kind of code a lot. Problem #1 is that the
>> syntax is hard to read for long chains of conditionals. Problem 2 is
>> that it's hard to generalize the code when the list of acceptable
>> values ([1, 2, 4] in this example) has arbitrary length. For that, you
>> need an equivalent to `ismember` in Matlab.
>>
>> # Here's one way to do it but it's very slow for large datasets
>> print x[np.array([t in [1,2,4] for t in x['col2']])]
>>
>> `in1d` will add this functionality but it's not available in my version
>> of numpy, the one packaged in Ubuntu 10.04. `intersect1d` and
>> `setmember1d` don't work if the lists contain non-unique values. (See
>> http://stackoverflow.com/questions/1273041/how-can-i-implement-matlabs-ismember-command-in-python)
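
for reference, in1d landed in numpy 1.4, and newer numpy spells the same
membership test np.isin. A minimal sketch on a made-up toy column (not the
record array above):

```python
import numpy as np

# ismember-style test: True where each element of col2 is in the value list.
# np.isin (newer numpy; np.in1d in older releases) handles non-unique
# values in both arguments, unlike the old setmember1d.
col2 = np.array([1, 2, 3, 5, 4, 2])
mask = np.isin(col2, [1, 2, 4])   # True where col2 is 1, 2, or 4
print(mask)
```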
>>
>
> Your version appears to be much faster than in1d for this and
> comparable to doing it explicitly
>
> [~]
> [1]: nobs = 1000000
>
> [~]
> [2]: x = np.zeros(shape=(nobs,),
>   ...:    dtype=[('col1', int), ('col2', int), ('col3', float)])
>
> [~]
> [3]: np.random.seed(12345)
>
> [~]
> [4]: x['col1'] = np.random.randint(1,6,size=nobs)
>
> [~]
> [5]: x['col2'] = np.random.randint(1,6,size=nobs)
>
> [~]
> [6]: ok_value_list = [1,2,4]
>
> [~]
> [7]: colname = 'col2'
>
> [~]
> [8]: timeit reduce(np.logical_or, [x[colname] == i for i in ok_value_list])
> 100 loops, best of 3: 14.9 ms per loop
>
> [~]
> [9]: timeit np.in1d(x[colname],ok_value_list)
> 1 loops, best of 3: 260 ms per loop
>
> [~]
> [10]: paste
> #<snip>
> # Picker class using pick method 2, which I found to be slightly faster
> ## -- End pasted text --
>
> [~]
> [11]: p = Picker(x)
>
> [~]
> [12]: p.pick_mask(col2=[1,2,4])
> [12]: array([ True,  True,  True, ...,  True, False,  True], dtype=bool)
>
> [~]
> [13]: timeit p=Picker(x);p.pick_mask(col2=[1,2,4])
> 100 loops, best of 3: 14.9 ms per loop
>
> [~]
> [14]: ((x['col2'] == 1) | (x['col2'] == 2) | (x['col2'] == 4))
> [14]: array([ True,  True,  True, ...,  True, False,  True], dtype=bool)
>
> [~]
> [15]: timeit ((x['col2'] == 1) | (x['col2'] == 2) | (x['col2'] == 4))
> 100 loops, best of 3: 14.8 ms per loop
>
>
>
>> Anyway, I wrote a simple object `Picker` to encapsulate the desired
>> functionality. You can specify an arbitrary set of columns to filter
>> by, and acceptable values for each column. So the above code would
>> be re-written as:
>>
>> p = Picker(data=x)
>> # Mask of x that matches the desired values
>> print p.pick_mask(col1=[1], col2=[1,2,4])
>
> It might be nice to have a logical keyword so that these conditions
> could be 'and' or 'or'.
>
>> # Or if you just want 'col3' from the filtered records
>> print p.pick_data('col3', col1=[1], col2=[1,2,4])
>>
>> I think the syntax is much cleaner. Another benefit is that, if there
>> were hundreds of acceptable values for 'col2' instead of three, the
>> code would not be any longer.
>>
>> Here's the class definition:
>>
>> import numpy as np
>> class Picker:
>>    def __init__(self, data):
>>        self._data = data
>>        self._calculate_pick_mask = self._calculate_pick_mask_meth1
>>
>>    def pick_data(self, colname, **kwargs):
>>        return self._data[colname][self._calculate_pick_mask(kwargs)]
>>
>>    def pick_mask(self, **kwargs):
>>        return self._calculate_pick_mask(kwargs)
>>
>>    def _calculate_pick_mask_meth1(self, kwargs):
>>        # Begin with all true
>>        mask = np.ones(self._data.shape, dtype=bool)
>>
>>        for colname, ok_value_list in kwargs.items():
>>            # OR together all records with _data['colname'] in ok_value_list
>>            one_col_mask = np.zeros_like(mask)
>>            for ok_value in ok_value_list:
>>                one_col_mask = one_col_mask | (self._data[colname] == ok_value)

you could make this assignment and the next one in-place to save an intermediate array:

            one_col_mask |= (self._data[colname] == ok_value)

>>
>>            # AND together the full mask with the results from this column
>>            mask = mask & one_col_mask

            mask &= one_col_mask
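
spelled out on a toy column (made-up data, just to show the in-place `|=`
accumulation):

```python
import numpy as np

col = np.array([1, 2, 3, 4, 2, 5])
mask = np.zeros(col.shape, dtype=bool)
for v in [1, 2, 4]:
    mask |= (col == v)   # in-place OR: no intermediate mask array
# mask is now True where col is 1, 2, or 4
print(mask)
```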

If I remember correctly from the numpy ticket, in1d was optimized for the
case where the second array is also large, but I thought that had already
changed.
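
the sort-based idea behind in1d can be sketched like this (my
reconstruction of the approach, not the actual numpy implementation):

```python
import numpy as np

def ismember_sorted(arr, values):
    # Membership via a sorted lookup table: roughly O(n log m) for m values,
    # which pays off when the value list is large.
    table = np.unique(values)
    idx = np.searchsorted(table, arr)
    idx[idx == len(table)] = 0   # clamp indices that ran off the end
    return table[idx] == arr

result = ismember_sorted(np.array([1, 3, 4, 5, 2]), [1, 2, 4])
print(result)
```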

Josef

>>
>>        return mask
>>
>>    def _calculate_pick_mask_meth2(self, kwargs):
>>        mask = reduce(np.logical_and,
>>                        [reduce(np.logical_or,
>>                                [self._data[colname] == ok_value
>>                                    for ok_value in ok_value_list])
>>                            for colname, ok_value_list in kwargs.items()])
>>        return mask
>>
>>
>> I tried several different implementations of _calculate_pick_mask.
>> Method 1 is the easiest to read. Method 2 is more clever but it didn't
>> actually run any faster for me. Both approaches are much faster than
>> the pure-Python `[t in ok_value_list for t in arr]` approach.
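
for the cookbook, the whole filter can be condensed into one standalone
function, essentially method 2 with an explicit return (the record array
below is made up for illustration):

```python
import numpy as np
from functools import reduce   # builtin in Python 2, moved here in Python 3

def pick_mask(data, **criteria):
    # OR over the acceptable values within each column,
    # then AND the per-column masks across columns.
    return reduce(np.logical_and,
                  [reduce(np.logical_or,
                          [data[col] == v for v in values])
                   for col, values in criteria.items()])

x = np.zeros(6, dtype=[('col1', int), ('col2', int), ('col3', float)])
x['col1'] = [1, 1, 2, 1, 3, 1]
x['col2'] = [1, 3, 2, 4, 2, 2]
m = pick_mask(x, col1=[1], col2=[1, 2, 4])
print(m)
```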
>>
>> Is this the right data model for this kind of problem? Is this the
>> best way to implement the filtering? For my datasets, this kind of
>> filtering operation actually ends up taking most of the calculation
>> time, so I'd like to do it quickly while keeping the code readable.
>>
>> Thanks for any comments!
>> Chris
>>
>
> I'm also interested in what others think. At the very least, I think
> this should be added to the cookbook. It would make my code much cleaner in
> places.
>
> Skipper
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
