[scikit-learn] NearestNeighbors without replacement

Jacob Vanderplas jakevdp at cs.washington.edu
Mon Apr 2 14:15:29 EDT 2018


Hi Randy,
I think that approach is probably a good heuristic, but it will not
necessarily find the optimal result. That said, if you don't care about
having guarantees that you're finding the optimal pairing, but only that
you can find a reasonable set of pairs, it will probably work out fine.
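
To make the heuristic-versus-optimal distinction concrete, here is a toy sketch. The one-dimensional values are made up (not your data) and the function names are hypothetical. It matches each case in sequence to its nearest unused control, then compares the total distance with the best one-to-one assignment found by brute force:

```python
from itertools import permutations

cases = [10, 20]      # made-up 1-D covariate values for two cases
controls = [0, 16]    # made-up values for two controls


def greedy_total(cases, controls):
    """Match cases in order, each to its nearest still-unused control."""
    available = set(range(len(controls)))
    total = 0
    for x in cases:
        j = min(available, key=lambda c: abs(x - controls[c]))
        available.remove(j)
        total += abs(x - controls[j])
    return total


def brute_force_total(cases, controls):
    """Smallest total distance over every one-to-one assignment."""
    return min(
        sum(abs(x - controls[j]) for x, j in zip(cases, perm))
        for perm in permutations(range(len(controls)), len(cases))
    )


greedy = greedy_total(cases, controls)
best = brute_force_total(cases, controls)
print(greedy, best)
```

Here the first case takes control 16 (distance 6), which leaves the second case with control 0 (distance 20), for a total of 26. Swapping the two pairs gives 10 + 4 = 14. The brute-force search is only there to show the gap; it is not practical beyond a handful of points.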
   Jake

 Jake VanderPlas
 Senior Data Science Fellow
 Director of Open Software
 University of Washington eScience Institute

On Mon, Apr 2, 2018 at 10:47 AM, Randy Ellis <randalljellis at gmail.com>
wrote:

> Hi Jake,
>
> Thanks for the reply. Yes, trying this out resulted from looking for ways
> to implement propensity score matching in Python. I found a package,
> pscore_match (http://www.kellieottoboni.com/pscore_match/), but the
> matching was really poor. Specifically, I'm matching on age,
> race, gender, HIV status, hepatitis C status, and sickle-cell disease
> status. Using NearestNeighbors for matching performed far better; I was
> surprised at how well every factor was matched. The only issue is that
> it uses replacement.
>
> Here's what I'm currently testing. I need each case to match to 20
> controls, so since NearestNeighbors uses replacement, I'm matching each
> case to many controls (15000), taking the distances for all of the
> pairs, and retaining only the smallest distance for each control. Since
> many controls appear as neighbors of more than one case, the hope is
> that enough controls remain spread across the different cases that each
> case ends up being matched to 20 unique controls. Does this method make
> sense?
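
One possible reading of this procedure, as a rough sketch rather than the code actually used: retrieve many candidate controls per case with NearestNeighbors, visit all (case, control) pairs from closest to farthest, and accept a pair only if the control is still unused and the case still needs matches. The function name, the parameter names `n_matches` and `n_candidates`, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def match_without_replacement(cases, controls, n_matches, n_candidates):
    """Greedy matching in which each control is used at most once.

    cases, controls: arrays of shape (n_samples, n_features).
    n_matches: number of controls wanted per case.
    n_candidates: neighbors retrieved per case before removing re-used controls.
    """
    nn = NearestNeighbors(n_neighbors=n_candidates).fit(controls)
    dist, idx = nn.kneighbors(cases)   # each of shape (n_cases, n_candidates)

    flat_dist = dist.ravel()
    flat_ctrl = idx.ravel()
    flat_case = np.repeat(np.arange(len(cases)), n_candidates)

    matches = {i: [] for i in range(len(cases))}
    used = set()
    # Visit candidate pairs from smallest to largest distance.
    for pos in np.argsort(flat_dist, kind="stable"):
        i, j = int(flat_case[pos]), int(flat_ctrl[pos])
        if j not in used and len(matches[i]) < n_matches:
            matches[i].append(j)
            used.add(j)
    return matches


# Made-up 1-D data: two cases, five controls, two controls wanted per case.
cases = np.array([[0.0], [6.0]])
controls = np.array([[1.0], [2.5], [5.0], [10.0], [50.0]])
matches = match_without_replacement(cases, controls, n_matches=2, n_candidates=4)
print(matches)
```

In this toy data control 1 is the second-nearest neighbor of both cases. It goes to case 0, which is closer, and case 1 falls back to control 3. Note that a case can end up with fewer than `n_matches` controls if all of its candidates are taken first, so the result should be checked for short lists.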
>
> Best,
>
> Randy
>
> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <
> jakevdp at cs.washington.edu> wrote:
>
>> On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com>
>> wrote:
>>
>>> Hello to the scikit-learn community!
>>>
>>> I am doing case-control matching for an electronic health records study.
>>> My question is: is it possible to run scikit-learn's NearestNeighbors
>>> without replacement? That is, can I match the treated group to the
>>> untreated group without re-using any of the untreated data points? If so,
>>> how? By default, it matches with replacement. I know this because I
>>> tested it on some of my own data.
>>>
>>> The code I used is in the confirmed answer here:
>>> https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching
>>>
>>> Thanks so much in advance,
>>>
>>
>> No, pairwise matching without replacement is not implemented within
>> scikit-learn's nearest neighbors routines.
>>
>> It seems like an algorithm you would have to think carefully about,
>> because the number of possible pairings grows factorially with the number
>> of points, and I don't think it's true that choosing the nearest available
>> neighbor of points in sequence will guarantee that you find the optimal
>> configuration. You'd also have to carefully define what you mean by
>> "optimal": are you seeking to minimize the sum of all distances? The sum
>> of squared distances? The maximum distance? The results would change
>> depending on the objective you define. And you'd probably have to figure
>> out some way to avoid searching that space exhaustively in order to
>> calculate the result in a reasonable amount of time for your data.
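
For sum-type objectives specifically, one-to-one matching is the linear assignment problem, which can be solved in polynomial time without enumerating pairings. The following is a small sketch, assuming SciPy is available and using made-up 2-D points, of how the choice between "sum of distances" and "sum of squared distances" can change the answer. It uses `scipy.optimize.linear_sum_assignment` and `scipy.spatial.distance.cdist`; the data and variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Made-up 2-D covariates: two cases and two candidate controls.
cases = np.array([[0.0, 0.0], [-1.0, 3.0]])
controls = np.array([[0.0, 0.0], [3.0, 0.0]])

D = cdist(cases, controls)   # D[i, j] = Euclidean distance from case i to control j

# Each call returns (row_indices, col_indices) of a one-to-one assignment
# that minimizes the sum of the selected cost-matrix entries.
rows_sum, cols_sum = linear_sum_assignment(D)       # minimize sum of distances
rows_sq, cols_sq = linear_sum_assignment(D ** 2)    # minimize sum of squared distances

print("sum of distances    :", cols_sum.tolist(), "max =", D[rows_sum, cols_sum].max())
print("sum of squared dist.:", cols_sq.tolist(), "max =", D[rows_sq, cols_sq].max())
```

Minimizing the plain sum keeps the exact match for case 0 and pairs case 1 with control 1 at distance 5. Minimizing the squared sum swaps the pairs, so that both distances are moderate (3 and about 3.16), which also gives the smaller maximum distance. For one-to-many matching, one option is to repeat each case row in the cost matrix once per control wanted, though a dense cost matrix may become too large for many thousands of controls.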
>>
>> You might look into the literature on propensity score matching; I think
>> that's one area where this kind of neighbors-without-replacement algorithm
>> is often used.
>>
>> Best,
>>    Jake
>>
>>
>>>
>>> --
>>> *Randall J. Ellis, B.S.*
>>> PhD Student, Biomedical Science, Mount Sinai
>>> Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
>>> Cell: (954)-260-9891 <(954)%20260-9891>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>