fastest way to find the intersection of n lists of sets

John Nagle nagle at animats.com
Sun Apr 29 20:08:23 EDT 2007


Prateek wrote:
>>     For the above example, it's worth sorting lists_of_sets by the
>>length of the sets, and doing the short ones first.
> 
> 
> Thanks. I thought so - I'm doing just that using a simple Decorate-
> Sort-Undecorate idiom.
> 
> 
>>     How big are the sets?  If they're small, but you have a lot of
>>them, you may be better off with a bit-set representation, then
>>using AND operations for intersection.  If they're huge (tens of millions
>>of entries), you might be better off doing sorts and merges on the
>>sets.
> 
> 
> I have either 2 or 3 sets (never more) which can be arbitrarily large.
> Most of them are small (between 0 and few elements - say less that 5).
> A few will be megamonstrous ( > 100,000 items)
> 
> 
>>     When you ask questions like this, it's helpful to give some
>>background.  We don't know whether this is a homework assignment, or
>>some massive application that's slow and you need to fix it, even
>>if it requires heavy implementation effort.
>>
> 
> Its definitely not a homework assignment - its part of a commercial
> database query engine. Heavy implementation effort is no problem.
> 
> Prateek

    If you're intersecting a set of 5 vs a set of 100,000, the
intersection time won't be the problem.  That's just five lookups.
It's building a set of 100,000 items that may be time consuming.

    Does the big set stay around for a while, or do you have to pay
that cost on each query?

    Those really aren't big data sets on modern machines.

					John Nagle



More information about the Python-list mailing list