fastest way to find the intersection of n lists of sets

John Nagle nagle at animats.com
Sun Apr 29 19:24:55 EDT 2007


James Stroud wrote:
> Prateek wrote:
> 
>> I have 3 variable length lists of sets. I need to find the common
>> elements in each list (across sets) really really quickly.
>>
>> Here is some sample code:
>>
>> # Doesn't make sense to union the sets - we're going to do
>> intersections later anyway
>> l1 = reduce(operator.add, list(x) for x in l1)
>> l2 = reduce(operator.add, list(x) for x in l2)
>> l3 = reduce(operator.add, list(x) for x in l3)
>>
>> # Should I do this in two steps? Maybe by intersecting the two
>> shortest lists first?
>> s = frozenset(l1) & frozenset(l2) & frozenset(l3)
>>
>> I'm assuming frozensets are (somehow) quicker than sets because
>> they're immutable.
>>
>> Any code suggestions? Maybe using something in the new fancy-schmancy
>> itertools module?
>>
>> Thanks,
>> Prateek
>>
> 
> I don't understand why you cast to list. I would propose:
> 
> lists_of_sets = [l1, l2, l3]
> 
> reduce(set.intersection, (reduce(set.union, x) for x in lists_of_sets))
> 
> Since this is simplest, I'm guessing it should be fastest because I'm 
> also guessing that set impelmentation is as optimized as list--I think 
> this would hold true especially for large sets with sparse overlap 
> between lists. So it might be reasonable consider the content of your 
> sets and lists and write your code based on the content or on assuming a 
> particular composition.

     Python sets are hashes, like dictionaries, not trees.  Intersection
is implemented by iterating over the smallest set and trying all its keys
on the other set.  The Python implementation compares the length of two
sets being intersected.  This is OK (it's about O(N log N), maybe better).

     For the above example, it's worth sorting lists_of_sets by the
length of the sets, and doing the short ones first.

     How big are the sets?  If they're small, but you have a lot of
them, you may be better off with a bit-set representation, then
using AND operations for intersection.  If they're huge (tens of millions
of entries), you might be better off doing sorts and merges on the
sets.

     When you ask questions like this, it's helpful to give some
background.  We don't know whether this is a homework assignment, or
some massive application that's slow and you need to fix it, even
if it requires heavy implementation effort.

					John Nagle



More information about the Python-list mailing list