[Python-ideas] exclusively1, common, exclusively2 = set1 - set2, set1 & set2, set2 - set1

Thu Jul 4 04:33:47 CEST 2013

On 04/07/13 11:41, Juancarlo Añez wrote:
> On Wed, Jul 3, 2013 at 7:46 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com>wrote:
>
>> def partition(setx, sety):
>>      xonly, xandy, yonly = set(), set(), set()
>>      for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]:
>>          for val in set1:
>>              if val in set2:
>>                  xandy.add(val)
>>              else:
>>                  setn.add(val)
>>      return xonly, xandy, yonly
>>
>
>
> I don't understand why that can be more efficient that using the built-in
> operations:
>
> def partition(setx, sety):
>      common = setx & sety
>      return setx - common, common, sety - common
>
>
> I assume that the built-in operations traverse over the set with the
> smallest size, and preallocate the result to that size.

I don't think you can preallocate sets (or dicts) that way, since the offset of each item can depend on what items were already added in which order. So you can't know how many gaps to leave between items ahead of time, or where to put them.

The obvious algorithm for set1 - set2 would be:

def diff(set1, set2):
     result = set()
     for item in set1:
         if item not in set2: result.add(item)
     return result

Are you suggesting this instead?

def diff(set1, set2):
     if len(set1) <= len(set2):
         result = set()
         for item in set1:
             if item not in set2: result.add(item)
     else:
         result = set1.copy()
         for item in set2:
             if item in set1: result.remove(item)
     return result

Either way, you still have to traverse set1 in full, either explicitly, or when you copy it. The overhead of copying the set when you might end up removing all the items suggests to me that there is no benefit in iterating over the smaller set.

Likewise, set2 - set1 has to traverse set2 in full, regardless of which is smaller.

set1 & set2 has to check every element of both sets. So that's four traversals in total, each set getting traversed twice.

Oscar's version above only traverses each set once, making two in total. However, being in pure Python, it's probably an order of magnitude slower than calling the C set methods. That's my guess, I leave it to someone else to actually time the code and see :-)

-- 
Steven