itertools.intersect?

David M. Wilson dw at botanicus.net
Thu Jun 11 00:03:10 EDT 2009


On Jun 10, 11:24 pm, David Wilson <d... at botanicus.net> wrote:
> Hi,
>
> During a fun coding session yesterday, I came across a problem that I
> thought was already solved by itertools, but on investigation it seems
> it isn't.
>
> The problem is simple: given one or more ordered sequences, return
> only the objects that appear in each sequence, without reading the
> whole set into memory. This is basically an SQL many-many join.
>
> I thought it could be accomplished through recursively embedded
> generators, but that approach failed in the end. After posting the
> question to Stack Overflow[0], Martin Geisler proposed a wonderfully
> succinct and reusable solution (see below, or pretty printed at the
> Stack Overflow URL).
>
> It is my opinion that this particular implementation is a wonderful
> and incredibly valid use of iterators, and something that could be
> reused by others, certainly least not myself again in the future. With
> that in mind I thought it, or something very similar, would be a great
> addition to the itertools module.
>
> My question then is, are there better approaches to this? The heapq-
> based solution at the Stack Overflow page is potentially more useful
> still, for its ability to operate on orderless sequences, but in that
> case, it might be better to simply listify each sequence, and sort it
> before passing to the ordered-only functions.
>
> Thanks,
>
> David.
>
> Stack Overflow page here:
>
> http://stackoverflow.com/questions/969709/joining-a-set-of-ordered-in...
>
> Sweet solution:
>
>     import operator
>
>     def intersect(sequences):
>         """Compute intersection of sequences of increasing integers.
>
>         >>> list(intersect([[1,   100, 142, 322, 12312],
>         ...                 [2,   100, 101, 322, 1221],
>         ...                 [100, 142, 322, 956, 1222]]))
>         [100, 322]
>
>         """
>         iterators = [iter(seq) for seq in sequences]
>         last = [iterator.next() for iterator in iterators]
>         indices = range(len(iterators))
>         while True:
>             # The while loop stops when StopIteration is raised. The
>             # exception will also stop the iteration by our caller.
>             if reduce(operator.and_, [l == last[0] for l in last]):
>                 # All iterators contain last[0]
>                 yield last[0]
>                 last = [iterator.next() for iterator in iterators]
>
>             # Now go over the iterators once and advance them as
>             # necessary. To stop as soon as the smallest iterator we
>             # advance each iterator only once per loop iteration.
>             for i in indices[:-1]:
>                 if last[i] < last[i+1]:
>                     last[i] = iterators[i].next()
>                 if last[i] > last[i+1]:
>                     last[i+1] = iterators[i+1].next()


I found my answer: Python 2.6 introduces heap.merge(), which is
designed exactly for this.

Thanks all,


David.



More information about the Python-list mailing list