itertools.intersect?

Terry Reedy tjreedy at udel.edu
Thu Jun 11 14:54:13 EDT 2009


Jack Diederich wrote:
> On Thu, Jun 11, 2009 at 12:03 AM, David M. Wilson<dw at botanicus.net> wrote:
> [snip]
>> I found my answer: Python 2.6 introduces heapq.merge(), which is
>> designed exactly for this.
> 
> Thanks, I knew Raymond added something like that but I couldn't find
> it in itertools.
> That said .. it doesn't help.  Aside, heapq.merge fits better in
> itertools (it uses heaps internally but doesn't require them to be
> passed in).  The other function that almost helps is
> itertools.groupby() and it doesn't return an iterator so is an odd fit
> for itertools.
> 
> More specifically (and less curmudgeonly) heapq.merge doesn't help for
> this particular case because you can't tell where the merged values
> came from.  You want all the iterators to yield the same thing at once
> but heapq.merge muddles them all together (but in an orderly way!).
> Unless I'm reading your tokenizer func wrong it can yield the same
> value many times in a row.  If that happens you don't know if four
> "The"s are once each from four iterators or four times from one.

David is looking to intersect sorted lists of document numbers, with
duplicates removed, in order to find the documents that contain worda and
wordb and wordc, and so on.  But you are right that duplicates are a
possible fly in the ointment, to be removed before merging.
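
To make that concrete, here is a minimal sketch of one way it could be
done, assuming every input is already sorted.  The names unique_sorted
and intersect_sorted are made up for illustration; they are not in
itertools or heapq.  Once each input has its duplicates removed, a value
that shows up n times in the merged stream must have come once from each
of the n inputs, so where the merged values came from no longer matters.

```python
from heapq import merge
from itertools import groupby


def unique_sorted(iterable):
    """Drop adjacent duplicates from an already-sorted iterable."""
    return (key for key, _ in groupby(iterable))


def intersect_sorted(*iterables):
    """Yield the values that occur in every one of the sorted iterables.

    Hypothetical helper: assumes each input is sorted in ascending order.
    """
    n = len(iterables)
    # After de-duplication each input contributes a value at most once.
    merged = merge(*(unique_sorted(it) for it in iterables))
    for value, group in groupby(merged):
        # A run of length n means all n inputs contained the value.
        if sum(1 for _ in group) == n:
            yield value


# Made-up document numbers for three words.
worda = [1, 2, 2, 5, 7, 9]
wordb = [2, 3, 5, 5, 9]
wordc = [0, 2, 5, 8, 9, 10]
common = list(intersect_sorted(worda, wordb, wordc))
```

Everything stays lazy, so the inputs can be generators as well as lists.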



