Syncing up iterators with gaps

Tim Chase python.list at tim.thechases.com
Wed Sep 28 15:10:01 EDT 2016


I've got several iterators sharing a common key in the same order and
would like to iterate over them in parallel, operating on all items
with the same key.  I've simplified the data a bit here, but it would
be something like

  data1 = [ # key, data1
    (1, "one A"),
    (1, "one B"),
    (2, "two"),
    (5, "five"),
    ]

  data2 = [ # key, data2
    (1, "uno"),
    (2, "dos"),
    (3, "tres x"),
    (3, "tres y"),
    (3, "tres z"),
    (4, "cuatro"),
    ]

  data3 = [ # key, data1, data2
    (2, "ii", "extra alpha"),
    (4, "iv", "extra beta"),
    (5, "v", "extra gamma"),
    ]

And I'd like to do something like

  for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
    for row in d1:
      process_a(common_key, row)
    for thing in d2:
      process_b(common_key, thing)
    for thing in d3:
      process_c(common_key, thing)

which would yield each common_key, along with just the portion of each
of those iterators sharing that key (note that gaps can happen, but the
sort order stays consistent across sources).  So in the above data, the
outer FOR loop would run 5 times with common_key taking the values
[1, 2, 3, 4, 5], and each of [d1, d2, d3] being an iterator over just
that key's rows (possibly empty).

My original method hauled everything into memory and made multiple
passes, filtering on the data.  However, the actual sources are CSV
files, some of which are hundreds of megabytes in size, and my system
was taking a bit of a hit.  So I was hoping for a way to do this with
only one complete pass through each source (since they're all sorted
by the common key).

It's somewhat similar to the *nix "join" command, only dealing with
N files rather than two.
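What I have in mind is something like this sketch (assuming every
source is already sorted by its first field; this version yields lists
per key rather than live iterators, which still keeps memory bounded to
one key's worth of rows):

  import heapq
  import itertools

  def magic_happens_here(*sources):
      # Tag each row with its source index so heapq.merge can
      # interleave the already-sorted streams by (key, source#);
      # each source is consumed exactly once, front to back.
      tagged = (
          ((row[0], i, row) for row in src)
          for i, src in enumerate(sources)
      )
      merged = heapq.merge(*tagged)  # still sorted by key
      for key, group in itertools.groupby(merged, key=lambda t: t[0]):
          # Bucket this key's rows back into one list per source;
          # a source with a gap at this key gets an empty list.
          buckets = [[] for _ in sources]
          for _, i, row in group:
              buckets[i].append(row)
          yield (key, *buckets)

With the sample data above, that should run the outer loop for keys
1 through 5, handing back empty lists where a source has a gap
(e.g. d3 at key 1).  But I'd welcome a cleaner or lazier approach.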

Thanks for any hints.

-tkc