Syncing up iterators with gaps

Wed Sep 28 20:20:26 EDT 2016

On Thu, 29 Sep 2016 05:10 am, Tim Chase wrote:

> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key.  I've simplified the data a bit here, but it would
> be something like
> 
>   data1 = [ # key, data1
>     (1, "one A"),
>     (1, "one B"),
>     (2, "two"),
>     (5, "five"),
>     ]

So data1 has keys 1, 1, 2, 5.

Likewise data2 has keys 1, 2, 3, 3, 3, 4 and data3 has keys 2, 4, 5.

(data3 also has *two* values, not one, which is an additional complication.)

> And I'd like to do something like
> 
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):

What's common_key? In particular, given that data1, data2 and data3 have the
first key each of 1, 1 and 2 respectively, how do you get:

> So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5]

I'm confused. Is common_key a *constant* [1, 2, 3, 4, 5] or are you saying
that it iterates over 1, 2, 3, 4, 5?

If the later, it sounds like you want something like a cross between
itertools.groupby and the "merge" stage of mergesort. You have at least
three sorted(?) iterators, representing the CSV files, let's say they
iterate over

data1 = [(1, "one A"), (1, "one B"), (2, "two"), (5, "five")]

data2 = [(1, "uno"), (2, "dos"), (3, "tres x"), (3, "tres y"), 
         (3, "tres z"), (4, "cuatro")]

data3 = [ (2, ("ii", "extra alpha")), (4, ("iv", "extra beta")),
          (5, ("v", "extra gamma"))]

Note that I have modified data3 so instead of three columns, (key value
value), it has two (key value) and value is a 2-tuple.

So first you want an iterator that does an N-way merge:

merged = [(1, "one A"), (1, "one B"), (1, "uno"), 
          (2, "two"), (2, "dos"), (2, ("ii", "extra alpha")), 
          (3, "tres x"), (3, "tres y"), (3, "tres z"),
          (4, "cuatro"), (4, ("iv", "extra beta")),
          (5, "five"), (5, ("v", "extra gamma")),
          ]

and then you can simply call itertools.groupby to group by the common keys:

1: ["one A", "one B", "uno"]
2: ["two", "dos", ("ii", "extra alpha")]
3: ["tres x", "tres y", "tres z"]
4: ["cuatro", ("iv", "extra beta")]
5: ["five", ("v", "extra gamma")]

and then you can extract the separate columns from each value.

You might find it easier to have *all* the iterators yield (key, tuple)
pairs, where data1 and data2 yield a 1-tuple and data3 yields a 2-tuple.

If you look on ActiveState, I'm pretty sure you will find a recipe from
Raymond Hettinger for a merge sort or heap sort or something along those
lines, which you can probably adapt for an arbitrary number of inputs.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.