fastest way to merge lists.

Sun Jan 9 01:38:11 EST 2000

[Roy Smith]
> I've got 5 (unsorted) lists of strings.  Each list has
> O(10^4) strings in it.  Within each list, each string appears
> exactly once.  Also, there is a large overlap between the lists.
>
> I want to form a single list which contains exactly one copy
> of each string from the combined lists.  In unix, this would be:
>
> cat list[12345] | sort | uniq

Thanks for the careful description!  It helps.

> I'm not yet sure if care if the final list is sorted or not.
> For now, I think it doesn't matter.  What's the fastest way
> to do that?
>
> It seems to me:
>
> d = {}
> for s in list1:
>    d[s] = 1
> for s in list2:
>    d[s] = 1
> for s in list3:
>    d[s] = 1
> for s in list4:
>    d[s] = 1
> for s in list5:
>    d[s] = 1
> list = d.keys()
> del d   # if I care about memory
>
> would be pretty good.  Anything better?

Can't beat it, asymptotically speaking -- it's a linear-time algorithm.
Practically speaking, Python's dicts are highly optimized too.  You can save
a little typing via e.g.

for s in list1, list2, list3, list4, list5:
    for x in s:
        d[x] = 1

If you're keen on small speedups, you can set the dict elements on the same
line as the inner "for" (maybe save a percent or two due to avoiding
SET_LINENO opcodes), or play with operator.setitem to obfuscate <0.7 wink>
the inner loop away.  That's all the tricks I can think of; I wouldn't
bother with either.

> Now, if I want it sorted, I could just sort the above list.

Yes.

> I suspect that if I knew ahead of time I wanted the final
> list sorted,  sorting each individual list then doing a linear
> merge might make more sense, if there was little overlap
> between the lists.

It's much more likely to pay off in another language.  There's no free lunch
here:  if you're sorting a total of N distinct elements via comparisons,
it's going to take order N*log2(N) time in all whether you do them all at
once or sort them in pieces and merge at the end.  It's true that

    N*log2(N)   [time to sort the combined list]

is "bigger than"

    5 * (N/5)*log2(N/5) = [time to sort 5 lists 5 times smaller]
    N*log2(N/5)

but the 5-way merge at the end of the latter takes an average of log2(5)
comparisons (don't think about that one too hard <wink>) at each of the N
merge steps, and

    N*log2(N/5) + N*log2(5) = N*(log2(N/5) + log2(5)) =
    N*log2(N)

That is, same thing!  The key point in *Python* is that list.sort() works at
C speed, but the 5-way merge will work at Python speed, so the latter
strategy will almost certainly be significantly slower.

> But, since I know they overlap greatly (i.e. len(list) will
> be approximately the same as len(listN)), I'm guessing doing
> it as described above and sorting the result might just work
> out faster.

Right, and no question about it in this case -- for large enough N, sorting
dominates the time (it's the only worse-than-linear part of the algorithm),
and sorting one list of len N/5 is going to go about about 5 times faster
than sorting 5 of them.

don't-even-need-to-time-it<wink>-ly y'rs  - tim