fastest way to merge lists.
Tim Peters
tim_one at email.msn.com
Sun Jan 9 01:38:11 EST 2000
[Roy Smith]
> I've got 5 (unsorted) lists of strings. Each list has
> O(10^4) strings in it. Within each list, each string appears
> exactly once. Also, there is a large overlap between the lists.
>
> I want to form a single list which contains exactly one copy
> of each string from the combined lists. In unix, this would be:
>
> cat list[12345] | sort | uniq
Thanks for the careful description! It helps.
> I'm not yet sure if care if the final list is sorted or not.
> For now, I think it doesn't matter. What's the fastest way
> to do that?
>
> It seems to me:
>
> d = {}
> for s in list1:
> d[s] = 1
> for s in list2:
> d[s] = 1
> for s in list3:
> d[s] = 1
> for s in list4:
> d[s] = 1
> for s in list5:
> d[s] = 1
> list = d.keys()
> del d # if I care about memory
>
> would be pretty good. Anything better?
Can't beat it, asymptotically speaking -- it's a linear-time algorithm.
Practically speaking, Python's dicts are highly optimized too. You can save
a little typing via e.g.
for s in list1, list2, list3, list4, list5:
for x in s:
d[x] = 1
If you're keen on small speedups, you can set the dict elements on the same
line as the inner "for" (maybe save a percent or two due to avoiding
SET_LINENO opcodes), or play with operator.setitem to obfuscate <0.7 wink>
the inner loop away. That's all the tricks I can think of; I wouldn't
bother with either.
> Now, if I want it sorted, I could just sort the above list.
Yes.
> I suspect that if I knew ahead of time I wanted the final
> list sorted, sorting each individual list then doing a linear
> merge might make more sense, if there was little overlap
> between the lists.
It's much more likely to pay off in another language. There's no free lunch
here: if you're sorting a total of N distinct elements via comparisons,
it's going to take order N*log2(N) time in all whether you do them all at
once or sort them in pieces and merge at the end. It's true that
N*log2(N) [time to sort the combined list]
is "bigger than"
5 * (N/5)*log2(N/5) = [time to sort 5 lists 5 times smaller]
N*log2(N/5)
but the 5-way merge at the end of the latter takes an average of log2(5)
comparisons (don't think about that one too hard <wink>) at each of the N
merge steps, and
N*log2(N/5) + N*log2(5) = N*(log2(N/5) + log2(5)) =
N*log2(N)
That is, same thing! The key point in *Python* is that list.sort() works at
C speed, but the 5-way merge will work at Python speed, so the latter
strategy will almost certainly be significantly slower.
> But, since I know they overlap greatly (i.e. len(list) will
> be approximately the same as len(listN)), I'm guessing doing
> it as described above and sorting the result might just work
> out faster.
Right, and no question about it in this case -- for large enough N, sorting
dominates the time (it's the only worse-than-linear part of the algorithm),
and sorting one list of len N/5 is going to go about about 5 times faster
than sorting 5 of them.
don't-even-need-to-time-it<wink>-ly y'rs - tim
More information about the Python-list
mailing list