[issue38938] Possible performance improvement for heapq.merge()

Dennis Sweeney report at bugs.python.org
Fri May 15 06:21:37 EDT 2020


Dennis Sweeney <sweeney.dennis650 at gmail.com> added the comment:

As Serhiy suggested, keeping the algorithm but making the Python implementation a generator again (reversing a change I recently made in PR 18427) gives another performance boost (although the unrolling takes many lines of code).

Timing the C implementation:
    
    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge;  iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

        On Master: Mean +- std dev: 66.9 ms +- 0.6 ms
        With PR:   Mean +- std dev: 17.3 ms +- 0.9 ms
    
Timing the Python Implementation in CPython:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

        On Master: Mean +- std dev: 385 ms +- 3 ms
        With PR:   Mean +- std dev: 250 ms +- 2 ms

Timing PyPy:

    pypy -m pyperf timeit -s "... from heapq import merge; ..." ...
    versus
    pypy -m pyperf timeit -s "... from heapq2 import merge; ..." ...

        Testing on PyPy 7.1.1: Mean +- std dev: 142 ms +- 2 ms
        This implementation:   Mean +- std dev: 38.0 ms +- 1.2 ms


=============================================================================


Another positive for this proposal, which I just discovered, is that object.__eq__ no longer needs to be called at each comparison:

    from heapq import merge
    from collections import deque

    # int subclass that counts every rich comparison made on its instances
    class Int(int):
        lt = eq = 0
        def __lt__(self, other):
            __class__.lt += 1
            return int.__lt__(self, other)

        def __eq__(self, other):
            __class__.eq += 1
            return int.__eq__(self, other)

    def comparisons(iterables):
        # Reset the counters, drain the merge, and report the totals.
        Int.lt = Int.eq = 0
        deque(merge(*iterables), maxlen=0)
        return Int.lt, Int.eq

    no_overlap = comparisons(
        # (0..999), (1_000..1_999), (2_000..2_999), ...
        map(Int, range(x, x+1_000))
        for x in range(0, 16_000, 1_000)
    )

    interleaved = comparisons(
        # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
        map(Int, range(x, 16_000, 16))
        for x in range(16)
    )

    print("No overlap: {:,} lt; {:,} eq".format(*no_overlap))
    print("Interleaved: {:,} lt; {:,} eq".format(*interleaved))


Before:

    No overlap: 65,004 lt; 65,004 eq
    Interleaved: 64,004 lt; 64,004 eq

After:

    No overlap: 32,000 lt; 0 eq
    Interleaved: 63,968 lt; 0 eq

This comes from the way tuples are compared: items are scanned pairwise for equality, and then the first unequal pair is compared again with the requested operator. Using the positional information in the tree with the logic

    yield (right if right < left else left)

requires only one rich comparison, while breaking ties with the stored index and the logic 
    
    yield (b if (b, b_index) < (a, a_index) else a)

requires two arbitrary rich comparisons (first b == a, then b < a). This can be somewhat mitigated with the logic

    if a_index < b_index:
        yield (b if b < a else a)
    else:
        yield (a if a < b else b)

...but this is another branch in the innermost loop.
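
That tuple behavior is easy to observe directly. A small demonstration (the Noisy wrapper here is hypothetical, purely for instrumentation):

    class Noisy:
        # Hypothetical wrapper that reports which rich comparisons run.
        def __init__(self, value):
            self.value = value

        def __eq__(self, other):
            print("__eq__ called")
            return self.value == other.value

        def __lt__(self, other):
            print("__lt__ called")
            return self.value < other.value

    Noisy(1) < Noisy(2)            # prints: __lt__ called
    (Noisy(1), 0) < (Noisy(2), 1)  # prints: __eq__ called, then __lt__ called

The bare comparison costs one rich comparison; the tuple comparison costs two, which is exactly the difference the counts above show.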


==============================================================================


I also played with a "Tree of Losers" method[1] here[2], and on the same benchmarks I got:

    CPython (C): Mean +- std dev: 22.7 ms +- 0.2 ms
        (slower than in PR 18427)
    CPython (Pure Python): Mean +- std dev: 197 ms +- 9 ms
        (faster -- likely because of fewer int operations)
    PyPy: Mean +- std dev: 38.8 ms +- 0.8 ms
        (about the same)

Maybe some more optimizations could be made to that code.
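
For reference, here is a minimal sketch of the Tree of Losers idea, as an illustration of the structure rather than the code at [2] (losers_merge and _Top are hypothetical names, and the _Top sentinel assumes real items return NotImplemented when compared against a foreign type):

    class _Top:
        # Sentinel that compares greater than any real item; it marks an
        # exhausted input.
        def __lt__(self, other):
            return False
        def __gt__(self, other):
            return True

    _TOP = _Top()

    def losers_merge(*iterables):
        its = [iter(it) for it in iterables]
        k = len(its)
        if k == 0:
            return
        # Pad the leaf count to a power of two with exhausted slots.
        n = 1
        while n < k:
            n *= 2
        # Leaves hold (item, source_index) pairs.
        leaves = [(_TOP, i) for i in range(n)]
        for i, it in enumerate(its):
            leaves[i] = (next(it, _TOP), i)

        def match(a, b):
            # Return (winner, loser), breaking ties toward the lower
            # source index with the branching logic shown earlier.
            if a[1] < b[1]:
                return (b, a) if b[0] < a[0] else (a, b)
            return (a, b) if a[0] < b[0] else (b, a)

        # tree[1:n] remember the LOSER of the match played at each
        # internal node; winners propagate upward during the build.
        tree = [None] * n
        w = [None] * n + leaves
        for j in range(n - 1, 0, -1):
            w[j], tree[j] = match(w[2 * j], w[2 * j + 1])
        champion = w[1]

        while champion[0] is not _TOP:
            item, idx = champion
            yield item
            # Pull the next item from the winning source, then replay
            # the matches on the leaf-to-root path only.
            challenger = (next(its[idx], _TOP), idx)
            j = (n + idx) // 2
            while j >= 1:
                challenger, tree[j] = match(challenger, tree[j])
                j //= 2
            champion = challenger

    # print(list(losers_merge([1, 4], [2, 5], [0, 3])))  # [0, 1, 2, 3, 4, 5]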

The two approaches should be "isomorphic" in some sense: both should do the same number of comparisons on the same data. But I'll emphasize the differences:

Tree of Losers:
    - Data structure: Loser Tree (does not satisfy heap invariant)
    - Nodes: (item, index) pairs
    - Tree has n nodes
    - To get the next item: do exchanges from leaf to root, swapping as you find better items
    - Comparisons: branching logic based on index comparison

PR 18427 "Tree of winners":
    - Data structure: binary heap (always satisfies heap invariant)
    - Nodes: just the items
    - Tree has 2n-1 nodes (more eagerly evaluated)
    - To get the next item: replace parent with better child, root to leaf
    - Comparisons: right < left (simple)

Personally, I find the Tree of Winners approach more intuitive, but the Tree of Losers approach seems to be the one that comes up more in the literature.
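
And a matching sketch of the Tree of Winners idea (again an illustration, not the unrolled code from PR 18427; winners_merge and _SENTINEL are hypothetical names; this sketch replays matches leaf-to-root rather than root-to-leaf, which should involve the same comparisons):

    _SENTINEL = object()   # placeholder for an exhausted input

    def winners_merge(*iterables):
        iterators = [iter(it) for it in iterables]
        n = len(iterators)
        if n == 0:
            return
        # Round up to a power of two so the tree is complete.  Slots
        # tree[size:] are the leaves (padded with sentinels); each slot
        # in tree[1:size] holds the winner of the match played below it.
        size = 1
        while size < n:
            size *= 2
        tree = [_SENTINEL] * (2 * size)
        for i, it in enumerate(iterators):
            tree[size + i] = next(it, _SENTINEL)

        def winner(left, right):
            # The simple "right < left" comparison; sentinels always lose.
            if right is _SENTINEL:
                return left
            if left is _SENTINEL:
                return right
            return right if right < left else left

        # Play the initial matches bottom-up.
        for i in range(size - 1, 0, -1):
            tree[i] = winner(tree[2 * i], tree[2 * i + 1])

        while tree[1] is not _SENTINEL:
            item = tree[1]
            yield item
            # The yielded object is stored along the whole winning path,
            # so identity checks lead back to the leaf that supplied it.
            i = 1
            while i < size:
                i = 2 * i if tree[2 * i] is item else 2 * i + 1
            # Refill that leaf and replay the matches up to the root.
            tree[i] = next(iterators[i - size], _SENTINEL)
            while i > 1:
                i //= 2
                tree[i] = winner(tree[2 * i], tree[2 * i + 1])

    # print(list(winners_merge([1, 3, 5], [2, 4], [0])))  # [0, 1, 2, 3, 4, 5]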

[1] https://en.wikipedia.org/wiki/K-way_merge_algorithm
[2] https://github.com/sweeneyde/cpython/tree/replacement-selection

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38938>
_______________________________________

