Mapping with contiguous ranges of keys

Peter Otten __peter__ at web.de
Sat Dec 17 04:31:40 EST 2016


Steve D'Aprano wrote:

> I have experimented with two solutions, and hope somebody might be able to
> suggest some improvements. Attached is the test code I ran, suggestions
> for improving the performance will be appreciated.

If there was an attachment to this text, it didn't make it to the mailing
list or the newsgroup.

> I decided on these two data structures:
> 
> (1) The naive solution: don't worry about duplicate values, or the fact
> that keys come in contiguous groups, and just store each individual key
> and value in a dict; this is faster but uses more memory.
> 
> (2) The clever solution: use a pair of lists, one holding the starting
> value of each group of keys, and the other holding the common values. This
> saves a lot of memory, but is slower.
> 
> A concrete example might help. Suppose I have 15 keys in five groups:
> 
> D = {0: 10,
>      1: 20, 2: 20,
>      3: 30, 4: 30, 5: 30,
>      6: 40, 7: 40, 8: 40, 9: 40,
>      10: 50, 11: 50, 12: 50, 13: 50, 14: 50}
> 
> (Remember, in reality I could have as many as a million or two keys. This
> is just a simple toy example.)
> 
> Instead of using a dict, I also set up a pair of lists:
> 
> L = [0, 1, 3, 6, 10, 15]  # starting value of each group
> V = [10, 20, 30, 40, 50]  # value associated with each group
> 
> Note that the first list has one extra item, namely the number one past
> the final group.
> 
> I can do a key look-up using either of these:
> 
>     D[key]
> 
>     V[bisect.bisect_right(L, i) - 1]
> 
> 
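
For anyone following along, here's the toy example as a quick, runnable
sanity check (untested against real data, of course; names D, L, V as in
the quoted text) showing that both lookups return the same values:

import bisect

D = {0: 10,
     1: 20, 2: 20,
     3: 30, 4: 30, 5: 30,
     6: 40, 7: 40, 8: 40, 9: 40,
     10: 50, 11: 50, 12: 50, 13: 50, 14: 50}

L = [0, 1, 3, 6, 10, 15]  # starting key of each group, plus one past the end
V = [10, 20, 30, 40, 50]  # value shared by each group

for key in D:
    assert D[key] == V[bisect.bisect_right(L, key) - 1]
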
> I tested the memory consumption and speed of these two solutions with
> (approximately) one million keys. I found:
> 
> - the dict solution uses a lot more memory, about 24 bytes per key,
> compared to the pair of lists solution, which is about 0.3 bytes per key;
> 
> - but on my computer, dict lookups are about 3-4 times faster.

Only three to four times? You basically get that from a no-op function call:

$ python3 -m timeit -s 'd = {1: 2}; k = 1' 'd[k]'
10000000 loops, best of 3: 0.0935 usec per loop
$ python3 -m timeit -s 'd = {1: 2}; k = 1' 'd[int(k)]'
1000000 loops, best of 3: 0.304 usec per loop

Even adding a dummy value to V to go from

>     V[bisect.bisect_right(L, i) - 1]

to

V[bisect.bisect_right(L, i)]

might be visible in your benchmark.
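
I. e. something like this (untested sketch with the toy data; None is an
arbitrary placeholder):

import bisect

L = [0, 1, 3, 6, 10, 15]
V = [None, 10, 20, 30, 40, 50]  # dummy entry at index 0

key = 7
value = V[bisect.bisect_right(L, key)]  # 40, without the "- 1"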

> Any suggestions for improving the speed of the binary search version, or
> the memory consumption of the dict?

Depending on the usage pattern, you might try bisect combined with an LRU
cache.
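
A minimal sketch of that idea (untested; lookup is just an illustrative
name, and the maxsize would need tuning to your access pattern):

import bisect
from functools import lru_cache

L = [0, 1, 3, 6, 10, 15]  # group starts plus sentinel, as above
V = [10, 20, 30, 40, 50]

@lru_cache(maxsize=2**16)  # picking a good size is part of the experiment
def lookup(key):
    return V[bisect.bisect_right(L, key) - 1]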

If runs of four or more nearby keys are common, you can remember the current
span:

# untested sketch; relies on the sentinel entry at the end of L
import bisect

start = stop = 0
for key in data:
    if start <= key < stop:
        pass  # key falls in the cached span; reuse the previous value
    else:
        index = bisect.bisect_right(L, key)
        start, stop = L[index - 1], L[index]  # bounds of the group holding key
        value = V[index - 1]

    # use value

(You might also fill V with (value, start, stop) tuples.)
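
Roughly like this (untested; W is just a throwaway name for that tuple
list, with L, V and data as above):

import bisect

W = [(value, L[i], L[i + 1]) for i, value in enumerate(V)]  # (value, start, stop)

start = stop = 0
for key in data:
    if not (start <= key < stop):
        value, start, stop = W[bisect.bisect_right(L, key) - 1]
    # use value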

> By the way: the particular pattern of groups in the sample code (groups of
> size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again) is
> just a demo. In my real data, the sizes of the groups are all over the
> place, in an unpredictable pattern.
> 
> 
> 
> Thanks in advance.