Mapping with contiguous ranges of keys

Steve D'Aprano steve+python at pearwood.info
Fri Dec 16 09:27:43 EST 2016


On Fri, 16 Dec 2016 04:06 am, Steve D'Aprano wrote:

> I have some key:value data where the keys often are found in contiguous
> ranges with identical values.
[...]


Thank you to everyone who gave suggestions!

I have experimented with two solutions, and hope somebody might be able to
suggest some improvements. Attached is the test code I ran; suggestions for
improving its performance will be appreciated.

I decided on these two data structures:

(1) The naive solution: don't worry about duplicate values, or the fact that
keys come in contiguous groups, and just store each individual key and
value in a dict; this is faster but uses more memory.

(2) The clever solution: use a pair of lists, one holding the starting key
of each group, and the other holding the common value of that group. This
saves a lot of memory, but is slower.

A concrete example might help. Suppose I have 15 keys in five groups:

D = {0: 10,
     1: 20, 2: 20,
     3: 30, 4: 30, 5: 30,
     6: 40, 7: 40, 8: 40, 9: 40,
     10: 50, 11: 50, 12: 50, 13: 50, 14: 50}

(Remember, in reality I could have as many as a million or two keys. This is
just a simple toy example.)

Instead of using a dict, I also set up a pair of lists:

L = [0, 1, 3, 6, 10, 15]  # starting key of each group
V = [10, 20, 30, 40, 50]  # value associated with each group

Note that the first list has one extra item, namely the number one past the
final key of the last group.
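
For concreteness, here is a minimal sketch of how the two lists might be
built. The function name make_ranges is mine (it is not from the attached
test code), and it assumes the keys are the consecutive integers 0, 1, 2
and so on:

def make_ranges(mapping):
    # Build (starts, values) from a dict whose keys are 0, 1, 2, ...
    # and whose values fall in contiguous runs.
    starts, values = [], []
    previous = object()  # sentinel that compares unequal to any value
    for key in sorted(mapping):
        if mapping[key] != previous:
            starts.append(key)
            values.append(mapping[key])
            previous = mapping[key]
    starts.append(len(mapping))  # one past the final key
    return starts, values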

I can do a key look-up using either of these:

    D[key]

    V[bisect.bisect_right(L, key) - 1]

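As a small, self-contained example of the binary search version (lookup is
my name for the helper, not something from the test code):

import bisect

L = [0, 1, 3, 6, 10, 15]
V = [10, 20, 30, 40, 50]

def lookup(key):
    # Find the group whose range contains key; assumes 0 <= key < L[-1].
    return V[bisect.bisect_right(L, key) - 1]

assert lookup(0) == 10   # first group
assert lookup(7) == 40   # middle of the fourth group
assert lookup(14) == 50  # last key of the last group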

I tested the memory consumption and speed of these two solutions with
(approximately) one million keys. I found:

- the dict solution uses a lot more memory, about 24 bytes per key, compared
to the pair of lists solution, which is about 0.3 bytes per key;

- but on my computer, dict lookups are about 3-4 times faster.

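In case the attachment doesn't come through, the comparison ran along these
lines (this is a rough sketch of the kind of measurement, not the attached
code itself; getsizeof counts only the containers, not the ints inside, so
treat its figures as comparative):

import bisect
import sys
import timeit

# Toy data: roughly one million keys in contiguous runs, with run
# lengths cycling through 1..50 to mimic the demo pattern.
D = {}
L, V = [], []
value = size = 0
while len(D) < 10**6:
    value += 10
    size = size % 50 + 1
    L.append(len(D))
    V.append(value)
    for _ in range(size):
        D[len(D)] = value
L.append(len(D))  # one past the final key

# Memory of the containers themselves.
print('dict: ', sys.getsizeof(D))
print('lists:', sys.getsizeof(L) + sys.getsizeof(V))

# Speed: time a lookup with each structure.
key = len(D) // 2
print('dict:  ', timeit.timeit('D[key]', globals=globals()))
print('bisect:', timeit.timeit(
    'V[bisect.bisect_right(L, key) - 1]', globals=globals()))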

Any suggestions for improving the speed of the binary search version, or the
memory consumption of the dict?

By the way: the particular pattern of groups in the sample code (groups of
size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again) is just
a demo. In my real data, the sizes of the groups are all over the place, in
an unpredictable pattern.



Thanks in advance.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


