Mapping with contiguous ranges of keys

John Pote johnpote at jptechnical.co.uk
Fri Dec 16 18:15:11 EST 2016


On 16/12/2016 14:27, Steve D'Aprano wrote:
> (2) The clever solution: use a pair of lists, one holding the starting value
> of each group of keys, and the other holding the common values. This saves
> a lot of memory, but is slower.
>
> A concrete example might help. Suppose I have 15 keys in five groups:
>
> D = {0: 10,
>       1: 20, 2: 20,
>       3: 30, 4: 30, 5: 30,
>       6: 40, 7: 40, 8: 40, 9: 40,
>       10: 50, 11: 50, 12: 50, 13: 50, 14: 50}
>
> (Remember, in reality I could have as many as a million or two keys. This is
> just a simple toy example.)
>
> Instead of using a dict, I also set up a pair of lists:
>
> L = [0, 1, 3, 6, 10, 15]  # starting value of each group
> V = [10, 20, 30, 40, 50]  # value associated with each group
I misread the problem (skipped the comment about keys 4 - 99) and 
assumed there might be gaps between the contiguous blocks, so I thought 
of the list structure

[  ( (firstKeyN, lastKeyN), "value" ),
   ...
]

At the cost of some extra memory, keeping the first and last key 
numbers in tuples in the L list would mean there is only one L lookup, 
at the expense of two additional tuple lookups. Are lookups in small 
tuples quicker than one lookup in a large list? If not, stick with the 
simple L and V you show above.

I've never used any form of tree structure; perhaps someone else could 
comment on the performance of balanced trees as compared to simple 
lists. Would the insertion cost of keeping the tree balanced be too high?

As for speeding up access with only the L and V lists, a binary search 
must be close to optimal unless specialised knowledge about the 
distribution of the contiguous group sizes can be exploited. But you 
say there is no pattern to the group sizes.
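
For reference, the plain two-list lookup from your post, runnable as 
is (valid for keys in range(0, 15); anything outside that would need a 
guard):

import bisect

L = [0, 1, 3, 6, 10, 15]   # starting key of each group, plus one past the end
V = [10, 20, 30, 40, 50]   # value for each group

def lookup(key):
    return V[bisect.bisect_right(L, key) - 1]

assert lookup(0) == 10
assert lookup(7) == 40
assert lookup(14) == 50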

Another thought is to have a smaller pre-index list: search it first to 
find the range of L indexes the key lies in. If the pre-index list had 
1000 entries, each entry would cover 1/1000 of the L list, which 
narrows the binary search space in L considerably. The cost of 
something like this is keeping the pre-index list up to date when new 
keys are added, plus the extra time to code and test it.

preList struct [ ( (firstKeyN, lastKeyN), (firstLindex, lastLindex) ),
                 ...
               ]
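
A minimal sketch of that two-level idea, assuming the L and V lists 
above and using a plain slice of every STEP-th starting key as the 
pre-index (STEP and the helper names are made up, and it assumes 
L[0] <= key < L[-1], like the original lookup):

import bisect

STEP = 1000

def build_preindex(L):
    return L[::STEP]              # every STEP-th starting key

def lookup(key, L, V, pre):
    j = bisect.bisect_right(pre, key) - 1        # coarse search
    lo = j * STEP                                # window of L under pre[j]
    hi = min(lo + STEP, len(L))
    i = bisect.bisect_right(L, key, lo, hi) - 1  # fine search, narrowed
    return V[i]

Note that the two bisects together do about the same number of 
comparisons as one full binary search, so any win would come from the 
tiny pre-index staying hot in the cache rather than from fewer 
comparisons.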

This reminds me of Jackson's first two rules of optimisation: 
(1) don't do it; (2) don't do it yet.
Thanks for an interesting problem.
> Note that the first list has one extra item, namely the number one past the
> final group.
>
> I can do a key look-up using either of these:
>
>      D[key]
>
>      V[bisect.bisect_right(L, key) - 1]
>
>
> I tested the memory consumption and speed of these two solutions with
> (approximately) one million keys. I found:
>
> - the dict solution uses a lot more memory, about 24 bytes per key, compared
> to the pair of lists solution, which is about 0.3 bytes per key;
>
> - but on my computer, dict lookups are about 3-4 times faster.
>
>
> Any suggestions for improving the speed of the binary search version, or the
> memory consumption of the dict?
>
> By the way: the particular pattern of groups in the sample code (groups of
> size 1, 2, 3, ... up to 50, then repeating from 1, 2, 3, ... again) is just
> a demo. In my real data, the sizes of the groups are all over the place, in
> an unpredictable pattern.
>
>
>
> Thanks in advance.
>
>