fastest data structure for retrieving objects identified by (x,y) tuple?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Oct 3 21:58:16 EDT 2012


On Wed, 03 Oct 2012 18:30:24 -0400, Benjamin Jessup wrote:

> I have a group of objects identified by unique (x,y) pairs and I want to
> find out an object's "neighbors" in a matrix of size 2400 x 2400.
[...]
> There is either a neighbor, or a null value. I always know the (x,y)
> pair to check the neighbors of, so is doing,
>  >> obj = grid[x][y] #lists, doesn't scale with num of objects
> or,
>  >> obj = grid.get((x,y),None) #dictionary, scales with num of objects
> the fastest? I can't seem to find a conclusion by testing each alone,
> then in the full environment. Is it that, depending on the number of
> objects, each has an advantage?

Almost certainly not. Since you don't show how you tested each, I have no 
idea why you couldn't reach a conclusion.

Let's start off with the question you don't ask: which uses less memory? 
The answer is, the dict solution by far. Here is a test:

# populate a random matrix using both dict and list
adict = {}
alist = [[None]*2400 for i in range(2400)]
from random import randrange
for i in range(1000):
    x = randrange(2400)
    y = randrange(2400)
    adict[(x, y)] = "something"
    alist[x][y] = "something"

# compare memory use; getsizeof counts only the containers
# themselves, not the objects they refer to
import sys
print(sys.getsizeof(adict))
print(sys.getsizeof(alist) + sum(sys.getsizeof(L) for L in alist))


The actual sizes printed will depend on how sparse the matrices are, but 
for the sample above (about 1000 of the 5,760,000 cells filled, so very 
sparse), using Python 2.7, the figures I get are:

adict: 24712 bytes
alist: 23127324 bytes
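
(A pedantic note: getsizeof counts only the dict's own hash table, not 
the tuple keys, so if you want a fairer total for the dict you can add 
the keys in. It doesn't change the conclusion:

# add in the sizes of the ~1000 key tuples, which getsizeof(adict) omits
print(sys.getsizeof(adict) + sum(sys.getsizeof(k) for k in adict))

The two-element tuples add a few tens of kilobytes, still dwarfed by 
the 23 MB of lists.)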


Now let's test the speed:

# randomly select matrix coordinates to look-up
test_coords = []
for i in range(1000):
    x = randrange(2400)
    y = randrange(2400)
    test_coords.append((x, y))

# build some test code
from timeit import Timer
setup = "from __main__ import adict, alist, test_coords"
t1 = Timer("for p in test_coords: obj = adict.get(p)", setup)
t2 = Timer("for p in test_coords: obj = alist[p[0]][p[1]]", setup)

# run the test code
print(min(t1.repeat(number=10000, repeat=7)))
print(min(t2.repeat(number=10000, repeat=7)))


Again, on my system using Python 2.7, I get:
3.13823986053
2.97539305687

which shows that the two are very close, but the list solution is about 
5% faster. So in this situation, a list of lists uses roughly 900 times 
more memory than a dict, but look-ups are about 5% faster.

I would be very surprised if the timing results depended on the number of 
objects in the matrices.
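
If you'd rather check that than take my word for it, fill the structures 
much more densely and re-run the identical timers (their setup line picks 
up the mutated adict and alist from __main__). A sketch, reusing the 
names defined above:

# add roughly a million more objects, then repeat the timing
for i in range(1000000):
    x = randrange(2400)
    y = randrange(2400)
    adict[(x, y)] = "something"
    alist[x][y] = "something"

print(min(t1.repeat(number=10000, repeat=7)))  # dict, now ~a million entries
print(min(t2.repeat(number=10000, repeat=7)))  # list, unchanged in size

Dict lookups are O(1) on average, so the first number should barely move.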

In case you are not familiar with timeit, let me explain what I have done:

* I pre-select 1000 random coordinates.
* I write some test code inside a Timer object that looks up each of 
  those coordinates, plus some setup code.
* timeit runs the setup code once.
* Then it runs my test code 10000 times as a single trial, timing
  it as accurately as possible.
* It does seven trials, and I report the best of the seven.

The time printed is measured in seconds. In this case, I get 3 seconds 
per trial, or 3e-7 seconds = 0.3 microseconds per lookup.
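
To spell the arithmetic out, each trial runs the statement 10000 times 
and each statement does 1000 lookups:

# seconds per trial / (runs per trial * lookups per run)
print(3.0 / (10000 * 1000))  # => 3e-07 seconds, i.e. 0.3 microseconds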


> I know the fastest way to retrieve them would be to have them store
> pointers to their neighbors, then use those for retrieval.

How do you know that?

No offence, but if you can't even work out whether lookups in a dict or a 
list are faster, I can't imagine why you think you can intuit what the 
fastest way to retrieve the nearest neighbours would be.
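
For what it's worth, you can measure that claim the same way instead of 
intuiting it. Here is a minimal sketch, using a hypothetical Cell class 
(the name and attributes are invented for illustration) that stores 
direct references to its neighbours:

class Cell(object):
    # hypothetical object holding direct references to its neighbours
    __slots__ = ('north', 'south', 'east', 'west')
    def __init__(self):
        self.north = self.south = self.east = self.west = None

a = Cell()
a.north = Cell()

from timeit import Timer
setup = "from __main__ import a"
t3 = Timer("obj = a.north", setup)
# number=10000000 so each trial does the same 1e7 lookups as above
print(min(t3.repeat(number=10000000, repeat=7)))

Even if attribute access wins, you pay for it in set-up cost and in 
keeping four references per object consistent every time the grid 
changes; only measuring your actual workload will tell you whether that 
trade is worth it.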



-- 
Steven


