[Numpy-discussion] huge array calculation speed

Thu Jul 10 19:48:12 EDT 2008

On Jul 10, 2008, at 6:38 PM, Dan Lussier wrote:
> What I'm trying to do is to calculate the total total potential
> energy and coordination number of each atom within a relatively large
> simulation.

Anne Archibald already responded:
> you could try a three-dimensional grid (if your atoms aren't
> too clumpy) or octrees (if they are somewhat clumpy); kd-trees are
> probably overkill (since they're designed for higher-dimensional
> problems).

I implemented something like what you did in VMD about 14(!) years ago.
(VMD is a molecular structure visualization program.)  I needed to
find which atoms might be bonded to each other.  I assume you have
a cutoff value, which means you don't need to worry about the general
nearest-neighbor problem.

Molecules are nice because the distributions are not clumpy.  Atoms
don't get that close to nor that far from other atoms.  I implemented
a grid.  It goes something like this:

import collections

# Search within 3 A
d = 3.0

coordinates = [
   ("C1", 34.287,  50.970, 115.006),
   ("O1", 34.972,  51.144, 113.870),
   ("C2", 33.929,  52.255, 115.739),
   ("N2", 34.753,  52.387, 116.954),
   ("C3", 32.448,  52.219, 116.121),
   ("O3", 32.033,  50.877, 116.336),
   ("C4", 31.528,  52.817, 115.054),
   ("C5", 30.095,  53.013, 115.558),
   ("C6", 29.226,  53.835, 114.609),
   ("C7", 29.807,  55.217, 114.304),
   ("C8", 29.092,  55.920, 113.147),
   ("C9", 29.525,  57.375, 112.971),
   ("C10", 28.409,  58.267, 112.422),
   ("C11", 28.828,  59.734, 112.294),
   ("C12", 27.902,  60.542, 111.385),
   ("C13", 26.617,  60.996, 112.085),
   ("C14", 26.182,  62.401, 111.667),
   ("C15", 24.739,  62.731, 112.054),
   ("C16", 24.251,  64.046, 111.441),
   ("C17", 23.026,  64.624, 112.150),
   ("C18", 22.631,  66.007, 111.623),]

def dict_of_list():
   return collections.defaultdict(list)

def dict_of_dict():
   return collections.defaultdict(dict_of_list)

grid = collections.defaultdict(dict_of_dict)

for atom in coordinates:
   name,x,y,z = atom
   i = int(x/d)
   j = int(y/d)
   k = int(z/d)
   grid[i][j][k].append(atom)

query_name, query_x, query_y, query_z = coordinates[8]
query_i = int(query_x/d)
query_j = int(query_y/d)
query_k = int(query_z/d)

# Given the search distance 'd', only need to check cells up to 1  
unit away
within_names = set()
for i in range(query_i-1, query_i+2):
   for j in range(query_j-1, query_j+2):
     for k in range(query_k-1, query_k+2):
       for atom in grid[i][j][k]:
         name, x, y, z = atom
         print "Check", atom,
         query_d2 = (x-query_x)**2+(y-query_y)**2+(z-query_z)**2
         if query_d2 < d*d:
           print "Within!", query_d2
           within_names.add(name)
         else:
           print "Too far", query_d2

print len(within_names), "were close enough"

# Linear search to verify
count = 0
for name, x, y, z in coordinates:
     query_d2 = (x-query_x)**2+(y-query_y)**2+(z-query_z)**2
     if query_d2 < d*d:
         print "Within", name
         if name not in within_names:
            raise AssertionError(name)
         count += 1

if count != len(within_names):
     raise AssertionError("count problem")

You can also grab the KDTree from Biopython, which is implemented in C.

   http://www.biopython.org/DIST/docs/api/Bio.KDTree.KDTree'-module.html

It was designed for just this task.

				Andrew
				dalke at dalkescientific.com