[Numpy-discussion] numpy videos

Francesc Alted francesc at continuum.io
Tue Mar 13 12:24:11 EDT 2012


On Mar 13, 2012, at 7:31 AM, Sturla Molden wrote:

> On 12.03.2012 23:23, Abhishek Pratap wrote:
>> Super awesome. I love how the python community in general keeps the
>> recordings available for free.
>> 
>> @Adam : I do have some problems that I can hit numpy with, mainly
>> bigData based. So in summary I have millions/billions of rows of
>> biological data on which I want to run some computation but at the
>> same time have a capability to do quick lookup. I am not sure if numpy
>> will be applicable for quick lookups  by a string based key right ??
> 
> 
> Jason Kinser's book on Python for bioinformatics might be of interest. Though I don't always agree with his NumPy coding style.
> 
> As for "big data", it is a problem regardless of language. The HDF5 library might be of help (cf. PyTables or h5py, I actually prefer the latter).

Yes, however IMO PyTables adapts better to the OP's lookup use case.  For example, let's suppose a very simple key-value problem, where we need to locate a certain value by using a key.  Using h5py I get:

In [1]: import numpy as np

In [2]: N = 100*1000

In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")

In [4]: import h5py

In [5]: f = h5py.File('h5py.h5', 'w')

In [6]: d = f.create_dataset('sa', data=sa)

In [7]: time [val for val in d if val[0] == 'key500']
CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
Wall time: 29.25 s
Out[7]: [('key500', 500)]

Another option is to use fancy selection:

In [8]: time d[d['f0']=='key500']
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[8]: 
array([('key500', 500)], 
      dtype=[('f0', 'S8'), ('f1', '<i4')])

Hmm, time resolution is too poor here.  Let's use the %timeit magic:

In [9]: timeit d[d['f0']=='key500']
100 loops, best of 3: 9.3 ms per loop

which is much better.  But in this case you need to load the column d['f0'] completely into memory, and this is *not* what you want when you have large tables that do not fit in memory.
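
If you want to stay with h5py and still keep memory bounded, one workaround is to scan the dataset chunk by chunk.  This is just a sketch of that idea (the chunk size and helper name are mine, not part of the session above):

import h5py

def chunked_lookup(dset, key, chunksize=10000):
    # Scan the dataset slice by slice, so only one chunk is
    # in memory at any time (a poor man's out-of-core query).
    hits = []
    for start in xrange(0, len(dset), chunksize):
        block = dset[start:start + chunksize]  # reads just this slice from disk
        hits.extend(block[block['f0'] == key])
    return hits

f = h5py.File('h5py.h5', 'r')
print chunked_lookup(f['sa'], 'key500')

This keeps memory usage bounded, but it still scans the whole table, so it cannot compete with the indexed PyTables query shown below.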

Using PyTables:

In [10]: import tables

In [11]: ft = tables.openFile('pytables.h5', 'w')

In [12]: dt = ft.createTable(ft.root, 'sa', sa)

In [13]: time [val[:] for val in dt if val[0] == 'key500']
CPU times: user 0.04 s, sys: 0.01 s, total: 0.05 s
Wall time: 0.04 s
Out[13]: [('key500', 500)]

That's almost a 100x speed-up compared with h5py.  In addition, PyTables has specific machinery to optimize these queries by using numexpr behind the scenes:

In [14]: time [val[:] for val in dt.where("f0=='key500'")]
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.00 s
Out[14]: [('key500', 500)]

Again, time resolution is too poor here.  Let's use the %timeit magic:

In [15]: timeit [val[:] for val in dt.where("f0=='key500'")]
100 loops, best of 3: 2.36 ms per loop

This is an additional 10x speed-up.  In fact, this is almost as fast as performing the query using NumPy directly:

In [16]: timeit sa[sa['f0']=='key500']
100 loops, best of 3: 2.14 ms per loop

with the difference that PyTables uses an out-of-core paradigm (i.e. it does not need to load the datasets completely into memory).  And finally, PyTables supports true indexing capabilities, so that you do not have to read the complete dataset to get results:

In [17]: dt.cols.f0.createIndex()
Out[17]: 100000

In [18]: timeit [val[:] for val in dt.where("f0=='key500'")]
1000 loops, best of 3: 213 us per loop

which accounts for another 10x speedup.  Of course, this speed-up can be *much* larger for bigger datasets, especially those that do not fit in memory.  See:

http://pytables.github.com/usersguide/optimization.html#accelerating-your-searches

for a more detailed rationale and benchmarks on big datasets.
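
For reference, the indexed PyTables workflow from the session above can be condensed into a small standalone script (the file name is arbitrary; the API calls are the same ones used above):

import numpy as np
import tables

N = 100*1000
sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")

ft = tables.openFile('pytables_indexed.h5', 'w')
dt = ft.createTable(ft.root, 'sa', sa)
dt.cols.f0.createIndex()   # build the index on the key column
ft.flush()

# where() pushes the condition down to numexpr and uses the index,
# so only the matching rows are read from disk
print [row[:] for row in dt.where("f0 == 'key500'")]
ft.close()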

-- Francesc Alted





