[Numpy-discussion] numpy videos
Francesc Alted
francesc at continuum.io
Tue Mar 13 12:24:11 EDT 2012
On Mar 13, 2012, at 7:31 AM, Sturla Molden wrote:
> On 12.03.2012 23:23, Abhishek Pratap wrote:
>> Super awesome. I love how the python community in general keeps the
>> recordings available for free.
>>
>> @Adam : I do have some problems that I can hit numpy with, mainly
>> bigData based. So in summary I have millions/billions of rows of
>> biological data on which I want to run some computation but at the
>> same time have a capability to do quick lookup. I am not sure if numpy
>> will be applicable for quick lookups by a string based key right ??
>
>
> Jason Kinser's book on Python for bioinformatics might be of interest. Though I don't always agree with his NumPy coding style.
>
> As for "big data", it is a problem regardless of language. The HDF5 library might be of help (cf. PyTables or h5py, I actually prefer the latter).
Yes, however IMO PyTables adapts better to the OP's lookup use case. For example, suppose a very simple key-value problem, where we need to locate a certain value by its key. Using h5py I get:
In [1]: import numpy as np
In [2]: N = 100*1000
In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")
In [4]: import h5py
In [5]: f = h5py.File('h5py.h5', 'w')
In [6]: d = f.create_dataset('sa', data=sa)
In [7]: time [val for val in d if val[0] == 'key500']
CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
Wall time: 29.25 s
Out[7]: [('key500', 500)]
Another option is to use fancy selection:
In [8]: time d[d['f0']=='key500']
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[8]:
array([('key500', 500)],
      dtype=[('f0', 'S8'), ('f1', '<i4')])
Hmm, time resolution is too poor here. Let's use the %timeit magic:
In [9]: timeit d[d['f0']=='key500']
100 loops, best of 3: 9.3 ms per loop
which is much better. But in this case you need to load the whole d['f0'] column into memory, and this is *not* what you want when you have large tables that do not fit in memory.
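One way to keep the fancy-selection approach without holding a whole column in memory is to scan the dataset in fixed-size slices. What follows is only a minimal sketch for Python 3 and a current NumPy: `chunked_lookup` and `chunk_rows` are hypothetical names, not part of h5py, and the demo runs against an in-memory structured array as a stand-in. It should also work on an h5py dataset, since slicing one (`d[a:b]`) returns a NumPy array. Note that on Python 3 the 'S8' field holds bytes, so the key is a bytes literal.

```python
import numpy as np

# Rebuild the sample table from the session above (Python 3 version).
N = 100 * 1000
sa = np.empty(N, dtype=[("f0", "S8"), ("f1", "i4")])
sa["f0"] = [("key%d" % i).encode("ascii") for i in range(N)]
sa["f1"] = np.arange(N)


def chunked_lookup(dataset, key, chunk_rows=10000):
    """Return the rows whose 'f0' field equals key, reading chunk_rows at a time.

    dataset is anything that supports len() and slicing into a structured
    NumPy array (hypothetical helper, not an h5py API).
    """
    hits = []
    for start in range(0, len(dataset), chunk_rows):
        chunk = dataset[start:start + chunk_rows]  # only this slice is in memory
        hits.append(chunk[chunk["f0"] == key])
    if not hits:
        return dataset[0:0]
    return np.concatenate(hits)


result = chunked_lookup(sa, b"key500")
```

This bounds memory use, but it still reads every row of the dataset, so it does not replace a real index.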
Using PyTables:
In [10]: import tables
In [11]: ft = tables.openFile('pytables.h5', 'w')
In [12]: dt = ft.createTable(ft.root, 'sa', sa)
In [13]: time [val[:] for val in dt if val[0] == 'key500']
CPU times: user 0.04 s, sys: 0.01 s, total: 0.05 s
Wall time: 0.04 s
Out[13]: [('key500', 500)]
That's a speed-up of several hundred times over the h5py loop (0.04 s versus 29.25 s wall time). But, in addition, PyTables has specific machinery to optimize these queries by using numexpr behind the scenes:
In [14]: time [val[:] for val in dt.where("f0=='key500'")]
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.00 s
Out[14]: [('key500', 500)]
Again, time resolution is too poor here. Let's use the %timeit magic:
In [15]: timeit [val[:] for val in dt.where("f0=='key500'")]
100 loops, best of 3: 2.36 ms per loop
This is an additional 10x speed-up. In fact, this is almost as fast as performing the query using NumPy directly:
In [16]: timeit sa[sa['f0']=='key500']
100 loops, best of 3: 2.14 ms per loop
with the difference that PyTables uses an out-of-core paradigm (i.e. it does not need to load the datasets completely in memory). And finally, PyTables supports true indexing, so you do not have to read the complete dataset to get results:
In [17]: dt.cols.f0.createIndex()
Out[17]: 100000
In [18]: timeit [val[:] for val in dt.where("f0=='key500'")]
1000 loops, best of 3: 213 us per loop
which accounts for another 10x speed-up. Of course, this speed-up can be *much* larger for bigger datasets, and especially for those that do not fit in memory. See:
http://pytables.github.com/usersguide/optimization.html#accelerating-your-searches
for a more detailed rationale and benchmarks on big datasets.
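
To see why an index avoids reading the complete dataset, here is a conceptual sketch in pure Python: keep the keys sorted next to their row numbers, then binary-search them. This is a deliberate simplification and not how PyTables implements its indexes; `build_index` and `indexed_lookup` are hypothetical names used only for illustration.

```python
import bisect


def build_index(keys):
    """Return (sorted_keys, row_numbers) so a key can be located by binary search."""
    order = sorted(range(len(keys)), key=keys.__getitem__)
    sorted_keys = [keys[i] for i in order]
    return sorted_keys, order


def indexed_lookup(index, key):
    """Return the row numbers holding key, in O(log n) comparisons."""
    sorted_keys, order = index
    lo = bisect.bisect_left(sorted_keys, key)
    hi = bisect.bisect_right(sorted_keys, key)
    return [order[i] for i in range(lo, hi)]


keys = ["key%d" % i for i in range(100 * 1000)]
index = build_index(keys)          # paid once, like creating the index
rows = indexed_lookup(index, "key500")
```

Building the index costs a full sort up front, but each later lookup touches only a handful of keys instead of all 100000 rows.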
-- Francesc Alted