[Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

Francesc Alted faltet at gmail.com
Thu Jan 14 08:48:45 EST 2016


Well, maybe something like a simple class emulating a dictionary that
stores key-value pairs on disk would be more than enough.  Then you can use
whatever persistence layer you want (even HDF5, but not necessarily).
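
To make the idea concrete, here is a minimal sketch of such a dict-like
on-disk store (illustrative only, not the code from the gist mentioned
below); each key maps to a single .npz file holding the pair of arrays:

import os
import numpy as np

class DiskDict(object):
    """Minimal dict-like store: one .npz file per key (sketch only)."""

    def __init__(self, root, compress=False):
        self.root = root
        self.compress = compress
        if not os.path.isdir(root):
            os.makedirs(root)

    def _path(self, key):
        return os.path.join(self.root, "%s.npz" % key)

    def __setitem__(self, key, value):
        ints, floats = value  # the pair of NumPy arrays for this key
        save = np.savez_compressed if self.compress else np.savez
        save(self._path(key), ints=ints, floats=floats)

    def __getitem__(self, key):
        data = np.load(self._path(key))
        return [data["ints"], data["floats"]]

Usage is then just dict-like: d = DiskDict('__test');
d[0] = [np.arange(5), np.ones(5)]; ints, floats = d[0].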

As a demonstration, I did a quick and dirty implementation of such a
persistent key-value store (
https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  In it, the
KeyStore class (less than 40 lines long) is responsible for storing the
value (two arrays) under a key (a directory).  As I am quite a big fan of
compression, I implemented a couple of serialization flavors: one using the
.npz format (so no dependencies other than NumPy are needed) and the other
using the ctable object from the bcolz package (bcolz.blosc.org).  Here are
some performance numbers:

$ python key-store.py -f numpy -d __test -l 0
########## Checking method: numpy (via .npz files) ############
Building database.  Wait please...
Time (            creation) --> 1.906
Retrieving 100 keys in arbitrary order...
Time (               query) --> 0.191
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
75M     __test

So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
can compress data as well, so let's see how it goes:

$ python key-store.py -f numpy -d __test -l 9
########## Checking method: numpy (via .npz files) ############
Building database.  Wait please...
Time (            creation) --> 6.636
Retrieving 100 keys in arbitrary order...
Time (               query) --> 0.384
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
28M     __test

Ok, in this case we got almost a 3x compression ratio, which is not
bad.  However, the performance has degraded a lot.  Let's now use bcolz,
first in non-compressed mode:

$ python key-store.py -f bcolz -d __test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
############
Building database.  Wait please...
Time (            creation) --> 0.479
Retrieving 100 keys in arbitrary order...
Time (               query) --> 0.103
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
82M     __test

Without compression, bcolz takes a bit more (~10%) space than NPZ.
However, bcolz is actually meant to be used with compression on by default:

$ python key-store.py -f bcolz -d __test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
############
Building database.  Wait please...
Time (            creation) --> 0.487
Retrieving 100 keys in arbitrary order...
Time (               query) --> 0.098
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
29M     __test

So, the final disk usage is quite similar to NPZ, but bcolz can store and
retrieve data a lot faster.  Also, the decompression speed is on a par with
using no compression at all.  This is because bcolz uses Blosc behind the
scenes, which is much faster than zlib (used by NPZ), and sometimes faster
than a memcpy().  However, even though we are doing I/O against the disk,
this dataset is so small that it fits in the OS filesystem cache, so the
benchmark is actually measuring I/O at memory speeds, not disk speeds.
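
For reference, the bcolz flavor amounts to something like the sketch below
(hedged: it assumes bcolz's ctable/cparams/open API; names and layout are
illustrative, not the exact gist code).  Each key gets its own ctable
directory, and Blosc does the (de)compression transparently:

import bcolz

def store_bcolz(dirname, ints, floats, clevel=9, cname='blosclz'):
    # One on-disk ctable per key; clevel=0 disables compression
    ct = bcolz.ctable(columns=[ints, floats], names=['ints', 'floats'],
                      rootdir=dirname, mode='w',
                      cparams=bcolz.cparams(clevel=clevel, cname=cname))
    ct.flush()

def load_bcolz(dirname):
    ct = bcolz.open(rootdir=dirname)
    # [:] materializes each compressed column back into a NumPy array
    return [ct['ints'][:], ct['floats'][:]]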

In order to do a more realistic comparison, let's use a dataset that is
much larger than the amount of memory in my laptop (8 GB):

$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
############
Building database.  Wait please...
Time (            creation) --> 133.650
Retrieving 100 keys in arbitrary order...
Time (               query) --> 2.881
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
39G     /media/faltet/docker/__test

and now, with compression on:

$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
############
Building database.  Wait please...
Time (            creation) --> 145.633
Retrieving 100 keys in arbitrary order...
Time (               query) --> 1.339
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
12G     /media/faltet/docker/__test

So, we are still seeing the 3x compression ratio.  But the interesting
thing here is that the compressed version answers queries in about half the
time of the uncompressed one (~13 ms/query vs ~29 ms/query).  In this case
I was using an SSD (hence the low query times), so the compression advantage
is even more noticeable than in the memory-bound case above (as expected).
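
(For completeness, the per-query numbers above boil down to a timing loop
like this sketch; the actual benchmark lives in key-store.py from the gist,
and the store object stands for either flavor:)

import time
import random

def time_queries(store, nkeys, nqueries=100):
    keys = random.sample(range(nkeys), nqueries)  # arbitrary order, no repeats
    t0 = time.time()
    nelem = 0
    for k in keys:
        ints, floats = store[k]
        nelem += len(ints) + len(floats)
    elapsed = time.time() - t0
    print("Time (query) --> %.3f  (%.1f ms/query)" %
          (elapsed, 1e3 * elapsed / nqueries))
    return nelem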

But anyway, this is just a demonstration that you don't need heavy tools to
achieve what you want.  And as a corollary, (fast) compressors can save you
not only storage, but processing time too.

Francesc


2016-01-14 11:19 GMT+01:00 Nathaniel Smith <njs at pobox.com>:

> I'd try storing the data in hdf5 (probably via h5py, which is a more
> basic interface without all the bells-and-whistles that pytables
> adds), though any method you use is going to be limited by the need to
> do a seek before each read. Storing the data on SSD will probably help
> a lot if you can afford it for your data size.
>
> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <ryan at bytemining.com>
> wrote:
> > Hi,
> >
> > I have a very large dictionary that must be shared across processes and
> does not fit in RAM. I need access to this object to be fast. The key is an
> integer ID and the value is a list containing two elements, both of them
> numpy arrays (one has ints, the other has floats). The key is sequential,
> starts at 0, and there are no gaps, so the “outer” layer of this data
> structure could really just be a list with the key actually being the
> index. The lengths of each pair of arrays may differ across keys.
> >
> > For a visual:
> >
> > {
> > key=0:
> >         [
> >                 numpy.array([1,8,15,…, 16000]),
> >                 numpy.array([0.1,0.1,0.1,…,0.1])
> >         ],
> > key=1:
> >         [
> >                 numpy.array([5,6]),
> >                 numpy.array([0.5,0.5])
> >         ],
> > …
> > }
> >
> > I’ve tried:
> > -       manager proxy objects, but the object was so big that low-level
> code threw an exception due to format and monkey-patching wasn’t successful.
> > -       Redis, which was far too slow due to setting up connections and
> data conversion etc.
> > -       Numpy rec arrays + memory mapping, but there is a restriction
> that the numpy arrays in each “column” must be of fixed and same size.
> > -       I looked at PyTables, which may be a solution, but seems to have
> a very steep learning curve.
> > -       I haven’t tried SQLite3, but I am worried about the time it
> takes to query the DB for a sequential ID, and then translate byte arrays.
> >
> > Any ideas? I greatly appreciate any guidance you can provide.
> >
> > Thanks,
> > Ryan
>
>
>
> --
> Nathaniel J. Smith -- http://vorpus.org



-- 
Francesc Alted