[Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

Feng Yu rainwoodman at gmail.com
Thu Jan 14 13:51:48 EST 2016


Hi Ryan,

Did you consider packing the arrays into one (or two) giant arrays stored with mmap?

That way you only need to store the start & end offsets, and there is
no need to use a dictionary.
It may allow you to simplify some numerical operations as well.

To be more specific, you would keep four arrays:

start : numpy.intp, offset where each key's data begins
end   : numpy.intp, offset where each key's data ends

data1 : numpy.int32, all of the integer arrays concatenated
data2 : numpy.float64, all of the float arrays concatenated

Then your original access to the dictionary can be rewritten as

data1[start[key]:end[key]]
data2[start[key]:end[key]]
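
For example, here is a minimal sketch of the whole scheme (the toy data and
file names are only illustrative, and start/end would also need to be
persisted, e.g. with numpy.save):

import numpy as np

# Build once: concatenate the per-key arrays and record the offsets.
arrays_int = [np.array([1, 8, 15, 16000]), np.array([5, 6])]        # one entry per key
arrays_flt = [np.array([0.1, 0.1, 0.1, 0.1]), np.array([0.5, 0.5])]

lengths = np.array([len(a) for a in arrays_int], dtype=np.intp)
end = np.cumsum(lengths)
start = end - lengths

np.concatenate(arrays_int).astype(np.int32).tofile('data1.bin')
np.concatenate(arrays_flt).astype(np.float64).tofile('data2.bin')

# Later, in any process: memory-map the giant arrays read-only.
data1 = np.memmap('data1.bin', dtype=np.int32, mode='r')
data2 = np.memmap('data2.bin', dtype=np.float64, mode='r')

key = 0
ints_for_key   = data1[start[key]:end[key]]
floats_for_key = data2[start[key]:end[key]]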

Whether to wrap this in a dictionary-like object is just a matter of
taste -- depending on whether you like it raw or refined.
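
If you do want the dictionary feel, a tiny wrapper is enough (the class
name is just an illustration, building on the sketch above):

class PackedDict:
    def __init__(self, start, end, data1, data2):
        self.start, self.end = start, end
        self.data1, self.data2 = data1, data2

    def __len__(self):
        return len(self.start)

    def __getitem__(self, key):
        s, e = self.start[key], self.end[key]
        # Returns views into the memmapped arrays; nothing is copied.
        return [self.data1[s:e], self.data2[s:e]]

d = PackedDict(start, end, data1, data2)
ints, floats = d[1]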

If you need to apply some global transformation to the data, then
something like data2[...] *= 10 would work.

ufunc.reduceat(data1, ...) can be very useful as well (with some tricks
on start/end).
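
For instance, per-key sums come out in one vectorized call, assuming the
packing is contiguous and no key is empty (reduceat treats empty segments
specially):

per_key_sum = np.add.reduceat(data2, start)
per_key_mean = per_key_sum / (end - start)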

I was facing a similar issue a few years ago, and you may want to look
at this code (it wasn't very well written, I have to admit):

https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L362

Best,

- Yu

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <ryan at bytemining.com> wrote:
> Hi,
>
> I have a very large dictionary that must be shared across processes and does not fit in RAM. I need access to this object to be fast. The key is an integer ID and the value is a list containing two elements, both of them numpy arrays (one has ints, the other has floats). The key is sequential, starts at 0, and there are no gaps, so the “outer” layer of this data structure could really just be a list with the key actually being the index. The lengths of each pair of arrays may differ across keys.
>
> For a visual:
>
> {
> key=0:
>         [
>                 numpy.array([1,8,15,…, 16000]),
>                 numpy.array([0.1,0.1,0.1,…,0.1])
>         ],
> key=1:
>         [
>                 numpy.array([5,6]),
>                 numpy.array([0.5,0.5])
>         ],
> }
>
> I’ve tried:
> -       manager proxy objects, but the object was so big that low-level code threw an exception due to format and monkey-patching wasn’t successful.
> -       Redis, which was far too slow due to setting up connections and data conversion etc.
> -       Numpy rec arrays + memory mapping, but there is a restriction that the numpy arrays in each “column” must be of fixed and same size.
> -       I looked at PyTables, which may be a solution, but seems to have a very steep learning curve.
> -       I haven’t tried SQLite3, but I am worried about the time it takes to query the DB for a sequential ID, and then translate byte arrays.
>
> Any ideas? I greatly appreciate any guidance you can provide.
>
> Thanks,
> Ryan
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion


