[Numpy-discussion] Re: fast numpy.fromfile skipping data chunks

Andrea Cimatoribus Andrea.Cimatoribus at nioz.nl
Wed Mar 13 10:21:50 EDT 2013


I see that pytables deals with HDF5 data. It would be very nice if the data were in such a standard format, but that is not the case, and it cannot be changed.

________________________________________
From: numpy-discussion-bounces at scipy.org [numpy-discussion-bounces at scipy.org] on behalf of Frédéric Bastien [nouiz at nouiz.org]
Sent: Wednesday, 13 March 2013 15:03
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

Hi,

I would suggest that you look at pytables[1]. It uses a different file
format, but it seems to do exactly what you want and gives an object
that has a very similar interface to numpy.ndarray (but fewer
functions). You would just ask for the slices/indices that you want and
it returns a numpy.ndarray.
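
For reference, a minimal sketch of this approach (not from the original
message), assuming a recent PyTables version and a hypothetical HDF5 file
"data.h5" containing an array node "/signal":

import tables

# Open the file read-only; nothing is loaded into memory at this point.
with tables.open_file("data.h5", mode="r") as h5:
    signal = h5.root.signal    # on-disk array node with an ndarray-like interface
    subsampled = signal[::20]  # reads only every 20th record from disk
    # `subsampled` is now an ordinary in-memory numpy.ndarray.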

HTH

Frédéric

[1] http://www.pytables.org/moin

On Wed, Mar 13, 2013 at 9:54 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus
> <Andrea.Cimatoribus at nioz.nl> wrote:
>> Hi everybody, I hope this has not been discussed before; I couldn't find a solution elsewhere.
>> I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge and would make me run out of memory, I need to skip some records while reading (the data are recorded at high frequency, so basically I want to subsample).
>> At the moment I have come up with the code below, which I then compile with Cython. Despite the significant performance increase over the pure Python version, the function is still much slower than numpy.fromfile, and it only reads one kind of data (in this case uint32), since otherwise I do not know how to define the array type in advance. I have basically no experience with Cython or C, so I am a bit stuck. How can I make this more efficient and possibly more generic?
>
> If your data is stored as fixed-format binary (as it seems it is),
> then the easiest way is probably
>
> # Exploit the operating system's virtual memory manager to get a
> "virtual copy" of the entire file in memory
> # (This does not actually use any memory until accessed):
> virtual_arr = np.memmap(path, np.uint32, "r")
> # Get a numpy view onto every 20th entry:
> virtual_arr_subsampled = virtual_arr[::20]
> # Copy those bits into regular malloc'ed memory:
> arr_subsampled = virtual_arr_subsampled.copy()
>
> (Your data is probably large enough that this will only work if you're
> using a 64-bit system, because of address space limitations; but if
> you have data that's too large to fit into memory, then I assume
> you're using a 64-bit system anyway...)
>
> -n
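
If memory-mapping the whole file is not an option (for example, on a 32-bit
system with limited address space), a rough pure-numpy sketch of the
subsampling the original question asks for is shown below: read the file in
fixed-size blocks with numpy.fromfile and keep every 20th record. The file
name, dtype, step and block size here are assumptions, not from the thread.

import numpy as np

def read_subsampled(path, dtype=np.uint32, step=20, records_per_block=50000):
    # Read every `step`-th record from a flat binary file without loading
    # the whole file into memory.
    dtype = np.dtype(dtype)
    block = step * records_per_block  # a multiple of `step` keeps the
                                      # sampling aligned across blocks
    pieces = []
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=dtype, count=block)
            if chunk.size == 0:       # end of file reached
                break
            pieces.append(chunk[::step].copy())
    return np.concatenate(pieces) if pieces else np.empty(0, dtype=dtype)

# Example: arr = read_subsampled("records.bin", np.uint32, step=20)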
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion at scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


