[Numpy-discussion] About the npz format

onefire onefire.myself at gmail.com
Wed Apr 16 20:57:36 EDT 2014


Valentin Haenel, Bloscpack definitely looks interesting, but I need to take
a careful look first. I will let you know if I like it. Thanks for the
suggestion!

I think you and Nathaniel Smith misunderstood my questions (my fault, since
I did not explain myself well!).
First, Numpy's savez does not apply any compression by default: it simply
stores the npy data unchanged inside the zip archive. The documentation
says as much, and I can open the resulting file to confirm it.
Also, if you run the commands that I specified in my previous post, you can
see that the resulting files have sizes 400000080 (x.npy) and 400000194
(x.npz). The npy header takes 80 bytes (it actually needs less than that,
but it is padded so that the array data start at a multiple of 16 bytes).
The npz file that stores the same array takes 114 extra bytes (for the zip
file metadata), so the space overhead is pretty small.
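For anyone who wants to verify the 80-byte header, here is a minimal sketch
(assuming the npy version 1.0 layout, with x.npy produced by the np.save
call from my previous post):

import struct

with open("x.npy", "rb") as f:
    magic = f.read(6)                               # b'\x93NUMPY'
    major, minor = struct.unpack("BB", f.read(2))   # format version, e.g. 1, 0
    (header_len,) = struct.unpack("<H", f.read(2))  # 70 for this array

# 6 + 2 + 2 + 70 = 80 bytes: the dict describing dtype/shape is padded
# with spaces so that the array data start on a 16-byte boundary
print("npy v%d.%d, header takes %d bytes total" % (major, minor, 10 + header_len))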
What I cannot understand is why savez takes more than 10 times longer than
saving the same data to an npy file. The only explanation I could come up
with was the computation of the crc32, but it might be more than that...
This afternoon I found out about a Julia package
(https://github.com/fhs/NPZ.jl) for manipulating Numpy files. I did a few
tests and it seems to work correctly. Things get interesting when I run the
npy-npz comparison from Julia.
Here is the code that I used:

using NPZ

function write_npy(x)
    tic()
    npzwrite("data.npy", x)    # single array -> npy file
    toc()
end

function write_npz(x)
    tic()
    # Dict of names to arrays -> npz archive (Julia 0.3 Dict syntax)
    npzwrite("data.npz", (ASCIIString => Any)["data" => x])
    toc()
end

x = linspace(1, 10, 50000000)

write_npy(x)       # prints:  elapsed time: 0.417742163 seconds
write_npz(x)       # prints:  elapsed time: 0.882226675 seconds
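As a sanity check (not part of the timings above), the files written by
NPZ.jl load back correctly from Numpy; a minimal sketch:

import numpy as np

# data.npy and data.npz were written by the Julia code above
x_npy = np.load("data.npy")
x_npz = np.load("data.npz")["data"]
print(np.array_equal(x_npy, x_npz))    # expect: True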

The Julia timings (tested with Julia 0.3) are closer to what I would
expect. Notice that the time to save the npy file is very similar to the
one that I got with Numpy's save function (see my previous post), but the
"npz overhead" adds only about half a second.

So now I think there are two things going on:
1) It is wasteful to compute the crc32 at all. At a minimum, I would like
either the option to choose a different, faster checksum (like adler32) or
to turn checksumming off entirely (I prefer the second option: if I am
worried about the integrity of the data, I will likely compute the
sha512sum of the entire file anyway). A quick timing comparison of the two
checksums is sketched below.
2) The Python implementation is inefficient (to be honest, I just found out
about the Julia package and I cannot guarantee anything about its quality,
but computing a crc32 over 0.5 GB of data from C code takes less than a
second!). My guess is that the problem is in the zipfile module, but like I
said before, I do not know the details of what it is doing. A zipfile-only
experiment that isolates this is also sketched below.
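To put rough numbers behind point 1, here is a quick sketch (my own test
using Python's zlib, not the C code mentioned above) timing both checksums
over the same 0.4 GB of data:

import time
import zlib

import numpy as np

x = np.linspace(1, 10, 50000000)
buf = x.tostring()    # ~0.4 GB of raw bytes (tobytes() in newer numpy)

for name, checksum in [("crc32", zlib.crc32), ("adler32", zlib.adler32)]:
    start = time.time()
    checksum(buf)
    print("%s: %.3f s" % (name, time.time() - start))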
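And to probe point 2 independently of Numpy, one can time an uncompressed
store of the existing x.npy through the zipfile module directly (again only
a sketch; with ZIP_STORED no compression happens, so any time beyond raw
I/O plus the crc32 would point at zipfile's own overhead):

import time
import zipfile

start = time.time()
zf = zipfile.ZipFile("x_direct.zip", mode="w", compression=zipfile.ZIP_STORED)
zf.write("x.npy", arcname="data.npy")    # x.npy from the np.save test
zf.close()
print("zipfile store: %.2f s" % (time.time() - start))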

Let me know what you think.

Gilberto


On Wed, Apr 16, 2014 at 5:03 PM, Nathaniel Smith <njs at pobox.com> wrote:

> crc32 is extremely fast, and I think zip might use adler32 instead, which
> is even faster. OTOH compression is incredibly slow, unless you're using
> one of the 'just a little bit of compression' formats like blosc or lzo1.
> If your npz files are compressed, then this is certainly the culprit.
>
> The zip format supports storing files without compression. Maybe what you
> want is an option to use this with .npz?
>
> -n
> On 16 Apr 2014 20:26, "onefire" <onefire.myself at gmail.com> wrote:
>
>> Hi all,
>>
>> I have been playing with the idea of using Numpy's binary format as a
>> lightweight alternative to HDF5 (which I believe is the "right" way to go
>> if one does not have a problem with the dependency).
>>
>> I am pretty happy with the npy format, but the npz format seems to be
>> broken as far as performance is concerned (or I am missing something
>> obvious!). The following ipython session illustrates the issue:
>>
>> In [1]: import numpy as np
>>
>> In [2]: x = np.linspace(1, 10, 50000000)
>>
>> In [3]: %time np.save("x.npy", x)
>> CPU times: user 40 ms, sys: 230 ms, total: 270 ms
>> Wall time: 488 ms
>>
>> In [4]: %time np.savez("x.npz", data = x)
>> CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
>> Wall time: 7.7 s
>>
>> I can inspect the files to verify that they contain the same data, and I
>> can change the example, but this seems to always hold (I am running Arch
>> Linux, but I've done the test on other machines too): for bigger arrays,
>> the npz format seems to add an unbelievable amount of overhead.
>>
>> Looking at Numpy's code, it looks like the real work is being done by
>> Python's zipfile module, and I suspect that all the extra time is spent
>> computing the crc32. Am I correct in my assumption (I am not familiar
>> with zipfile's internals)? Or perhaps I am doing something really dumb
>> and there is an easy way to speed things up?
>>
>> Assuming that I am correct, my next question is: why compute the crc32 at
>> all? I mean, I know that it is part of what defines a "zip file", but is
>> it really necessary for an npz file to be a (compliant) zip file? If, for
>> example, I open the resulting npz file with a hex editor and insert a
>> bogus crc32, np.load will happily load the file anyway (Gnome's Archive
>> Manager will do the same). To me this suggests that the fact that npz
>> files are zip files is not that important.
>>
>> Perhaps people think that the ability to browse arrays and extract
>> individual ones like they would with a regular zip file is really
>> important, but reading the little documentation that I found, I got the
>> impression that npz files are zip files simply because that was the
>> easiest way to store multiple arrays in the same file. But my main point
>> is: it should be fairly simple to make npz files much more efficient with
>> simple changes like not computing checksums (or using a different
>> algorithm like adler32).
>>
>> Let me know what you think about this. I've searched around the internet,
>> and on places like Stackoverflow the standard answer seems to be: you are
>> doing it wrong, forget Numpy's format and start using hdf5! Please do not
>> give that answer. Like I said in the beginning, I am well aware of hdf5
>> and I use it in my "production code" (in C++). But I believe that there
>> should be a lightweight alternative (right now, to use hdf5 I need to
>> install the C library, the C++ wrappers, and the h5py library just to
>> play with the data from Python, which is a bit too heavy for my needs). I
>> really like Numpy's format (if anything, it makes me feel better knowing
>> that it is so easy to reverse engineer, while the hdf5 format is very
>> complicated), but the (apparent) poor performance of npz files is a deal
>> breaker.
>>
>> Gilberto