[Numpy-discussion] About the npz format

Valentin Haenel valentin at haenel.co
Fri Jul 4 09:49:54 EDT 2014


Sorry for the top-post, but should we add this as an issue on the
GitHub tracker? I'd like to revisit it this summer.

V-

* Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-18]:
> On 18.04.2014 18:29, Valentin Haenel wrote:
> > Hi,
> > 
> > * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> >> * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> >>> * Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-17]:
> >>>> On 17.04.2014 21:30, onefire wrote:
> >>>>> Thanks for the suggestion. I did profile the program before, just not
> >>>>> using Python.
> >>>>
> >>>> one problem of npz is that the zipfile module does not support streaming
> >>>> data in (or if it does now we aren't using it).
> >>>> So numpy writes the file uncompressed to disk and then zips it which is
> >>>> horrible for performance and disk usage.
> >>>
> >>> As a workaround, it may also be possible to write the temporary NPY
> >>> files to cStringIO instances and then use ``ZipFile.writestr`` with the
> >>> ``getvalue()`` of the cStringIO object. However, that approach may
> >>> require some memory. In Python 2.7, for each array: one copy inside the
> >>> cStringIO instance and then another copy when calling ``getvalue()`` on
> >>> the cStringIO object, I believe.
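
A minimal sketch of that workaround, in case it helps the discussion. It
assumes Python 3, where ``io.BytesIO`` stands in for cStringIO, and the
helper name ``savez_in_memory`` is made up for illustration (it is not
part of NumPy). Each array is held in memory twice for a moment: once in
the buffer and once in the bytes returned by ``getvalue()``.

```python
import io
import zipfile

import numpy as np
from numpy.lib import format as npy_format


def savez_in_memory(file, **arrays):
    """Write arrays to an npz-style zip without a temporary file on disk.

    Hypothetical helper for illustration only, not NumPy API.
    """
    with zipfile.ZipFile(file, mode="w",
                         compression=zipfile.ZIP_DEFLATED) as zipf:
        for key, val in arrays.items():
            buf = io.BytesIO()
            # Serialise one array in NPY format into the in-memory buffer.
            npy_format.write_array(buf, np.asanyarray(val))
            # getvalue() returns a second copy of the bytes; this is the
            # extra memory cost mentioned above.
            zipf.writestr(key + ".npy", buf.getvalue())


if __name__ == "__main__":
    out = io.BytesIO()
    savez_in_memory(out, a=np.arange(5), b=np.ones((2, 3)))
    out.seek(0)
    with np.load(out) as data:
        print(data.files)
```

The trade-off is disk traffic against peak memory, so this is probably
only attractive when the individual arrays are small compared to RAM.
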
> >>
> >> There is a proof-of-concept implementation here:
> >>
> >> https://github.com/esc/numpy/compare/feature;npz_no_temp_file
> > 
> > Anybody interested in me fixing this up (unit tests, API, etc..) for
> > inclusion?
> > 
> 
> I wonder if it would be better to instead use a fifo to avoid the memory
> doubling. Windows probably hasn't got them (exposed via python) but one
> can slap a platform check in front.
> Attached is a proof of concept without proper error handling (which is
> unfortunately the tricky part).

> From 472b4c0a44804b65d0774147010ec7a931a1c52d Mon Sep 17 00:00:00 2001
> From: Julian Taylor <jtaylor.debian at googlemail.com>
> Date: Thu, 17 Apr 2014 23:01:47 +0200
> Subject: [PATCH] use a pipe for savez
> 
> ---
>  numpy/lib/npyio.py | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/numpy/lib/npyio.py b/numpy/lib/npyio.py
> index 98b4b6e..baafa9d 100644
> --- a/numpy/lib/npyio.py
> +++ b/numpy/lib/npyio.py
> @@ -585,22 +585,19 @@ def _savez(file, args, kwds, compress):
>      zipf = zipfile_factory(file, mode="w", compression=compression)
>  
>      # Stage arrays in a temporary file on disk, before writing to zip.
> -    fd, tmpfile = tempfile.mkstemp(suffix='-numpy.npy')
> -    os.close(fd)
> -    try:
> +    import threading
> +    with tempfile.TemporaryDirectory() as td:
> +        fifoname = os.path.join(td, "fifo")
> +        os.mkfifo(fifoname)
>          for key, val in namedict.items():
>              fname = key + '.npy'
> -            fid = open(tmpfile, 'wb')
> -            try:
> -                format.write_array(fid, np.asanyarray(val))
> -                fid.close()
> -                fid = None
> -                zipf.write(tmpfile, arcname=fname)
> -            finally:
> -                if fid:
> -                    fid.close()
> -    finally:
> -        os.remove(tmpfile)
> +            def mywrite(pipe, val):
> +                with open(pipe, "wb") as wpipe:
> +                    format.write_array(wpipe, np.asanyarray(val))
> +            t = threading.Thread(target=mywrite, args=(fifoname, val))
> +            t.start()
> +            zipf.write(fifoname, arcname=fname)
> +            t.join()
>  
>      zipf.close()
>  
> -- 
> 1.9.1
> 
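
For anyone who wants to try the fifo idea outside of the NumPy tree, here
is a standalone sketch along the lines of the patch above. It assumes
Python 3 on a platform that provides ``os.mkfifo``. The names
``savez_via_fifo`` and ``_WriteOnly`` are made up for illustration. The
small wrapper is there because I am not sure ``ndarray.tofile`` behaves
well on a pipe; hiding everything except ``write()`` makes
``write_array`` fall back to plain ``write()`` calls. As in the patch,
error handling is left out: if either side fails before opening the
fifo, the other side will block.

```python
import os
import tempfile
import threading
import zipfile

import numpy as np
from numpy.lib import format as npy_format


class _WriteOnly:
    """Expose only write(), so write_array does not try ndarray.tofile."""

    def __init__(self, fobj):
        self._fobj = fobj

    def write(self, data):
        return self._fobj.write(data)


def savez_via_fifo(file, **arrays):
    """Stream arrays into a zip through a named pipe (hypothetical helper)."""
    # Platform check: named pipes are not exposed everywhere.
    if not hasattr(os, "mkfifo"):
        raise NotImplementedError("os.mkfifo is not available here")

    with zipfile.ZipFile(file, mode="w",
                         compression=zipfile.ZIP_DEFLATED) as zipf, \
            tempfile.TemporaryDirectory() as td:
        fifoname = os.path.join(td, "fifo")
        os.mkfifo(fifoname)
        for key, val in arrays.items():
            # Bind the array as a default argument so each thread gets
            # its own value.
            def feed(arr=np.asanyarray(val)):
                # Opening for writing blocks until the reader opens too.
                with open(fifoname, "wb") as wpipe:
                    npy_format.write_array(_WriteOnly(wpipe), arr)

            t = threading.Thread(target=feed)
            t.start()
            # ZipFile.write reads the fifo until the writer closes it.
            zipf.write(fifoname, arcname=key + ".npy")
            t.join()
```

Only one pipe buffer's worth of data is in flight at a time, so neither
the memory doubling nor the temporary NPY file on disk is needed.
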




