[Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp

Andras Deak deak.andris at gmail.com
Wed Apr 14 16:57:33 EDT 2021


On Wed, Apr 14, 2021 at 10:36 PM Joachim Wuttke <j.wuttke at fz-juelich.de> wrote:
>
> If argument fname of savetxt(fname, X, ...) ends with ".gz" then
> array X is not only converted to text, but also compressed using gzip.
>
> The gzip format [1] has a timestamp in its header. The Python module gzip.py [2]
> sets the timestamp according to an optional constructor argument
> "mtime". By default, the current time is used.
>
> This makes the file written by savetxt(*.gz, ...) non-deterministic.
> This is unexpected and confusing in a numerics context.
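
For illustration, a minimal sketch of the effect described above, using only
the standard-library gzip module. The helper name gzip_bytes and the sample
payload are made up for this example; fixed mtime values stand in for "the
current time" so that the result does not depend on when the code runs.

```python
import gzip
import io


def gzip_bytes(payload: bytes, mtime) -> bytes:
    """Hypothetical helper: gzip-compress payload in memory.

    mtime is passed straight to gzip.GzipFile; None would mean "now".
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


payload = b"1.0 2.0 3.0\n"  # stand-in for the text savetxt would write

a = gzip_bytes(payload, mtime=1)  # two different timestamps ...
b = gzip_bytes(payload, mtime=2)
c = gzip_bytes(payload, mtime=0)  # ... versus the same fixed timestamp
d = gzip_bytes(payload, mtime=0)

# Bytes 4..7 of the gzip header hold MTIME (RFC 1952), little-endian.
print(a == b)                            # False: only the timestamp differs
print(a[:4] == b[:4], a[8:] == b[8:])    # True True: the rest is identical
print(c == d)                            # True: mtime = 0 is reproducible
```

The data itself is untouched: gzip.decompress(a) and gzip.decompress(b) both
return the original payload.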

Related: same for np.savez https://github.com/numpy/numpy/issues/9439

András


> I let different versions of a program generate *.gz files, and ran
> the "diff" util over pairs of output files to check whether any bit
> had changed. To my surprise, confusion, and desperation, output
> had always changed, and kept changing when I ran unchanged versions
> of my program over and over again. So I learned the hard way that the
> *.gz files contain a timestamp.
>
> Regarding the module gzip.py, I submitted a pull request to improve
> the description of the optional argument mtime, and hint at the possible
> choice mtime = 0 that makes outputs deterministic [3].
>
> Regarding numpy, I'd propose a bolder measure:
> To let savetxt(fname, X, ...) store exactly the same information in
> compressed and uncompressed files, always invoke gzip with mtime = 0.
>
> I would like to follow up with a pull request, but I am unable to
> find out how numpy.savetxt is invoking gzip.
>
> Joachim
>
> [1] https://www.ietf.org/rfc/rfc1952.txt
> [2] https://docs.python.org/3/library/gzip.html
> [3] https://github.com/python/cpython/pull/25410
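
Until something like the proposal above is adopted, one possible user-side
workaround is to open the gzip stream by hand with mtime = 0 and pass the
open file handle to savetxt, which accepts a file handle in place of a
filename. This is only a sketch: the helper name savetxt_gz_deterministic is
hypothetical, the latin1 encoding is an assumption, and none of this is
meant to describe how numpy invokes gzip internally.

```python
import gzip
import io

import numpy as np


def savetxt_gz_deterministic(path, X, **kwargs):
    """Hypothetical helper: write X as gzip-compressed text with mtime = 0.

    Extra keyword arguments (fmt, delimiter, header, ...) go to np.savetxt.
    """
    with open(path, "wb") as raw:
        # mtime=0 fixes the header timestamp; filename="" additionally keeps
        # the file name out of the gzip header (a choice made for this sketch).
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            # savetxt writes text, so wrap the binary gzip stream.
            with io.TextIOWrapper(gz, encoding="latin1") as text:
                np.savetxt(text, X, **kwargs)
```

Two calls with the same array should then give byte-identical files, and
np.loadtxt(path) reads them back as usual.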