[Numpy-discussion] Behaviour of copy for structured dtypes with gaps

Marten van Kerkwijk m.h.vankerkwijk at gmail.com
Fri Apr 12 16:10:05 EDT 2019


It may be relevant at this point to mention that the padding bytes do *not*
get copied - so you get a blob with possibly quite a lot of uninitialized
data. If anything, that seems a recipe for unexpected results. Are there
non-contrived examples where you would *want* this uninitialized blob?

Francesc: looking at the PyTables issue, it seems to have been more about
not automatically removing padding (or at least having the option of not
doing so), rather than what the behaviour of a copy should be. I'm
definitely *not* suggesting that one shouldn't be able to add padding: that
would still be as simple as `array.astype(padded_dtype)`. Although in your
case that would still lead to hdf5 files that are all different, since the
padding is uninitialized memory. Which seems far from nice...

Indeed, the "astype" example suggests perhaps a different way to phrase the
issue: should copy behave as `astype(unpadded_dtype, copy=True)` or
should it just be `astype(same_dtype, copy=True)`? Note that the former is
tricky to do -- one has to get the unpadded dtype -- while the latter is easy.
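
To make the two options concrete, here is a minimal sketch. The field names,
offsets and itemsize are made up for illustration, and it assumes that
`repack_fields` from `numpy.lib.recfunctions` (NumPy 1.16 or later) is an
acceptable way to obtain the unpadded dtype; it is not necessarily what a
changed `copy` would do internally.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical padded layout: 9 bytes of real data in a 32-byte record.
padded_dtype = np.dtype({
    'names': ['a', 'z'],
    'formats': ['u1', 'f8'],
    'offsets': [0, 16],
    'itemsize': 32,
})
arr = np.zeros(4, dtype=padded_dtype)
arr['a'] = [1, 2, 3, 4]
arr['z'] = [0.5, 1.5, 2.5, 3.5]

# Option 1: copy with the same dtype, so the padding is kept.
same = arr.astype(padded_dtype, copy=True)

# Option 2: copy to an unpadded dtype; repack_fields derives it from the
# padded one by placing the fields back to back.
unpadded_dtype = rfn.repack_fields(padded_dtype)
compact = arr.astype(unpadded_dtype, copy=True)

print(same.dtype.itemsize, compact.dtype.itemsize)
```

The field values are the same in both copies; only the bytes between the
fields differ.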

My sense still is that the option that will be the least surprising to most
users is a copy that is most compact and does not have any padding. For
instance, think of someone loading a large binary file as a numpy memmap:
if they take one item, they can have a regular array with `mm['a'].copy()`,
but if they take two, `mm[['a', 'z']].copy()`, oops, out of memory! (Also,
remember that this used to work; that the user gets out of memory is I
think a serious regression, so the pain better be worth it!)
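
A rough sketch of the size difference, using an in-memory array as a stand-in
for the memmap so that it runs without a file. The `bulk` field and the record
layout are hypothetical, and the behaviour shown assumes NumPy 1.16 or later,
where multi-field indexing returns a view that keeps the original offsets.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical record: two small fields separated by a large one.
record = np.dtype([('a', 'f8'), ('bulk', 'f8', (100,)), ('z', 'f8')])
mm = np.zeros(10, dtype=record)  # stand-in for an np.memmap of a big file

one = mm['a'].copy()          # single field: a regular, compact array
two = mm[['a', 'z']].copy()   # two fields: the copy keeps the full itemsize

# Explicitly packing the fields gives the compact result.
compact = rfn.repack_fields(mm[['a', 'z']])

print(one.nbytes, two.nbytes, compact.nbytes)
```

With real data the ratio between `two.nbytes` and `compact.nbytes` scales with
however much of each record lies between and around the selected fields.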

-- Marten