[Numpy-discussion] copy on demand

Alexander Schmolck a.schmolck at gmx.net
Wed Jun 12 15:51:02 EDT 2002


Rick White <rlw at stsci.edu> writes:

> Here is what I see as the fundamental problem with implementing slicing
> in numarray using copy-on-demand instead of views.
> 
> Copy-on-demand requires the maintenance of a global list of all the
> active views associated with a particular array buffer.  Here is a
> simple example:
> 
>     >>> a = zeros((5000,5000))
>     >>> b = a[49:51,50]
>     >>> c = a[51:53,50]
>     >>> a[50,50] = 1
> 
> The assignment to a[50,50] must trigger a copy of the array b;
> otherwise b also changes.  On the other hand, array c does not need to
> be copied since its view does not include element 50,50.  You could
> instead copy the array a -- but that means copying a 100 Mbyte array
> while leaving the original around (since b and c are still using it) --
> not a good idea!

Sure, if one wants to perform only the *minimum* amount of copying, things can
get rather tricky, but wouldn't it be satisfactory for most cases if attempted
modification of the original triggered the delayed copying of the "views"
(lazy copies)?  In those cases where it isn't satisfactory the user could still
explicitly create real (i.e. alias-only) views.
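
To make the simple policy concrete, here is a rough sketch of what I have in
mind. It is not how Numeric or numarray work: the names CowArray and LazyCopy
are made up for illustration, NumPy serves as a stand-in for the array
package, and the array is smaller than in Rick's example. Any write to the
original first turns every outstanding lazy copy into a real copy, whether or
not it overlaps the modified element.

```python
import weakref

import numpy as np


class LazyCopy:
    """Hypothetical lazy copy of a slice: aliases the parent until detached."""

    def __init__(self, parent, index):
        self._data = parent._data[index]  # NumPy view, nothing copied yet
        self.detached = False

    def detach(self):
        # Make the delayed copy real, so later writes to the parent stay invisible.
        if not self.detached:
            self._data = self._data.copy()
            self.detached = True

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        self.detach()  # writing to the lazy copy must not touch the parent
        self._data[index] = value


class CowArray:
    """Hypothetical array whose slices are lazy copies rather than views."""

    def __init__(self, shape):
        self._data = np.zeros(shape)
        self._lazy = weakref.WeakSet()  # lazy copies still aliasing our buffer

    def __getitem__(self, index):
        lazy = LazyCopy(self, index)
        self._lazy.add(lazy)
        return lazy

    def __setitem__(self, index, value):
        # Simple policy: detach *every* outstanding lazy copy, overlapping or not.
        for lazy in list(self._lazy):
            lazy.detach()
        self._lazy.clear()
        self._data[index] = value


a = CowArray((500, 500))
b = a[49:51, 50]
c = a[51:53, 50]
a[50, 50] = 1
print(b[1], b.detached, c.detached)  # b keeps the old value; c got copied too
```

Note that c is copied although it does not overlap element 50,50. That is the
price of the simple bookkeeping, and it is also the place where a surprising
amount of copying can hide if the lazy copies are large.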

> 
> The bookkeeping can get pretty messy (if you care about memory usage,
> which we definitely do).  Consider this case:
> 
>     >>> a = zeros((5000,5000))
>     >>> b = a[0:-10,0:-10]
>     >>> c = a[49:51,50]
>     >>> del a
>     >>> b[50,50] = 1
> 
> Now what happens?  Either we can copy the array for b (which means two

``b`` and ``c`` are copied and then ``a`` is deleted.

What does numarray currently keep of a if I do something like the above or:

>>> b = a.flat[::-10000]
>>> del a

? 
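
To illustrate what I am asking, here is a sketch using NumPy as a stand-in. I
am assuming NumPy's view semantics here, numarray may well behave differently,
and the array is smaller than above. Since indexing .flat in NumPy returns a
copy, the sketch uses ravel() to get a strided view instead.

```python
import numpy as np

a = np.zeros((1000, 1000))
b = a.ravel()[::-10000]  # strided view; ravel() stands in for .flat
full_size = a.nbytes
del a

# b holds only a few elements, yet it still references the whole
# original buffer through its .base attribute.
print(b.size, b.nbytes, b.base.nbytes == full_size)
```

So with plain views, the tiny b keeps the entire buffer of the deleted a
alive.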

> copies of the huge (5000,5000) array exist, one used by c and the new
> version used by b), or we can be clever and copy c instead.
> 
> Even keeping track of the views associated with a buffer doesn't solve
> the problem of an array that is passed to a C extension and is modified
> in place.  It would seem that passing an array into a C extension would
> always require all the associated views to be turned into copies.
> Otherwise we can't guarantee that views won't be modified.

Yes -- but only if the C extension is destructive. In that case the user might
well be making a mistake in current Numeric if he has views and doesn't want
them to be modified by the operation (of course he might know that the in-place
operation does not affect the view(s) -- but wouldn't such cases be rather
rare?). If he *does* want the views to be modified, he would obviously have to
explicitly specify them as such in a copy-on-demand scheme; in the other
case he has most likely been prevented from making an error (and can
still explicitly use real views if he knows that the in-place operation on the
original will not have undesired effects on the "views").
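
A small sketch of the two outcomes, again with NumPy as a stand-in and with an
in-place multiply playing the role of the destructive C extension (both are
assumptions for illustration only). The explicit copy shows what a
copy-on-demand slice would end up holding after the operation.

```python
import numpy as np

a = np.arange(6.0)
view = a[2:4]             # real view: aliases a's buffer
snapshot = a[2:4].copy()  # what a copy-on-demand slice would end up holding

np.multiply(a, 10, out=a)  # stand-in for a destructive C extension

print(view, snapshot)
```

The view sees the scaled values, the snapshot still holds the old ones.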

> 
> This kind of state information with side effects leads to a system that
> is hard to develop, hard to debug, and really messes up the behavior of
> the program (IMHO).  It is *highly* desirable to avoid it if possible.

Sure, copy-on-demand is an optimization and optimizations always mess up
things. On the other hand, some optimizations also make "nicer" (e.g. less
error-prone) semantics computationally viable, so it's often a question
of ease and clarity of the implementation vs. ease and clarity of code
that uses it. I'm not denying that too much complexity in the implementation
also adversely affects users in the form of bugs and that in the particular
case of delayed copying the user can also be affected directly by
harder-to-understand resource usage behavior (e.g. a[0] = 1 triggering a
monstrous copying operation).

Just out of curiosity, has someone already asked the octave people how much
trouble it has caused them to implement copy on demand and whether
matlab/octave users in practice do experience difficulties because of the
harder-to-predict runtime behavior (I think, like matlab, octave does
copy-on-demand)?

> 
> This is not to deny that copy-on-demand (with explicit views available
> on request) would have some desirable advantages for the behavior of
> the system.  But we've worried these issues to death, and in the end
> were convinced that slices == views provided the best compromise
> between the desired behavior and a clean implementation.

If implementing copy-on-demand is too difficult and the resulting code
would be too messy then this is certainly a valid reason to compromise on the
current slicing behavior (especially since people like me who'd like to see
copy-on-demand are unlikely to volunteer to implement it :)


> 				Rick
> 
> ------------------------------------------------------------------
> Richard L. White    rlw at stsci.edu    http://sundog.stsci.edu/rick/
> Space Telescope Science Institute
> Baltimore, MD
> 
> 

alex

-- 
Alexander Schmolck     Postgraduate Research Student
                       Department of Computer Science
                       University of Exeter
A.Schmolck at gmx.net     http://www.dcs.ex.ac.uk/people/aschmolc/




