Avoiding referencing (was: Python sequences reference - problem?)

Fri Sep 20 05:07:16 EDT 2002

Thorsten Kampe wrote:

> * Alex Martelli
>> Just remembed that EVERYTHING works by reference, and what
>> could possibly be the problem...?
> 
> The problem is that we newbies sometimes don't want referencing:

Then you explicitly ask for a copy, just like an experienced
programmer would.

> #v+
> def part(seq, indices):
>     """ chop seq by indices """
>     partition = []
>     
>     for slice in indices:
>         partition.append(seq[:slice])
>         del seq[:slice]
>     partition.append(seq)
>     return partition
> #v-

If you don't want your function to affect its argument seq,
you start by making a copy.  Since you'd also like to ensure
you work with a list (whatever sequence or other iterable
you may have been passed), as you need a mutable sequence
for the "del seq[:slice]" statement to work, you can kill
two birds with one stone -- change the start to:

def part(seq, indices):
    """ chop seq by indices """
    partition = []
    seq = list(seq)

and leave the rest unchanged.  The function would be a little
bit clearer if the names involved made it more obvious that
the sequence you receive as an argument, and the local copy
you slice up, are different things, though -- e.g.:

def part(input_seq, chop_sizes):
    """ chop input_seq into a list of lists of given sizes """
    partition = []
    seq_copy = list(input_seq)

    for size in chop_sizes:
        partition.append(seq_copy[:size])
        del seq_copy[:size]
    partition.append(seq_copy)
    return partition

Choosing names of variables for maximum clarity and readability
is always a delicate issue, of course -- here, I've tried to
change things around a bit, and edit the docstring too, to make
it clearer that the second argument is not about indices at all,
but rather about sizes (lengths, if you will) of the sub-lists.

A completely different approach would be: 

def part(input_seq, chop_sizes):
    """ chop input_seq into a list of lists of given sizes """
    partition = []
    next_start = 0

    for size in chop_sizes:
        last_start, next_start = next_start, next_start + size
        partition.append(input_seq[last_start:next_start])
    partition.append(input_seq[next_start:])
    return partition

It's hard to choose between the two approaches, but I like
this second one a tiny little bit better, personally.  I'm
not sure why I do, to be honest.  I hope it's not because
of the slight possibility that its performance is superior
(e.g., it's O(N), not O(N squared), when chop_sizes degrades
to [1]*len(input_seq)...) -- I have not measured performance
nor do I have the slightest indication that this function
part has any need for optimization.  No, I just find it more
natural to avoid changing data structures except where such
changes make things more obvious to me, I guess.  Here, with
the second approach, I can get a _slightly_ more "FP" feel
(not all that much, since partition and the scalars do
change:-), which may have something to do with my preferences!-)

The second approach does lend itself to a slight optimization,
should it be needed, by preallocating the result to its
known final length rather than relying on append:

def part(input_seq, chop_sizes):
    """ chop input_seq into a list of lists of given sizes """
    partition = [None] * (1 + len(chop_sizes)]
    next_start = 0

    for i in range(len(chop_sizes)):
        last_start, next_start = next_start, next_start + chop_sizes[i]
        partition[i] = input_seq[last_start:next_start]
    partition[-1] = input_seq[next_start:]
    return partition

but this does seem to involve a tiny loss of clarity wrt the
second one (as "for i in range(len(..." so often does, sigh),
therefore I wouldn't use it unless performance WAS important
(and, of course, after measuring the various alternatives!!!).

A more FP alternative would involve building first a list of
indices from the list of sizes, e.g. (dirty trick warning...):

    start = 0
    chop_indices = [ start for x in [0]+chop_sizes 
                     for start in (start+x,)] + [None]

and then just using it:

    return [input_seq[chop_indices[i]:chop_indices[i+1]
            for x in range(len(chop_sizes))]

but this fails the base test -- it's not as obvious and
simple and the others, mostly because Python lacks an
"expand" function to get the chop indices from the chop
sizes (whence the "for start in (start+x,)" dirty trick).

But, back to your post...:

> Obviously I did not pass 'foo' "by value" to function 'part' but by
> reference and obviously not only local variable 'seq' was modified in
> the function but "outer" variable 'foo', too.

Rather, an OBJECT was modified -- and the two names are
referring to the same object (as you didn't ask for a copy).

> So how to deal with that?

By asking for a copy.

> 1. "Just don't call function 'part' by with a variable, call it by
> value - part([11, 22, 33, 44, 55, 66, 77], [2, 3]) - and everything's
> okay".
> 
> Obviously no solution to me.

Obviously!

> 2. "If you don't want referencing, call it with a copy of 'foo' like:
> part(foo[:], [2, 3])"
> 
> Yes, but how do I know in advance that 'part' modifies 'foo'? I might
> not be the author of 'part'.

Indeed, part should not modify its arguments UNLESS it says it does
in its docstring.  Unless there are very good reasons for a function
to modify its arguments (in which case the function should normally
return None -- mixing arg modification and significant return values
can be confusing!!!), avoiding such mods is the default case.

> 3. "Rewrite function 'part' so that it works with a tuple as argument.
> Thereby you're avoiding all possible modifying of existing objects."
> 
> Yes, but I'm also losing the "extended" possibilities of lists:
> "list.append", "list.reverse", "list.index" etc. My code will be twice
> as long as it would be with list methods.

I think it might be nice for part to accept as arguments sequences
that are not lists -- as you see in the first version I propose,
making a copy and ensuring you have a list can be fruitfully
combined in a single call to 'list'.  As you see in the second
version, working directly with the argument, without modifying
it, doesn't expand the code size AT ALL in this specific case,
anyway -- but the first version is slightly more general, in that
it will accept ANY argument that's acceptable to built-in 'list'
(e.g., a file -- list(afileobject) is like afileobject.readlines()),
while the second version does require a sequence (which must
accept being sliced AND supply copies when sliced -- a Numeric
array would accept being sliced, BUT would supply slices that
are arrays sharing data with the original array... this MAY be
a problem in some cases, although performance will surely be
enhanced by avoiding copies, if the data be truly huge).

> Is there a way to "bypass" referencing _without_ disadvantages? I'm

Absolutely not.  Making a copy takes time: there is absolutely
no way around this!  Therefore, there's no way to "bypass
_without_" ANY disadvantages at all: as a minimum, the time
and memory you spend for the copy are a disadvantage in some
cases.  But, of course, you DO pay those prices _whenever_ you
use a language with copy semantics.  That's part of why passing
of array arguments doesn't use copy semantics in most languages --
in Fortran because all argument passing is by reference (while
all assignment is by copy), in C because arrays in particular
"decay" to pointers when mentioned.  Implicitly copying huge gobs
of data each and every time an array is passed as an argument
would be an unmitigated performance disaster for most cases.

> sure there is a "canonical" way to do this in Python!

The canonical way to copy any object is function copy of module
copy (assuming a shallow copy suffices, as is often the case).

But often when you copy you ALSO want something else -- like
here, for example, in the first version (turn any sequence into
a list, as well as copying it).  Thus, don't blindly follow the
"canonical" way: THINK about what you want and how best to
obtain it -- sometimes copy.copy is OK, sometimes an all-sequence
slice x[:], but often calls to built-ins list or tuple, depending
on what, exactly, you're trying to accomplish with your copy.

Alex