[Tutor] Cloning lists to avoid alteration of original list [example of refactoring]

Tue Jan 13 19:54:27 EST 2004

On Tue, 13 Jan 2004, Alistair McGowan wrote:

> class mergeTaxa: # geodispersal of taxon lists from colliding areas
> 	def __init__(self,m,y,clone1,clone2):
> 		self.mergetaxa=[] #empty list
> 		self.mergetaxa.append (name)#id of new merged area
> 		one = clone1
> 		two = clone2

[code cut]

Hi Alistair,

The code is a little long; it's a little hard for me to figure out what's
happening.  Near the bottom of this message, let's talk about how to
refactor parts of it --- that may help us make the intent of the code more
clear.

Let's take a look at the parameters that the initializer takes in:

> 	def __init__(self,m,y,clone1,clone2):

What is 'm', 'y', 'clone1', and 'clone2'?  Those parameters need to be
documented, especially because the names 'm' and 'y' aren't really that
descriptive.

Also, it looks like you're depending on some global variables.  For
example, the reference to 'name' in:

> 		self.mergetaxa.append (name)#id of new merged area

appears to depend on a 'name' variable that's outside the class
definition.  Is this what you mean?

> 		one = clone1
> 		two = clone2

Ok, I think we've found a problem here.  The names 'one' and 'two' are now
aliases for the lists.  'one' and 'two' are not copies.  To demonstrate:

###
>>> mylist1 = range(10)
>>> mylist2 = mylist1
>>> mylist1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mylist2
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mylist1.pop()
9
>>> mylist1
[0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> mylist2
[0, 1, 2, 3, 4, 5, 6, 7, 8]
###

This issue may be causing problems the your code.

If we want to copy a list, we need to state that explicitely:

###
>>> from copy import copy
>>> mylist1 = range(10)
>>> mylist2 = copy(mylist1)
>>> mylist1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mylist2
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mylist1.pop()
9
>>> mylist1
[0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> mylist2
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
###

By the way, I want to concentrate on one of the inner loops in the
initializer:

> 		for e in range (0, len(one)):
> 			x = one[e] #extracts info for each taxon in area
> 			a = one[e][0]#gets the id #
> 			if one [e][2]==0: #checks if taxon is alive
> 				one[e][3]= 4
> 				print clone1
> 				self.mergetaxa.append(x)
> 				temp.append(a)
> 			else:
> 				pass

>From the code, I am now guessing that clone1 and clone2 are lists of
taxons.  It also looks like each taxon itself is a list --- the code is
using a list as an ad-hoc data structure.

Let's define a few accessor helper functions to make the loop above easier
to read:

###
def get_taxon_id(taxon):
    return taxon[0]

def is_taxon_alive(taxon):
    return taxon[2] == 0
###

The advantage of using these "accessors" is that it helps to document the
code.  Now the loop can be written as:

###
		for e in range (0, len(one)):
			x = one[e] #extracts info for each taxon in area
			a = get_taxon_id(one[e])
			if is_taxon_alive(one[e]):
				one[e][3]= 4
				print clone1
				self.mergetaxa.append(x)
				temp.append(a)
			else:
				pass
###

Another improvement: we can also iterate across lists directly.  That is,
instead of:

    for e in range(len(one)):
        element = one[e]
        ...

we can more directly iterate on the elements themselves:

    for element in one:
        ...

With these changes, the inner loop can be rewritten as:

###
for taxon in clone1:
    if is_taxon_alive(taxon):
        taxon[4] = 4                 ## What is this doing?
        self.mergetaxa.append(taxon)
        temp.append(get_taxon_id(taxon))
###

And we can summarize the effect of this loop as this: self.mergetaxa
contains all of the alive taxons.  'temp' contains all of those alive
taxon's ids --- it might be good to rename 'temp' to something that
describes those ids.

Anway, let's give this process a name --- let's call it
'collect_alive_taxons()':

###
def collect_alive_taxons(taxons):
    """Given a list of taxons, returns a list of the alive taxons, as well
       as their ids."""
    alive_taxons = []
    alive_taxon_ids = []
    for taxon in taxons:
        if is_taxon_alive(taxon):
            taxon[4] = 4                 ## What is this doing?  What is
                                         ## the fourth field?
            alive_taxons.append(taxon)
            alive_taxon_ids.append(get_taxon_id(taxon))
    return (alive_taxons, alive_taxon_ids)
###

Or something like this... *grin* The important thing is to give a good
name to this process.

If we've refactored the inner loop as a separate function or method, the
program might look something like this:

###

class mergeTaxa: # geodispersal of taxon lists from colliding areas
    def __init__(self, m, y, clone1, clone2):
        self.mergetaxa=[] #empty list
        self.mergetaxa.append (name)#id of new merged area
        alive_taxon_ids = self.collect_alive_taxons_in_mergetaxa(clone1)
        ### ... [second loop and the remainder of the initializer follows
        ###      this]

    def collect_alive_taxons_in_mergetaxa(self, taxons):
        """Given a list of taxons, adds the alive taxons to
           self.mergetaxa.  Returns a list of the alive taxon ids."""
        alive_taxon_ids = []
        for taxon in taxons:
           if is_taxon_alive(taxon):
               taxon[4] = 4                    ## ??
               self.mergetaxa.append(taxon)
               alive_taxon_ids.append(get_taxon_id(taxon))
        return alive_taxon_ids

def get_taxon_id(taxon):
    """Gets the taxon id out of a taxon record."""
    return taxon[0]

def is_taxon_alive(taxon):
    """Returns true if the taxon is alive."""
    return taxon[2] == 0
###

The initializer still needs work.  But I hope this shows that this process
of "refactoring" can improve the design of the code.

I hope this helps!