in place list modification necessary? What's a better idiom?

Tue Apr 7 01:43:45 EDT 2009

MooMaster wrote:
> So I'm reading in values from a file, and for each column I need to
> dynamically discover the range of possible values it can take and
> quantize if necessary. This is the solution I've come up with:
>
> <code>
> def createInitialCluster(fileName):
>     #get the data from the file
>     points = []
>     with open(fileName, 'r') as f:
>         for line in f:
>             points.append(line.rstrip('\n'))
>     #clean up the data
>     fixedPoints = []
>     for point in points:
>         dimensions = [quantize(i, points, point.split(",").index(i))
> for i in point.split(",")]
>         print dimensions
>         fixedPoints.append(Point(dimensions))
>     #return an initial cluster of all the points
>     return Cluster(fixedPoints)
>
> def quantize(stringToQuantize, pointList, columnIndex):
>     #if it's numeric, no need to quantize
>     if(isNumeric(stringToQuantize)):
>         return float(stringToQuantize)
>     #first we need to discover all the possible values of this column
>     domain = []
>     for point in pointList:
>         domain.append(point.split(",")[columnIndex])
>     #make it a set to remove duplicates
>     domain = list(Set(domain))
>     #use the index into the domain as the number representing this
> value
>     return float(domain.index(stringToQuantize))
>
> #harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
> def isNumeric(string):
>     try:
>         i = float(string)
>     except ValueError:
>         return False
>     return True

Big problem with this.  I'm guessing you never ran it on a really big
file yet.  Your quantize function does a lot of unnecessary work: it
rebuilds the list of indices for every single line, every single non-
numeric entry, in fact.  (Tech-speak: that is N^2 behavior in both the
number of lines and number of non-numeric columns.  Not good.)  This
will work ok for a small file, but will take forever on a large file
(perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating the list of indices once.  Do that in a
separete function, have it return the list, and then pass that list to
the quantize function.  So replace the midportion of your
createInitialCluster function with something like this:

    ....
    for i in xrange(len(points[0])): # number of columns
        column_indices.append(quantize_column(points,i))
    fixedPoints = []
    for point in points:
        dimensions = [quantize(s, column_indices[i], point.split
(",").index(i))
                for (i,s) in enumerate(point.split(","))] # need index
as well as entry here
        print dimensions
        fixedPoints.append(Point(dimensions))
    ....

And the two functions would be something like this:

def quantize_columns(point_list,column_index):
    # this assumes if the first column is numeric the whole column
would be
    if(isNumeric(point_list[0][column_index])):
        return None # don't quantize this column
    #first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(",")[column_index])
    #make it a set to remove duplicates
    return list(set(domain))

def quantize(i,domain,s):
    if domain is None:
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.

Now back to your question.

> It works, but it feels a little ugly, and not exactly Pythonic. Using
> two lists I need the original point list to read in the data, then the
> dimensions one to hold the processed point, and a fixedPoint list to
> make objects out of the processed data. If my dataset is in the order
> of millions, this'll nuke the memory. I tried something like:
>
> for point in points:
>    point = Point([quantize(i, points, point.split(",").index(i)) for i
> in point.split(",")])

> but when I print out the points afterward, it doesn't keep the
> changes.

It's because the order of items in a set is undefined.  The order of
the items in list(set(["a","b","c","d"])) might be very different from
the order in list(set(["a","b","c","d","e"])).  You were passing
quantize incomplete an incomplete list of points, so as the points
grew, the items in the set changed, and it messed up the order.  In
fact,  you should never rely on the order being the same, even if you
created the set with the very same arguments.

What you are trying to do should be done with dictionaries: create a
dict that maps a value to a number.

Now, based on your quantize function, it would seem that the number
associated with the value is arbitrary (doesn't matter what it is, as
long as it's distinct), so it isn't necessary to read the whole csv
file in before assigning numbers; just build the dict as you go.

I suggest collections.defaultdict for this task.  It's like a regular
dict, but it creates a new value any time you access a key that
doesn't exist, a perfect solution to your task.  We'll pass it a
function that generates a different index each time it's called.
(This is probably too advanced for you, but oh well, it's such a cool
trick.)

import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing
numbers
    # to values when reading keys that don't yet exit
    quantization = defaultdict.collections(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)

A couple general pointers:

* Don't ever use i to represent a string.  Programmers expect i to be
an integer.  i,j,k,l,m, and n should be integers, in fact.  u,v,w,x,y,
and z should be floats.  You should stick to this convention whenever
possible, but definitely never use i for anything but an integer.

* set is built into Python now; unless you're using an older version
(2.3 I think) you should use set instead of Set.

* The Python style guide (q.g.) recommends that variables use names
such as column_index rather than columnIndex.  The world won't end if
you don't follow it but if you want to be Pythonic that's how.

Carl Banks