Checking for unique fields: performance.

Shawn Milochik Shawn at Milochik.com
Fri Apr 18 11:23:08 EDT 2008


I'm looping through a tab-delimited file to gather statistics on fill rates,
lengths, and uniqueness.

For the uniqueness check, I made a dictionary whose keys correspond to the
field names. The values were originally lists, in which I stored the values
found in each field. Once I detected a duplicate, I deleted that entire
entry from the dictionary. Any fields that remained at the end were
considered unique. Also, if a value was empty, the dictionary entry was
deleted and that field was considered not unique.

A friend of mine suggested changing that dictionary of lists into a
dictionary of dictionaries, for performance reasons. As it turns out, the
speed increase was ridiculous -- a file that took 42 minutes to run dropped
to six seconds.
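
The difference, as I understand it, comes down to how membership is tested:
checking whether a value is already in a list scans the list element by
element, while checking a dictionary key is a single hash lookup. A rough
sketch of the two checks, with made-up names rather than my actual variables:

    # Rough sketch only -- "seen_list" and "seen_dict" are made-up names,
    # not the variables in my script.
    seen_list = []    # original approach: one list per field
    seen_dict = {}    # suggested approach: one dict per field

    value = "some field value"

    # List version: "value in seen_list" walks the whole list,
    # so each check gets slower as more values accumulate.
    if value not in seen_list:
        seen_list.append(value)

    # Dict version: "value in seen_dict" is a hash lookup,
    # which takes roughly constant time regardless of size.
    if value not in seen_dict:
        seen_dict[value] = 1

With thousands of values per field, that per-check cost adds up quickly,
which would explain the drop from 42 minutes to six seconds.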

Here is an excerpt of the code that checks for uniqueness. It's fully
functional, so I'm just looking for any suggestions for improving it, or any
other comments. Note that fieldNames is a list containing all of the column
headers.

            #check for unique values
            #if we are still tracking that field (we haven't yet
            #found a duplicate value)
            if fieldNames[index] in fieldUnique:
                #if the current value is a duplicate
                if value in fieldUnique[fieldNames[index]]:
                    #sys.stderr.write("Field %s is not unique. Found a duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
                    #drop the whole entry from the dictionary
                    del fieldUnique[fieldNames[index]]
                else:
                    #add the new value to the inner dictionary
                    fieldUnique[fieldNames[index]][value] = 1
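
One variation I've been considering (just a sketch, not tested against my
real data) is to track the seen values in a set instead of a dictionary of
1s, and to pull fieldNames[index] into a local name. It assumes the
surrounding loop still provides index, value, and fieldUnique as above, and
that fieldUnique was initialized to map each field name to an empty set:

            #sketch only -- assumes fieldUnique was built as
            #  fieldUnique = dict((name, set()) for name in fieldNames)
            fieldName = fieldNames[index]

            #only check fields we are still tracking
            if fieldName in fieldUnique:
                if value in fieldUnique[fieldName]:
                    #duplicate found -- stop tracking this field
                    del fieldUnique[fieldName]
                else:
                    #remember the value for later comparisons
                    fieldUnique[fieldName].add(value)

A set makes the intent (membership testing only) a bit clearer than storing
dummy 1 values, though the lookup cost should be about the same.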