Checking for unique fields: performance.
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Sun Apr 20 04:12:09 EDT 2008
On Fri, 18 Apr 2008 12:23:08 -0300, Shawn Milochik <Shawn at Milochik.com> wrote:
> I'm looping through a tab-delimited file to gather statistics on fill rates,
> lengths, and uniqueness.
>
> For the uniqueness, I made a dictionary with keys which correspond to the
> field names. The values were originally lists, where I would store values
> found in that field. Once I detected a duplicate, I deleted the entire
> element from the dictionary. Any which remained by the end are considered
> unique. Also, if the value was empty, the dictionary element was deleted and
> that field considered not unique.
>
> A friend of mine suggested changing that dictionary of lists into a
> dictionary of dictionaries, for performance reasons. As it turns out, the
> speed increase was ridiculous -- a file which took 42 minutes to run dropped
> down to six seconds.
A dictionary keyed by field name is perfectly reasonable. But a *list* of values has to be searched linearly for every new value: an O(n) process. As your friend suggested, looking up a key in a dictionary takes O(1) time on average. A set is even better in this case, because you have no use for the values in the inner dictionary (sets and dictionaries are very similar in implementation).
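The difference is easy to measure directly. A small sketch using timeit (the 100,000-element collection and the probed value are illustrative, not from the original post):

```python
import timeit

# Membership tests: a list is scanned linearly (O(n)),
# while a set hashes the value (O(1) on average).
values_list = list(range(100000))
values_set = set(values_list)

# Worst case for the list: the value sits at the very end.
t_list = timeit.timeit(lambda: 99999 in values_list, number=100)
t_set = timeit.timeit(lambda: 99999 in values_set, number=100)

print("list: %.6fs  set: %.6fs" % (t_list, t_set))
```

On any recent machine the set lookup is orders of magnitude faster, which matches the 42-minutes-to-six-seconds drop reported above.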
> Here is the excerpt of the bit of code which checks for uniqueness. It's
> fully functional, so I'm just looking for any suggestions for improving it
> or any comments. Note that fieldNames is a list containing all column
> headers.
>
> #check for unique values
> #if we are still tracking that field (we haven't yet
> #found a duplicate value).
> if fieldUnique.has_key(fieldNames[index]):
>     #if the current value is a duplicate
>     if fieldUnique[fieldNames[index]].has_key(value):
>         #sys.stderr.write("Field %s is not unique. Found a duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
>         #drop the whole hash element
>         fieldUnique.__delitem__(fieldNames[index])
>     else:
>         #add the new value to the list
>         fieldUnique[fieldNames[index]][value] = 1
>
- Instead of using fieldNames[index] all over the place, save it in a variable.
- Your code doesn't show it, but if you are traversing the fieldNames list like this:

      for index in range(len(fieldNames)):
          ... using fieldNames[index] ...

  it's better to iterate over the items directly:

      for fieldName in fieldNames:
          ... using fieldName all over the place ...

- Instead of a.has_key(b), use: b in a
- Instead of fieldUnique.__delitem__(fieldNames[index]), use: del fieldUnique[fieldName]
  (In normal code, __special__ methods are never called directly.)
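A quick illustration of those modern spellings (a throwaway dict, not the OP's data):

```python
# Illustrative only: show the preferred spellings side by side.
fieldUnique = {"city": set(["Lima"])}

# d.has_key(k)      ->  k in d
assert "city" in fieldUnique

# d.__delitem__(k)  ->  del d[k]
del fieldUnique["city"]
assert "city" not in fieldUnique

# for i in range(len(seq)): ...seq[i]...  ->  for item in seq: ...item...
for fieldName in ["id", "city"]:
    print(fieldName)
```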
#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
if fieldName in fieldUnique:
    #if the current value is a duplicate
    if value in fieldUnique[fieldName]:
        #drop the whole field as it's not unique
        del fieldUnique[fieldName]
    else:
        #add the new value to the set
        fieldUnique[fieldName].add(value)
else:
    # use a set to store the unique values
    fieldUnique[fieldName] = set([value])
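Putting the pieces together, the whole uniqueness pass might look like this. This is a hedged sketch, not the OP's program: the `unique_fields` name, the csv-based parsing, and the driver are my additions; the empty-value rule ("empty means not unique") is taken from the description above.

```python
import csv
import io

def unique_fields(lines):
    """Return the set of field names whose column values are all
    non-empty and unique. `lines` is an iterable of tab-delimited
    rows, the first being the header row."""
    reader = csv.reader(lines, delimiter="\t")
    fieldNames = next(reader)
    # Start out tracking every field; stop tracking one as soon as
    # a duplicate (or an empty value) shows up in its column.
    fieldUnique = dict((name, set()) for name in fieldNames)
    for row in reader:
        for fieldName, value in zip(fieldNames, row):
            if fieldName in fieldUnique:
                if not value or value in fieldUnique[fieldName]:
                    # duplicate or empty: the field is not unique
                    del fieldUnique[fieldName]
                else:
                    fieldUnique[fieldName].add(value)
    return set(fieldUnique)

data = "id\tcity\n1\tLima\n2\tLima\n3\tParis\n"
print(unique_fields(io.StringIO(data)))  # prints {'id'}
```

Only one membership test and one set insertion per cell, so the whole file is processed in a single O(rows x columns) pass.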
--
Gabriel Genellina