Checking for unique fields: performance.
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Sun Apr 20 04:12:09 EDT 2008
On Fri, 18 Apr 2008 12:23:08 -0300, Shawn Milochik <Shawn at Milochik.com> wrote:
> I'm looping through a tab-delimited file to gather statistics on fill rates,
> lengths, and uniqueness.
>
> For the uniqueness, I made a dictionary with keys which correspond to the
> field names. The values were originally lists, where I would store values
> found in that field. Once I detected a duplicate, I deleted the entire
> element from the dictionary. Any which remained by the end are considered
> unique. Also, if the value was empty, the dictionary element was deleted and
> that field considered not unique.
>
> A friend of mine suggested changing that dictionary of lists into a
> dictionary of dictionaries, for performance reasons. As it turns out, the
> speed increase was ridiculous -- a file which took 42 minutes to run dropped
> down to six seconds.
A dictionary keyed by field name is perfectly reasonable. But a *list* of values has to be searched linearly for every new value: an O(n) process. As your friend suggested, looking up a key in a dictionary takes O(1) time on average. A set is even better in this case, because you have no use for the values in the inner dictionary (sets and dictionaries are very similar in implementation).
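The difference is easy to measure directly. A small sketch using timeit (the 100,000-element collection and the probed value are illustrative, not from the original post):

```python
import timeit

# Membership tests: a list is scanned linearly (O(n)),
# while a set hashes the value (O(1) on average).
values_list = list(range(100000))
values_set = set(values_list)

# Worst case for the list: the value sits at the very end.
t_list = timeit.timeit(lambda: 99999 in values_list, number=100)
t_set = timeit.timeit(lambda: 99999 in values_set, number=100)

print("list: %.6fs  set: %.6fs" % (t_list, t_set))
```

On any recent machine the set lookup is orders of magnitude faster, which matches the 42-minutes-to-six-seconds drop reported above.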
> Here is the excerpt of the bit of code which checks for uniqueness. It's
> fully functional, so I'm just looking for any suggestions for improving it
> or any comments. Note that fieldNames is a list containing all column
> headers.
>
> #check for unique values
> #if we are still tracking that field (we haven't yet
> #found a duplicate value).
> if fieldUnique.has_key(fieldNames[index]):
>     #if the current value is a duplicate
>     if fieldUnique[fieldNames[index]].has_key(value):
>         #sys.stderr.write("Field %s is not unique. Found a duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
>         #drop the whole hash element
>         fieldUnique.__delitem__(fieldNames[index])
>     else:
>         #add the new value to the list
>         fieldUnique[fieldNames[index]][value] = 1
>
- Instead of using fieldNames[index] all over the place, save it in a variable.
- Your code doesn't show it, but if you are traversing the fieldNames list like this:

      for index in range(len(fieldNames)):
          ... using fieldNames[index] ...

  it's better to iterate over the items directly:

      for fieldName in fieldNames:
          ... using fieldName all over the place ...

- Instead of a.has_key(b), use: b in a
- Instead of fieldUnique.__delitem__(fieldNames[index]), use: del fieldUnique[fieldName]
  (In normal code, __special__ methods are never called directly.)
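A quick illustration of those modern spellings (a throwaway dict, not the OP's data):

```python
# Illustrative only: show the preferred spellings side by side.
fieldUnique = {"city": set(["Lima"])}

# d.has_key(k)      ->  k in d
assert "city" in fieldUnique

# d.__delitem__(k)  ->  del d[k]
del fieldUnique["city"]
assert "city" not in fieldUnique

# for i in range(len(seq)): ...seq[i]...  ->  for item in seq: ...item...
for fieldName in ["id", "city"]:
    print(fieldName)
```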
#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
if fieldName in fieldUnique:
    #if the current value is a duplicate
    if value in fieldUnique[fieldName]:
        #drop the whole field as it's not unique
        del fieldUnique[fieldName]
    else:
        #add the new value to the set
        fieldUnique[fieldName].add(value)
else:
    # use a set to store the unique values
    fieldUnique[fieldName] = set([value])
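Putting the pieces together, the whole uniqueness pass might look like this. This is a hedged sketch, not the OP's program: the `unique_fields` name, the csv-based parsing, and the driver are my additions; the empty-value rule ("empty means not unique") is taken from the description above.

```python
import csv
import io

def unique_fields(lines):
    """Return the set of field names whose column values are all
    non-empty and unique. `lines` is an iterable of tab-delimited
    rows, the first being the header row."""
    reader = csv.reader(lines, delimiter="\t")
    fieldNames = next(reader)
    # Start out tracking every field; stop tracking one as soon as
    # a duplicate (or an empty value) shows up in its column.
    fieldUnique = dict((name, set()) for name in fieldNames)
    for row in reader:
        for fieldName, value in zip(fieldNames, row):
            if fieldName in fieldUnique:
                if not value or value in fieldUnique[fieldName]:
                    # duplicate or empty: the field is not unique
                    del fieldUnique[fieldName]
                else:
                    fieldUnique[fieldName].add(value)
    return set(fieldUnique)

data = "id\tcity\n1\tLima\n2\tLima\n3\tParis\n"
print(unique_fields(io.StringIO(data)))  # prints {'id'}
```

Only one membership test and one set insertion per cell, so the whole file is processed in a single O(rows x columns) pass.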
--
Gabriel Genellina