Merging sorted lists/iterators/generators into one stream of values...

Lasse Vågsæther Karlsen lasse at vkarlsen.no
Tue Oct 11 08:51:29 EDT 2005


Lasse Vågsæther Karlsen wrote:
> <snip>
> 
> Another idea for this method would be that in some cases I noticed that 
> it was useful to know which source each element would come from as well, 
> as well as removing duplicates from the results.
> 
<snip>

The "removing duplicates" problem would probably be best as a separate 
function and it occurs to me that perhaps Python has such a function 
already.

Again, this function would need the following criteria:

1. Be able to iterate through something other than a list
2. Yield the values, not return a list
3. Take an arbitrary cmp function to determine what is a duplicate

As sugar, perhaps also the following criteria:

- Ability to "combine" the duplicates through a special function

A simple first-version function I hacked together does this:

def unique(source, cmp=cmp, key=None, combine=None):
     it = iter(source)
     first = True
     value = it.next()
     values = [value]
     while True:
         try:
             value = it.next()
             if key is not None:
                 cmp_result = cmp(values[0][key], value[key])
             else:
                 cmp_result = cmp(values[0], value)
             if cmp_result == 0:
                 values.append(value)
             else:
                 if combine is not None:
                     yield combine(values)
                 else:
                     yield values[0]
                 values = [value]
         except StopIteration:
             if combine is not None:
                 yield combine(values)
             else:
                 yield values[0]
             break
     raise StopIteration

Note that this function does not do any sorting so if the source that it 
gets the values from is not sorted, the result will be very wrong. This 
is again due to my criteria of being able to handle cursors retrieving 
data from a database and thus avoid loading everything into memory.

The combine function is passed a list of "duplicate" values and must 
return a value that will be yielded out of unique.

Example of usage:

def sum_counts(values):
     value = values[0][0]
     sum = 0
     for row in values:
         sum += row[1]
     return value, sum

fruits = [["Apple", 10], ["Apple", 15], ["Banana", 23], ["Orange", 17], 
["Orange", 17]]
for fruit, total_sum in unique(fruits, key=0, combine=sum_counts):
     print fruit, "has a sum of", total_sum

This will produce:

Apple has a sum of 25
Banana has a sum of 23
Orange has a sum of 34

Function name is perhaps not the best one. It occurs to me that this is 
the GROUP BY function in SQL so perhaps a different name is better, but 
then again this might be moot if such a function already exists somewhere :)

-- 
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:lasse at vkarlsen.no
PGP KeyID: 0x2A42A1C2



More information about the Python-list mailing list