Merging sorted lists/iterators/generators into one stream of values...
Lasse Vågsæther Karlsen
lasse at vkarlsen.no
Tue Oct 11 08:51:29 EDT 2005
Lasse Vågsæther Karlsen wrote:
> <snip>
>
> Another idea for this method would be that in some cases I noticed that
> it was useful to know which source each element would come from as well,
> as well as removing duplicates from the results.
>
<snip>
The "removing duplicates" problem would probably be best as a separate
function and it occurs to me that perhaps Python has such a function
already.
Again, this function would need the following criteria:
1. Be able to iterate through something other than a list
2. Yield the values, not return a list
3. Take an arbitrary cmp function to determine what is a duplicate
As sugar, perhaps also the following criteria:
- Ability to "combine" the duplicates through a special function
A simple first-version function I hacked together does this:
def unique(source, cmp=cmp, key=None, combine=None):
it = iter(source)
first = True
value = it.next()
values = [value]
while True:
try:
value = it.next()
if key is not None:
cmp_result = cmp(values[0][key], value[key])
else:
cmp_result = cmp(values[0], value)
if cmp_result == 0:
values.append(value)
else:
if combine is not None:
yield combine(values)
else:
yield values[0]
values = [value]
except StopIteration:
if combine is not None:
yield combine(values)
else:
yield values[0]
break
raise StopIteration
Note that this function does not do any sorting so if the source that it
gets the values from is not sorted, the result will be very wrong. This
is again due to my criteria of being able to handle cursors retrieving
data from a database and thus avoid loading everything into memory.
The combine function is passed a list of "duplicate" values and must
return a value that will be yielded out of unique.
Example of usage:
def sum_counts(values):
value = values[0][0]
sum = 0
for row in values:
sum += row[1]
return value, sum
fruits = [["Apple", 10], ["Apple", 15], ["Banana", 23], ["Orange", 17],
["Orange", 17]]
for fruit, total_sum in unique(fruits, key=0, combine=sum_counts):
print fruit, "has a sum of", total_sum
This will produce:
Apple has a sum of 25
Banana has a sum of 23
Orange has a sum of 34
Function name is perhaps not the best one. It occurs to me that this is
the GROUP BY function in SQL so perhaps a different name is better, but
then again this might be moot if such a function already exists somewhere :)
--
Lasse Vågsæther Karlsen
http://usinglvkblog.blogspot.com/
mailto:lasse at vkarlsen.no
PGP KeyID: 0x2A42A1C2
More information about the Python-list
mailing list