[Tutor] Summing part of a list

Tue May 9 18:54:09 CEST 2006

Matthew Webber wrote:
> I have a list that looks a bit like this -
>  
> [(u'gbr', 30505), (u'fra', 476), (u'ita', 364), (u'ger', 299),
> (u'fin', 6), (u'ven', 6), (u'chi', 3), (u'hun', 3), (u'mar', 3),
> (u'lux', 2), (u'smo', 2), (u'tch', 2), (u'aho', 1), (u'ber', 1)]
> 
> The list items are tuples, the first item of which is a country code, and
> the second of which is a numeric count. The list is guaranteed SORTED in
> descending order of the numeric count.
> 
> What I need is a list with all the members whose count is 3 or less
> replaced by a single member with the counts added together. In this case, I
> want :
> [(u'gbr', 30505), (u'fra', 476), (u'ita', 364), (u'ger', 299),
> (u'fin', 6), (u'ven', 6), (u'OTHER', 17)]
> 
> Any ideas about neat ways to do this? The simplest way is to just build the
> new list with a basic loop over the original list. A slightly more
> sophisticated way is to split the original list using a list comprehension
> with an IF clause.
> 
> I have the feeling that there's probably really neat and more Pythonic way -
> there are possibilities like zip, map, itertools. Any hints about what to
> look at? Remember that the list is sorted already. If you can point me in
> the right direction, I'm sure I can work out the specifics of the code.

Hmm, must be generator day today. Here is a generator that does what you 
want:

In [1]: data = [(u'gbr', 30505), (u'fra', 476), (u'ita', 364), (u'ger', 299),
    ...: (u'fin', 6), (u'ven', 6), (u'chi', 3), (u'hun', 3), (u'mar', 3),
    ...: (u'lux', 2), (u'smo', 2), (u'tch', 2), (u'aho', 1), (u'ber', 1)]

In [10]: def summarize(data):
    ....:     sum = 0
    ....:     othersFound = False
    ....:     for item in data:
    ....:         if item[1] > 3:
    ....:             yield item
    ....:         else:
    ....:             sum += item[1]
    ....:             othersFound = True
    ....:     if othersFound:
    ....:         yield ('OTHER', sum)
    ....:

In [11]: print list(summarize(data))
[(u'gbr', 30505), (u'fra', 476), (u'ita', 364), (u'ger', 299), (u'fin', 6), (u'ven', 6), ('OTHER', 17)]

In [12]: print list(summarize(data[:4]))
[(u'gbr', 30505), (u'fra', 476), (u'ita', 364), (u'ger', 299)]

The loop yields all the pairs that have values bigger than three. When 
the end of the loop is reached, if there were any residuals the 'OTHER' 
item is yielded. (If the values are always >0 you could get rid of the 
othersFound flag and just use sum as the flag.)
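
As a minimal sketch of the flag-free variant mentioned in the parenthesis, 
assuming every count is strictly positive: the running total can double as 
the "were there any residuals" test. The names summarize_no_flag, total and 
threshold are hypothetical, chosen for illustration (total also avoids 
shadowing the built-in sum), and the code is written to run under Python 3:

```python
def summarize_no_flag(data, threshold=3):
    """Yield pairs whose count exceeds threshold, then one ('OTHER', total) pair.

    Assumption: every count is > 0, so a nonzero total means at least
    one small item was seen. With zero or negative counts this test fails.
    """
    total = 0
    for item in data:
        if item[1] > threshold:
            yield item
        else:
            total += item[1]
    if total:  # nonzero total stands in for the othersFound flag
        yield ('OTHER', total)


data = [('gbr', 30505), ('fra', 476), ('ita', 364), ('ger', 299),
        ('fin', 6), ('ven', 6), ('chi', 3), ('hun', 3), ('mar', 3),
        ('lux', 2), ('smo', 2), ('tch', 2), ('aho', 1), ('ber', 1)]

result = list(summarize_no_flag(data))
# result ends with ('OTHER', 17); summarize_no_flag(data[:4]) yields no OTHER pair
```

If a count of 0 can occur, the explicit flag is the safer choice, since a 
group of zero-count items would otherwise vanish without an 'OTHER' entry.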

You could probably work out a solution using itertools.groupby() also, 
or you could use two list comps, one filtering for value > 3 and one 
filtering for value <= 3. This gives a one-line solution but it iterates 
over the list twice. List comps are fast enough that this might actually 
be faster than the generator solution:

In [5]: [ item for item in data if item[1] > 3 ] + [('OTHER',
    ...: sum([item[1] for item in data if item[1] <= 3]))]
Out[5]:
[(u'gbr', 30505),
  (u'fra', 476),
  (u'ita', 364),
  (u'ger', 299),
  (u'fin', 6),
  (u'ven', 6),
  ('OTHER', 17)]
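
For the itertools.groupby() idea mentioned above, here is one possible 
sketch, not a worked-out solution from this thread. It groups consecutive 
items by whether their count exceeds the threshold. The function name 
summarize_groupby and the threshold parameter are hypothetical, and the 
code is written to run under Python 3:

```python
from itertools import groupby


def summarize_groupby(data, threshold=3):
    """Collapse each consecutive run of small-count pairs into one 'OTHER' pair.

    groupby() only groups *adjacent* items with the same key, so a single
    'OTHER' entry is guaranteed only when the small items are contiguous,
    as they are in a list sorted by count.
    """
    result = []
    for is_big, group in groupby(data, key=lambda item: item[1] > threshold):
        if is_big:
            result.extend(group)  # keep large items unchanged
        else:
            result.append(('OTHER', sum(count for _, count in group)))
    return result


data = [('gbr', 30505), ('fra', 476), ('ita', 364), ('ger', 299),
        ('fin', 6), ('ven', 6), ('chi', 3), ('hun', 3), ('mar', 3),
        ('lux', 2), ('smo', 2), ('tch', 2), ('aho', 1), ('ber', 1)]

summary = summarize_groupby(data)
# summary is the six large pairs followed by ('OTHER', 17)
```

Two differences from the one-liner are worth noting. This version adds no 
'OTHER' entry when nothing falls at or below the threshold, whereas the 
one-liner always appends one (with a sum of 0 in that case). On the other 
hand, it does depend on the ordering: on unsorted input it produces one 
'OTHER' entry per run of small items.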

Neither of these solutions relies on the list being sorted. In fact I 
originally wrote a generator that did rely on the list being sorted and 
it was longer than the one I show here!
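
For comparison, a version that does take advantage of the sort order might 
look like the sketch below. This is not the generator referred to above, 
which is not shown in this message; it is just one illustrative way to do 
it using itertools.takewhile(). The name summarize_sorted and the threshold 
parameter are hypothetical. It assumes data is a list (it slices it) sorted 
in descending order of count, and it is written to run under Python 3:

```python
from itertools import takewhile


def summarize_sorted(data, threshold=3):
    """Split a descending-sorted list at the first pair with count <= threshold.

    Assumption: data is a list sorted by count, largest first, so every
    pair after the split point also has a small count.
    """
    # Stop comparing as soon as the first small item is reached.
    head = list(takewhile(lambda item: item[1] > threshold, data))
    tail = data[len(head):]
    if tail:
        head.append(('OTHER', sum(count for _, count in tail)))
    return head


data = [('gbr', 30505), ('fra', 476), ('ita', 364), ('ger', 299),
        ('fin', 6), ('ven', 6), ('chi', 3), ('hun', 3), ('mar', 3),
        ('lux', 2), ('smo', 2), ('tch', 2), ('aho', 1), ('ber', 1)]

summary_sorted = summarize_sorted(data)
# summary_sorted is the six large pairs followed by ('OTHER', 17)
```

The saving is small: the threshold comparison stops at the first small 
item, but the tail still has to be summed, and on unsorted input the 
result would silently be wrong.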

HTH
Kent