[Tutor] Re: Unique Items in Lists

Brian van den Broek bvande at po-box.mcgill.ca
Thu Jan 27 21:19:42 CET 2005


Kent Johnson said unto the world upon 2005-01-27 05:57:
> Brian van den Broek wrote:
> 
>> Wolfram Kraus said unto the world upon 2005-01-27 03:24:

<SNIP>

>>> This whole part can be rewritten (without sorting, but in Py2.4 you 
>>> can use sorted() for this) with a list comprehension (Old Python2.1 
>>> style, with a newer version the keys() aren't needed):
>>>       for k, v in [(k, items_dict[k]) \
>>>       for k in items_dict.keys() if items_dict[k] > 1]:
>>>           print '%s occurred %s times' % (k, v)
> 
> 
> I think it is clearer to filter the list as it is printed. And 
> dict.iteritems() is handy here, too.
> 
> for k, v in items_dict.iteritems():
>   if v > 1:
>     print '%s occurred %s times' % (k, v)
> 
> Kent

Hi all,

Incorporating some of Wolfram's and Kent's suggestions (hope I've 
missed no one):

<code>
def dups_in_list_report(a_list):
    '''Prints a duplication report for a list.'''

    items_dict = {}

    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1

    for k, v in sorted(items_dict.iteritems()):   # cf. question below
        if v > 1:
            print '%s occurred %s times' % (k, v)
</code>
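For anyone following along, the counting pass on its own looks like 
this (a small sketch with made-up data; it runs the same in old and 
new Pythons):

```python
def count_items(a_list):
    # dict.get(i, 0) returns the current count, or 0 the first
    # time an item is seen, so one line handles both cases.
    items_dict = {}
    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1
    return items_dict

counts = count_items(['a', 'b', 'a', 'c', 'a', 'b'])
# counts is {'a': 3, 'b': 2, 'c': 1}
```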

And, I can't but agree that this is much better! Thanks folks.

In trying to improve the code, I first had:

    for key in sorted(items_dict.keys()):
        if items_dict[key] > 1:
            print '%s occurred %s times' % (key, items_dict[key])

in place of the for loop over .iteritems(). Am I right in thinking 
that the advantage of Kent's suggestion of .iteritems() is that it 
eliminates some of the dict lookups? Other advantages?
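To make the comparison concrete, here is the difference side by side 
(a sketch with made-up data; I've used .items() so it runs as-is, but 
the same point holds for .iteritems(), which additionally avoids 
building an intermediate list in Python 2):

```python
items_dict = {'spam': 3, 'eggs': 1, 'ham': 2}

# Style 1: iterate over the keys, then look each value up again.
dups_by_keys = []
for key in sorted(items_dict.keys()):
    if items_dict[key] > 1:                      # one extra lookup ...
        dups_by_keys.append((key, items_dict[key]))  # ... and another

# Style 2: the (key, value) pairs come out of the dict together,
# so no per-key re-lookup is needed.
dups_by_items = []
for k, v in sorted(items_dict.items()):
    if v > 1:
        dups_by_items.append((k, v))

# Both give [('ham', 2), ('spam', 3)]
```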

Finally, in the first instance, I was aiming for the OP's stated end. 
To make this more general and reusable, I think I'd do:

<code>
def get_list_dup_dict(a_list, threshold=1):
    '''Returns a dict of items in a_list occurring at least threshold
    many times.

    threshold defaults to 1. The dict returned has items occurring at
    least threshold many times as keys, and their number of occurrences
    as values.
    '''

    items_dict, dup_dict = {}, {}   # Question below

    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1

    for k, v in items_dict.iteritems():
        if v >= threshold:
            dup_dict[k] = v   # Question below

    return dup_dict

def print_list_dup_report(a_list, threshold=1):
    '''Prints a report of items in a_list occurring at least threshold
    many times.

    threshold defaults to 1. get_list_dup_dict(a_list, threshold) is
    called, returning a dict with the items of a_list that occur at
    least threshold many times as keys and their number of occurrences
    as values.

    This dict is looped over to print a sorted and formatted duplication
    report.
    '''

    dup_dict = get_list_dup_dict(a_list, threshold)
    for k, v in sorted(dup_dict.iteritems()):
        print '%s occurred %s times' % (k, v)
</code>
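For the record, a quick usage sketch of get_list_dup_dict, with the 
function repeated inline so the snippet stands alone (the word list is 
made up; .items() stands in for .iteritems() so it runs anywhere):

```python
def get_list_dup_dict(a_list, threshold=1):
    items_dict, dup_dict = {}, {}
    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1
    for k, v in items_dict.items():   # .iteritems() in Python 2
        if v >= threshold:
            dup_dict[k] = v
    return dup_dict

words = ['to', 'be', 'or', 'not', 'to', 'be']
# threshold=2 keeps only genuine duplicates:
dups = get_list_dup_dict(words, threshold=2)
# dups == {'to': 2, 'be': 2}
```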


My question (from comment in new code):

Since I've split the task into two functions, one to return a 
duplication dictionary, the other to print a report based on it, I 
think the distinct dup_dict is needed. (I do want the 
get_list_dup_dict function to return a dict for possible use in other 
contexts.)

The alternative would be to iterate over a .copy() of the items_dict 
and delete items not meeting the threshold from items_dict, returning 
the pruned items_dict at the end. But dup_dict is guaranteed to be no 
larger than items_dict, and is strictly smaller whenever some item 
falls below the threshold (with threshold set to 1, every item passes, 
so the two are the same size). So, the distinct dup_dict way seems 
better for memory.
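The pruning alternative I described would look something like this 
sketch (iterating over a snapshot of the keys rather than a full 
.copy(), since deleting from a dict while iterating over it directly 
is not allowed; data made up):

```python
def prune_below_threshold(items_dict, threshold):
    # Delete in place rather than building a second dict.
    # list(items_dict.keys()) takes a snapshot of the keys, so the
    # deletions do not disturb the iteration.
    for k in list(items_dict.keys()):
        if items_dict[k] < threshold:
            del items_dict[k]
    return items_dict

counts = {'a': 3, 'b': 1, 'c': 2}
pruned = prune_below_threshold(counts, 2)
# pruned is {'a': 3, 'c': 2} -- and counts itself has been modified
```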

Am I overlooking yet another dict technique that would help, here? Any 
other improvements?

Thanks and best to all,

Brian vdB
