[Tutor] Re: Unique Items in Lists
Brian van den Broek
bvande at po-box.mcgill.ca
Thu Jan 27 21:19:42 CET 2005
Kent Johnson said unto the world upon 2005-01-27 05:57:
> Brian van den Broek wrote:
>
>> Wolfram Kraus said unto the world upon 2005-01-27 03:24:
<SNIP>
>>> This whole part can be rewritten (without sorting, but in Py2.4 you
>>> can use sorted() for this) with a list comprehension (Old Python2.1
>>> style, with a newer version the keys() aren't needed):
>>> for k,v in [(k, items_dict[k]) \
>>>             for k in items_dict.keys() if items_dict[k] > 1]:
>>>     print '%s occurred %s times' % (k, v)
>
>
> I think it is clearer to filter the list as it is printed. And
> dict.iteritems() is handy here, too.
>
> for k, v in items_dict.iteritems():
>     if v > 1:
>         print '%s occurred %s times' % (k, v)
>
> Kent
Hi all,
incorporating some of Wolfram's and Kent's (hope I've missed no one)
suggestions:
<code>
def dups_in_list_report(a_list):
    '''Prints a duplication report for a list.'''
    items_dict = {}
    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1
    for k, v in sorted(items_dict.iteritems()):    # cf below
        if v > 1:
            print '%s occurred %s times' % (k, v)
</code>
And, I can't but agree that this is much better! Thanks folks.
In trying to improve the code, I first had:
    for key in sorted(items_dict.keys()):
        if items_dict[key] > 1:
            print '%s occurred %s times' % (key, items_dict[key])
in place of the for loop over .iteritems(). Am I right in thinking
that the advantage of Kent's suggestion of .iteritems() is that it
eliminates some of the dict lookups? Other advantages?
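A minimal sketch of the difference, assuming a small hand-made items_dict. The two function names and the sample counts are made up for illustration, and the lines are collected in a list rather than printed so the two versions can be compared. It uses .items() so that it also runs on Python versions without .iteritems(); in Python 2, .items() builds a list of pairs where .iteritems() yields them lazily.

```python
def report_by_keys(items_dict):
    '''Collects report lines by looping over the sorted keys.'''
    lines = []
    for key in sorted(items_dict.keys()):
        # one lookup for the test, a second one for the report line
        if items_dict[key] > 1:
            lines.append('%s occurred %s times' % (key, items_dict[key]))
    return lines

def report_by_items(items_dict):
    '''Collects the same lines by looping over sorted (key, value) pairs.'''
    lines = []
    for k, v in sorted(items_dict.items()):
        # the value arrives together with its key, so no lookup is needed
        if v > 1:
            lines.append('%s occurred %s times' % (k, v))
    return lines

sample = {'a': 3, 'b': 1, 'c': 2}    # made-up counts
assert report_by_keys(sample) == report_by_items(sample)
for line in report_by_items(sample):
    print(line)
```

Both versions give the same report; the second just avoids looking each value up by key.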
Finally, in the first instance, I was aiming for the OP's stated end.
To make this more general and reusable, I think I'd do:
<code>
def get_list_dup_dict(a_list, threshold=1):
    '''Returns a dict of the items in a_list occurring at least threshold times.

    threshold defaults to 1. The dict returned has the items occurring at
    least threshold many times as keys, and their numbers of occurrences
    as values.
    '''
    items_dict, dup_dict = {}, {}    # Question below
    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1
    for k, v in items_dict.iteritems():
        if v >= threshold:
            dup_dict[k] = v    # Question below
    return dup_dict

def print_list_dup_report(a_list, threshold=1):
    '''Prints a report of the items in a_list occurring at least threshold times.

    threshold defaults to 1. get_list_dup_dict(a_list, threshold) is
    called, returning a dict with the items in a_list that occur at least
    threshold many times as keys and their numbers of occurrences as values.
    This dict is looped over to print a sorted and formatted duplication
    report.
    '''
    dup_dict = get_list_dup_dict(a_list, threshold)
    for k, v in sorted(dup_dict.iteritems()):
        print '%s occurred %s times' % (k, v)
</code>
My question (from comment in new code):
Since I've split the task into two functions, one to return a
duplication dictionary, the other to print a report based on it, I
think the distinct dup_dict is needed. (I do want the
get_list_dup_dict function to return a dict for possible use in other
contexts.)
The alternative would be to iterate over a .copy() of the items_dict
and delete items not meeting the threshold from items_dict, returning
the pruned items_dict at the end. But dup_dict can never be larger than
that copy, and it is smaller whenever some item falls short of the
threshold (which cannot happen with threshold set to 1). So, the
distinct dup_dict way seems better for memory.
Am I overlooking yet another dict technique that would help, here? Any
other improvements?
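For comparison, a rough sketch of the two approaches described above. It assumes the counting step is pulled out into a helper; count_items, dups_by_pruning and dups_by_new_dict are names made up here, not anything from the thread. The pruning version loops over a snapshot of the keys only, which is enough to make deleting safe without a full .copy(). The other version builds the result in one step by handing dict() a generator expression (available from Py2.4); this is one possible technique, not necessarily the best one. .items() is used so the sketch also runs where .iteritems() is not available.

```python
def count_items(a_list):
    '''Returns a dict mapping each item of a_list to its number of occurrences.'''
    items_dict = {}
    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1
    return items_dict

def dups_by_pruning(a_list, threshold=1):
    '''Deletes the items below threshold from items_dict and returns it.'''
    items_dict = count_items(a_list)
    # loop over a snapshot of the keys: deleting from a dict while
    # iterating over the dict itself raises an error
    for k in list(items_dict.keys()):
        if items_dict[k] < threshold:
            del items_dict[k]
    return items_dict

def dups_by_new_dict(a_list, threshold=1):
    '''Builds a distinct result dict from a generator expression.'''
    items_dict = count_items(a_list)
    return dict((k, v) for k, v in items_dict.items() if v >= threshold)

letters = list('abracadabra')    # made-up sample data
assert dups_by_pruning(letters, 2) == dups_by_new_dict(letters, 2)
```

Either way the caller gets back a plain dict, so the report function would not need to change.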
Thanks and best to all,
Brian vdB