[Tutor] Lists of duplicates

Thu Mar 9 00:22:00 EST 2017

On 03/08/2017 12:56 PM, Sri Kavi wrote:
> As part of my learning, I was participating in a discussion at:
> 
> 
> 
> https://discuss.codecademy.com/t/python-lists-of-duplicated-elements/78151
> 
> 
> 
> It’s about making a function that returns a list of lists, with each list
> being all of the elements that are the same as another element in the
> original list.
> 
> 
> 
> function([1, 2, 3, 4, 1, 1, 2, 3]) should return [[1, 1, 1], [2, 2], [3,
> 3], [4]]
> 
> 
> I wrote and submitted the following function:
> 
> def sublist_duplicates(lst):
>     sublists = []
> 
>     for item in set(lst):
>         sublists.append([item] * lst.count(item))
> 
>     return sublists
> 
> lst_of_duplicates = [1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']
> 
> print sublist_duplicates(lst_of_duplicates)
> 
> A participant told me that that'll grind into a near-halt if there are many
> different elements, for example range(1_000_000) (1 million unique values,
> so it checks the whole list (length 1 million) for the count, a million
> times, a total of 1000 billion operations…
> 
> 
> I thought he was right. I wrote another version of the function, this time
> using collections.Counter.
> 
> from collections import Counter
> 
> def sublist_duplicates(lst):
>     counts = Counter(lst)
>     sublists = [[k] * counts[k] for k in counts]
>     return sublists
> 
> _lst = list(range(1000000))
> 
> import timeit
> 
> time = timeit.timeit("sublist_duplicates(_lst)", setup = "from __main__
> import sublist_duplicates, _lst", number=1)
> 
> print("Number of elements in the list: {}".format(len(_lst)))
> print("Execution time of the function: {} seconds".format(round(time, 2)))
> 
> I found that the first version works better with a small list, but the
> second version outperforms the first one when it’s given a considerably big
> list of data.
> 
> 
> I’m asking for your comments on these two functions. I’m not even sure if
> this is how it should be coded.

Well, my thought was to go for simple. collections.Counter is really
designed for this, so why not?

First decompose problem:
1. count occurrence of elements in a list
2. emit a list of element/count pairs, each pair as a list
   repeating the element "count" times as long as count is
   at least two ("elements that are the same as another element")
3. decide if problem constraints warrant optimization

without getting into 3 (there were none in the problem statement),
here's what came to mind, looks a lot like your second solution:

import random
import collections

# generate some sample data to work on:
nums = [random.randint(1,10) for _ in range(50)]
print("list to process:", nums)

counted = collections.Counter(nums)
print("counts:", counted)

countlists = [[item] * counted[item] for item in counted if
counted[item] > 1]
print("countlists: ", countlists)

When I ran this I got:

list to process: [2, 6, 1, 2, 2, 5, 6, 6, 10, 2, 3, 8, 6, 6, 4, 2, 8, 2,
1, 1, 8, 4, 3, 5, 6, 6, 6, 4, 5, 1, 1, 8, 6, 3, 6, 10, 2, 7, 9, 4, 3, 6,
2, 4, 10, 4, 5, 1, 2, 7]
counts: Counter({6: 11, 2: 9, 1: 6, 4: 6, 3: 4, 5: 4, 8: 4, 10: 3, 7: 2,
9: 1})
countlists:  [[1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2, 2, 2, 2], [3, 3, 3,
3], [4, 4, 4, 4, 4, 4], [5, 5, 5, 5], [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
[7, 7], [8, 8, 8, 8], [10, 10, 10]]

notice value 9 missing from result line, as the count was only 1.